// library(ascii)
// setwd("c://aaaWork//web//fishR//bookex//AIFFD//Preliminaries")
// Sweave("Preliminaries.Rnw",driver=RweaveAsciidoc)

AIFFD Preliminaries
===================
:Author: Derek H. Ogle
:Email: dogle@northland.edu
:Date: 18-June-2009
:Revision: 1

As the authors of AIFFD noted in the preface, they had to choose a common software for the book; they chose SAS, and any choice they made would have had some limitations. One of the goals of this portion of the fishR website is to attempt to perform all boxed example analyses in the book with R rather than SAS. R and SAS are completely different "animals" that can basically perform the same analyses. Converting from SAS to R will present some difficulties because of these differences. The items identified in the sections below are differences or difficulties that are more general in nature and will, therefore, permeate through all analyses. I have chosen to pull them together in this one place to keep the further analyses streamlined. Please consider these items carefully before continuing with the other analyses.

== Learning R

Most of the AIFFD R examples on the fishR website assume that you have a working familiarity with R, just as the book assumes you have a working familiarity with SAS. Thus, the examples demonstrate how to perform specific analyses in R but they don't necessarily demonstrate basic functioning of R. Therefore, you may have to familiarize yourself with R with some of the following resources:

- link:http://cran.r-project.org/doc/manuals/R-intro.html[An Introduction to R]
- Dalgaard's link:http://www.biostat.ku.dk/~pd/ISwR.html[Introductory Statistics with R.] Springer, 2nd edition.
- Verzani's link:http://wiener.math.csi.cuny.edu/UsingR/[Using R for Introductory Statistics.] Chapman & Hall/CRC
- Crawley's link:http://www.bio.ic.ac.uk/research/crawley/statistics/[Statistics: An Introduction using R.] Wiley, 2005.

Some basic information about using R may be gleaned from the vignettes on the link:../../../gnrlex[General Examples] tab.

== Reading Data into R

In most of the boxed examples the data are entered directly into SAS with the +*CARDS*+ statement. In the R versions of the boxed examples I have chosen to enter these data into external text files that can then be read into R with the +*[red]#read.table()#*+ function. All of these data files are available via a link on the main link:../AIFFD.html[AIFFD examples] page.

The most efficient way to read these data into R is to first set the R working directory to where you saved the data file with the +*[red]#setwd()#*+ function. The +*[red]#setwd()#*+ function requires one argument, which is the drive letter followed by the path to the folder containing the data file. This drive and path must be contained in double-quotes, and the separators between the drive and the path and between all folders in the path must be forward slashes (single or, as in these examples, doubled) or doubled backslashes. A single backslash will not work because R treats the backslash as an escape character inside quotes.

IMPORTANT: The argument in +*[red]#setwd()#*+ MUST be contained in quotes and MUST NOT use single backslashes as separators; the examples here use two forward slashes.

For example, I would change the R working directory to where I saved the data file for Box5.6 with,

----
> setwd("c://aaaWork//web//fishR//bookex//AIFFD//Box5_6")
----

It MUST be noted that this is the drive and path to the data file on MY computer. This will not work on YOUR computer unless your directory structure is exactly the same as mine (which is unlikely). Thus, if you are attempting to recreate the R box analyses you will have to make sure NOT to copy this command exactly; rather, you will have to change it to exactly where you saved the data file.

NOTE: The directory structure used above is where the file exists on MY computer; you must change to the directory structure where the file exists on YOUR computer.

You should also note that all of the data files have been saved such that the first row contains names for the variables.
In the R language it is said that these files have a +*[red]#header#*+. The default behavior for the +*[red]#read.table()#*+ function is to assume that there is not a header; thus, you must make sure to tell +*[red]#read.table()#*+ that the data file has a header. Thus, every time you use +*[red]#read.table()#*+ with examples that I have authored you will need to include the filename (in quotes) as the first argument and +*[red]#header=TRUE#*+ as the second argument. For example, after setting the working directory above, I would read the data file for Box 5.6 (i.e., link:box5_6.txt[]) with,

----
> d <- read.table("box5_6.txt", header = TRUE)
----

A more thorough description of, and alternative methods for, reading data into R link:../../../gnrlex/DataEntry/DataEntry.pdf[can be found on the General Examples tab].

== Required Packages

One of the major advantages of R is that the base R can be easily extended with contributed packages. Packages are sets of routines written by anybody in the world that either modify or, more likely, extend the abilities of base R. Many such packages are stored in the Comprehensive R Archive Network (CRAN), others are stored on "Forge" sites (e.g., link:https://r-forge.r-project.org/[r-forge.r-project], link:http://rforge.net/[rforge.net], or link:http://sourceforge.net/[sourceforge]), and still others are stored on personal webpages.

To use a particular package it must be installed on your computer. Installing packages from CRAN is fairly easy using built-in menu options in the R GUI (e.g., on Windows go to the __Packages__ menu and then __Install package(s)...__ sub-menu, choose a mirror site, and then choose the package(s) you want to install). Most packages stored on a "Forge" site will have a small set of commands or a script on the specific "Forge" site that will install the package when copied into R.
Packages on personal websites will likely have to be downloaded from that website and installed via a menu item in the R GUI (e.g., on Windows go to the __Packages__ menu and then __Install package(s) from local zip file(s)...__ sub-menu item and then browse to where you downloaded the ZIP file) or copied manually into the __library__ folder in the R folder. A package only has to be installed once on a computer, except that it will have to be re-installed when R is upgraded.

Once a package is installed on a computer then R must be told to load that package for use. A package must be loaded every time you open R and want to use the package, but it only needs to be loaded once per R session. A package is loaded by including the package name (without quotes) as the single argument to the +*[red]#library()#*+ function. For example, the +*[red]#car#*+ package is loaded with,

----
> library(car)
----

IMPORTANT: Packages must be loaded in every R session in which you want to use functions from that package.

In the R analysis box examples, all required packages will be loaded at the beginning of the example.

== Philosophical Difference between R and SAS (just one!)

A critical philosophical difference between R and SAS is that SAS will generally print out a great deal of information from every PROC that is run. R, on the other hand, prints out very little. The philosophy of R is that you need to explicitly ask for the results, whereas the philosophy of SAS is to print out everything it knows and let you sort through to find what you want.

The philosophy of R is apparent in the two main types of functions -- constructor and extractor functions. A constructor function is a function that performs some task and likely stores a great deal of information in a seemingly invisible way. The results of a constructor function are usually saved into an R object. An extractor function is a function that acts on an R object to extract specific pieces of information from it.
The typical analysis then is to use a constructor function to perform some analysis and then to use an extractor function to extract just the information that you require.

For example, suppose that you have two variables -- Y and X -- for which you want to construct a linear regression model to predict Y from X. The +*[red]#lm()#*+ function is a constructor function that performs the linear regression (this function will be described in more detail in subsequent box vignettes). In the example below the results are stored into an object called +*[red]#lm1#*+ but you do not see any results,

----
> lm1 <- lm(Y ~ X)
----

If you would like to see the coefficients from the regression fit you can submit the +*[red]#lm1#*+ object to the +*[red]#coef()#*+ extractor function,

----
> coef(lm1)
(Intercept)           X 
  26.006740    3.093977 
----

If you want to see the ANOVA table then submit +*[red]#lm1#*+ to the +*[red]#anova()#*+ extractor function,

----
> anova(lm1)
Analysis of Variance Table

Response: Y
          Df Sum Sq Mean Sq F value    Pr(>F)
X          1 330.43  330.43  266.68 < 2.2e-16
Residuals 48  59.48    1.24
----

So, the point of this is that you will have to change your attitude about getting results from R as compared to SAS. SAS prints everything you might ever possibly want (and a lot that you probably don't understand) whereas R will only print results (and very specific results) when you ask for them.

== Types (I,II,III) of Sums-of-Squares

One of the major arguments between proponents of R and proponents of SAS is over the definitions and uses of the variety of types of sums-of-squares (SS). SAS defines four types of SS, two of which are used throughout the AIFFD book -- Type-I and Type-III. I will attempt to describe each of these SS below and briefly describe how these will be handled in R.

SAS' type-I SS are sometimes called __sequential__ or __variables-in-order__ SS because they are the SS explained by adding a variable to the model AFTER all previous variables have been added and BEFORE adding any subsequent variables. For example, consider a situation where there are three factor variables (A,B,C) and all of the two- and three-way interactions between these variables. In general, this model can be written as,

  Y = A + B + C + A:B + A:C + B:C + A:B:C

The type-I SS for A:B, for example, would be the SS explained by A:B AFTER the main A, B, and C effects have been considered but withOUT considering the effects of A:C, B:C, or A:B:C. The type-I SS for B:C would be the SS explained by B:C AFTER all of the other terms except for A:B:C have been considered. In other words, the type-I SS are the SS explained by that term in the order that that term was entered into the model. This, of course, means that the order in which terms are entered into the model is important.

SAS' type-III SS are sometimes called __marginal__ or __variables-added-last__ SS because they are the SS explained by adding a variable to the model AFTER ALL other variables have been included in the model, regardless of the order in which the variables were entered into the model. Using the same example as before, the type-III SS for A:B would be the SS explained by A:B AFTER ALL other variables have been considered.

If the experiment has equal numbers of experimental units in each cell defined by each combination of the factors (a so-called balanced design) then the type-I and type-III SS will be identical. However, if the design is unbalanced, some cells are missing observations, or a quantitative explanatory variable is included in the model (i.e., ANCOVA) then the type-I and type-III SS will not be equivalent.

The type-I SS are computed in R by submitting a linear model or analysis of variance object to the +*[red]#anova()#*+ extractor function.
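
The order dependence of type-I SS can be seen with a small sketch. The data below are simulated and purely hypothetical (the objects +*[red]#d#*+, +*[red]#aovAB#*+, and +*[red]#aovBA#*+ and the variables A, B, and Y are made up for illustration and do not come from any AIFFD box); the cell counts are deliberately unequal so that the design is unbalanced. Only base R functions are used.

----
# Hypothetical unbalanced two-factor data (made up for illustration only)
set.seed(1)
d <- data.frame(
  A = factor(rep(c("a1", "a2"), times = c(8, 4))),
  B = factor(c(rep("b1", 6), rep("b2", 2), "b1", rep("b2", 3)))
)
d$Y <- 10 + 2 * (d$A == "a2") + 3 * (d$B == "b2") + rnorm(nrow(d))

# Same model, but with the terms entered in a different order
aovAB <- anova(lm(Y ~ A + B, data = d))
aovBA <- anova(lm(Y ~ B + A, data = d))

aovAB   # SS for A ignores B; SS for B is computed after A
aovBA   # SS for B ignores A; SS for A is computed after B
----

In this sketch the residual SS should be the same in both tables, whereas the SS attributed to A (and to B) depends on whether the term was entered first or second.
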
The type-III SS as defined by SAS are a bit more difficult to extract in R. The +*car*+ package has the +*[red]#Anova()#*+ function (note the capital "A"), which has a +*[red]#type=#*+ argument that can be set equal to +*[red]#"III"#*+. However, under the default contrast structure of R (i.e., +*[red]#contr.treatment#*+), this function will NOT return the SAS type-III SS. If the contrast structure is changed (to either +*[red]#contr.sum#*+ or +*[red]#contr.helmert#*+) then +*[red]#Anova()#*+ with the +*[red]#type="III"#*+ argument will produce SAS type-III SS, but ONLY for the situation where the explanatory variables are all factors. In situations where at least one of the explanatory variables is quantitative (e.g., ANCOVA) then the type-III SS of SAS cannot be computed in R. Because changing the contrast structure does not affect the type-I or type-II (see below) SS, it is generally a good idea to change the contrast structure early in your R script. Thus, in many of the AIFFD examples, you will see this command at the beginning of the vignette,

----
> options(contrasts = c("contr.sum", "contr.poly"))
----

WARNING: In the default settings, the SS computed when using the +*[red]#type="III"#*+ argument in the +*[red]#Anova()#*+ function from the +*car*+ package does NOT produce the same type-III SS as defined by SAS. The contrast structure must be changed to either +*[red]#contr.sum#*+ or +*[red]#contr.helmert#*+ to match SAS results.

WARNING: The type-III SS produced by the +*[red]#Anova()#*+ function from the +*car*+ package using a proper contrast structure will match SAS results only if all explanatory variables are factors.

Fox suggested that any default SS and hypothesis testing (in contrast to very carefully defined hypothesis tests, model fitting, and SS calculations) should follow the principle-of-marginality where one ignores all higher-order "relatives" when testing lower-order terms.
For example, in the three-factor example above, one would ignore A:B, A:C, and A:B:C when assessing the significance of A. The SS that follow this principle are the type-II SS. In other words, the type-II SS for A would be the SS explained by A AFTER considering B, C, and B:C but withOUT considering A:B, A:C, and A:B:C. Similarly, the type-II SS for A:B would be the SS explained by A:B AFTER considering A, B, C, A:C, and B:C but not A:B:C (i.e., after considering all terms that do NOT contain A:B as part of the term).

Further discussion of these different types of SS can be found at

- link:http://books.google.com/books?id=xWS8kgRjGcAC&pg=PA132&lpg=PA132&dq=%22Sums-of-squares%22+types+explained+Fox&source=bl&ots=o4qzmGGpW8&sig=gdT4Us2pg3qdN7HdLPEDzqR7Iuk&hl=en&ei=AkQ6SqWQEZKOsgPt1pn-Bg&sa=X&oi=book_result&ct=result&resnum=4#PPA140,M1[Fox's An R and S-Plus companion to applied regression]
- link:http://books.google.com/books?id=JhxKC8ewcdAC&pg=PA182&lpg=PA182&dq=%22Sums-of-squares%22+types+explained&source=bl&ots=SJkaNzNzeK&sig=8eiDwqtG-OITe5be5ErUgxvrkQc&hl=en&ei=lkI6St-IL4yoswPppvj9Bg&sa=X&oi=book_result&ct=result&resnum=5#PPA182,M1[Littell _et al._'s SAS for Linear Models]
- link:http://www.stat.wisc.edu/~yandell/software/sas/linmod.html#ss[Yandell's description here]
- link:http://www1.umn.edu/statsoft/doc/statnotes/stat05.txt[These notes from the University of Minnesota]

NOTE: In ANCOVA situations I will use the type-II SS from the +*[red]#Anova()#*+ function in the +*car*+ package despite the fact that the R results will not perfectly match the SAS results shown in the AIFFD book. In addition, for all other linear model analyses I will show the type-II results, in addition to what was illustrated in the book (type-I or type-III).

== Least-Squares Means

The second major argument between proponents of R and proponents of SAS is over the use of so-called "least-squares means."
The terminology "least-squares means" is largely a SAS construct and near synonyms include "adjusted means", "marginal means", or "estimated marginal means." In general, a "least-squares mean" is the mean for a group after having controlled for other variables -- i.e., other factors or quantitative covariates. The most common "least-squares mean" is the calculation of adjusted group means after holding a quantitative covariate at a typical value (say the mean) in an ANCOVA.

Users of R argue for a more general approach that has been implemented in Fox's +*effects*+ package. However, Yandell has provided a +*[red]#lsmeans()#*+ function in his +*pda*+ package that can efficiently reproduce the SAS least-squares means used in the AIFFD examples. In the initial versions of the AIFFD boxed vignettes I will use Yandell's function rather than the functions in +*effects*+. Yandell's +*pda*+ package is available from his link:http://www.stat.wisc.edu/~yandell/pda/[website] or from an link:https://r-forge.r-project.org/R/?group_id=49[R-Forge site].

More resources on least-squares means can be found at

- link:http://onbiostatistics.blogspot.com/2009/04/least-squares-means-marginal-means-vs.html[On Biostatistics and Clinical Trial blog]
- link:http://support.sas.com/onlinedoc/913/getDoc/en/statug.hlp/glm_sect34.htm[SAS support]
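
The idea of an adjusted group mean in an ANCOVA, as described in this section, can be sketched with base R alone by predicting the response for each group with the covariate held at its overall mean. This is only an illustration of the concept; it is not Yandell's +*[red]#lsmeans()#*+ function or the +*effects*+ package. The data are simulated, and all object and variable names (+*[red]#d#*+, +*[red]#grp#*+, +*[red]#X#*+, +*[red]#Y#*+, +*[red]#fit#*+, +*[red]#nd#*+, +*[red]#adjMeans#*+) are hypothetical.

----
# Hypothetical ANCOVA data (made up): response Y, factor grp, covariate X
set.seed(2)
d <- data.frame(
  grp = factor(rep(c("g1", "g2", "g3"), each = 10)),
  X = runif(30, min = 5, max = 15)
)
d$Y <- 2 + 1.5 * d$X + c(0, 1, 3)[as.integer(d$grp)] + rnorm(30)

# Parallel-slopes ANCOVA: a common slope for X and a separate level per group
fit <- lm(Y ~ X + grp, data = d)

# One row per group, with the covariate fixed at its overall mean
nd <- data.frame(
  grp = factor(levels(d$grp), levels = levels(d$grp)),
  X = mean(d$X)
)

# Adjusted means: predicted response for each group at the mean of X
adjMeans <- predict(fit, newdata = nd)
names(adjMeans) <- levels(d$grp)
adjMeans

# Raw (unadjusted) group means, for comparison
tapply(d$Y, d$grp, mean)
----

The adjusted and raw means differ to the extent that the groups differ in their average value of the covariate.
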