// library(ascii) // setwd("c://aaaWork//web//fishR//bookex//AIFFD//Box3_8") // Asciidoc("Box3_8a.Rnw") AIFFD Box 3-8/3-9 Vignette =========================== :Author: Derek H. Ogle :Email: dogle@northland.edu :Date: 10-June-2010 :Revision: 4 == Required Packages and Setting Options ---- > library(car) # Anova, levene.test, outlier.test > library(pda) # lsmean > library(NCStats) # fit.plot > library(nortest) # ad.test, cvm.test > options(contrasts=c("contr.sum","contr.poly")) ---- == Preparing Data The working directory is set, the link:box3_8.txt[] data file is read, and the structure of the data frame is observed with, ---- > setwd("c://aaaWork//web//fishR//bookex//AIFFD//Box3_8") > d <- read.table("box3_8.txt", header = TRUE) > str(d) 'data.frame': 22 obs. of 4 variables: $ lake : Factor w/ 22 levels "Bass","Black",..: 21 13 16 1 15 20 14 12 19 10 ... $ region : Factor w/ 2 levels "North","South": 1 1 1 1 1 1 1 1 1 1 ... $ bag_limit: int 1 1 1 2 2 2 2 5 5 5 ... $ catch : num 2.21 2.32 2.74 2.23 2.25 1.4 2.36 1.78 1.64 1.97 ... ---- A look at the structure of the data frame shows that +*bag_limit*+ is an "integer" type variable. To properly perform the ANOVA this variable must be converted to a factor variable with, ---- > d$fbag_limit <- factor(d$bag_limit) ---- NOTE: The explanatory variable to be used in an analysis of variance must be a factor variable in R. An apparent quantitative variable can be coereced to a factor with +*[red]#factor()#*+. == ANOVA Results I The one-way ANOVA model can then be fit using +*[red]#lm()#*+ (see link:../preliminaries/LinearModels.html[linear models vignette] for an introduction) and saved into the +*lm1*+ object with, ---- > lm1 <- lm(catch ~ fbag_limit, data = d) ---- The anova table of type-I (sequential) SS is then obtained by submitting the +*lm1*+ object to the +*[red]#anova()#*+ extractor function, ---- > anova(lm1) Analysis of Variance Table Response: catch Df Sum Sq Mean Sq F value Pr(>F) fbag_limit 2 1.6232 0.81159 2.426 0.1153 Residuals 19 6.3561 0.33453 ---- The type-III SS (see discussion about SS in the link:../preliminaries/preliminaries.html[preliminaries vignette]) are computed by submitting the same +*lm1*+ object to +*[red]#Anova()#*+ including the +*[red]#type="III"#*+ argument, ---- > Anova(lm1, type = "III") Anova Table (Type III tests) Response: catch Sum Sq Df F value Pr(>F) (Intercept) 129.364 1 386.701 4.33e-14 fbag_limit 1.623 2 2.426 0.1153 Residuals 6.356 19 ---- == Least-Squares Means The least-squares means (see discussion about least-squares means in the link:../preliminaries/preliminaries.html[preliminaries vignette]) are computed by submitting +*lm1*+ object to +*[red]#lsmean()#*+ from the +*pda*+ package as follows, ---- > lsmean(lm1) fbag_limit pred se 1 1 2.736667 0.2361261 2 2 2.540000 0.2186103 5 5 2.100000 0.1927962 ---- == A Means Plot The +*[red]#fit.plot()#*+ function from the +*NCStats*+ package can be used to visually observe the mean (with default 95% CIs) catch rates of the different bag limit groups. This function only requires the +*lm1*+ object as the first argument. Optional arguments can be used to label the x-axis, y-axis, and main title as illustrated below, ---- > fit.plot(lm1, xlab = "Walleye Bag Limit", ylab = "Walleye Catch Rate", main = "") ---- image::Box3_8a-009.png[] == ANOVA Results II The same model fits summarized with type-II SS (see discussion about SS in the link:../preliminaries/preliminaries.html[preliminaries vignette]) are obtained with, ---- > Anova(lm1, type = "II") Anova Table (Type II tests) Response: catch Sum Sq Df F value Pr(>F) fbag_limit 1.6232 2 2.426 0.1153 Residuals 6.3561 19 ---- The conclusions from these summaries are identical to the conclusions from above. == Testing the Normality Assumption R does not have one function that prints out the four tests of normality as shown in Box 3.9. However, all four tests of normality shown in Box 3.9 can be accomplished with the following functions, * +*[red]#ad.test()#*+: performs Anderson-Darling test. * +*[red]#cvm.test()#*+: performs Cramer-von Mises test. * +*[red]#ks.test()#*+: performs Kolmogorov-Smirnov test. * +*[red]#shapiro.test()#*+: performs Shapiro-Wilk test. The last two functions listed above are available in the base +*stats*+ package distributed with R. However, the first two functions require the +*nortest*+ package. Each of these functions requires the residuals from the linear model fit as the first argument. These residuals can be extracted by appending +*[red]#$residuals#*+ to the linear model object name. For example, ---- > lm1$residuals 1 2 3 4 5 6 7 8 9 10 11 -0.526666667 -0.416666667 0.003333333 -0.310000000 -0.290000000 -1.140000000 -0.180000000 -0.320000000 -0.460000000 -0.130000000 -0.110000000 12 13 14 15 16 17 18 19 20 21 22 -0.036666667 0.893333333 0.083333333 0.550000000 1.090000000 0.280000000 0.100000000 -0.360000000 0.750000000 0.910000000 -0.380000000 ---- With these residuals, the Anderson-Darling, Cramer-von Mises, and Shaprio-Wilk tests can be conducted with, ---- > ad.test(lm1$residuals) Anderson-Darling normality test data: lm1$residuals A = 0.7002, p-value = 0.05793 > cvm.test(lm1$residuals) Cramer-von Mises normality test data: lm1$residuals W = 0.1207, p-value = 0.05387 > shapiro.test(lm1$residuals) Shapiro-Wilk normality test data: lm1$residuals W = 0.9336, p-value = 0.1460 ---- WARNING: The test statistics for the Anderson-Darling and Cramer-von Mises tests are the same in SAS and R but the p-value are slightly different. The +*[red]#ks.test()#*+ function for performing the Kolmogorov-Smirnov test is a more general function that can test whether the residuals follow any possible continuous distribution. To use this function to simply test one set of data for normality you must tell the function, as the second argument to the function, to test against a cumulative normal distribution. In R, the +*[red]#pnorm()#*+: function contains the cumulative normal distribution. Thus, a Kolmogorov-Smirnov test of normality is conducted with, ---- > ks.test(lm1$residuals, "pnorm") One-sample Kolmogorov-Smirnov test data: lm1$residuals D = 0.2538, p-value = 0.09785 alternative hypothesis: two-sided ---- WARNING: The results for the Kolmogorov-Smirnov test are not the same in SAS and R. The Q-Q normal probability plot can be constructed by submitting the linear model residuals to +*[red]#qqnorm()#*+. Personally, I also like to see the histogram of residuals constructed by submittnig the linear model residuals to +*[red]#hist()#*+. The code below will put these two plots side-by-side in one plot for simplicity, ---- > par(mfrow=c(1,2)) # make a one row, two column "grid" for the plots > hist(lm1$residuals) > qqnorm(lm1$residuals) ---- image::Box3_8a-014.png[] == Testing the Equal Variances Assumption The authors of Chapter 3 mention two tests for equal variances without showing how to perform these tests. In R, these tests are performed with, * +*[red]#bartlett.test()#*+: performs Bartlett's test. * +*[red]#levene.test()#*+: performs Levene's test. The +*[red]#bartlett.test()#*+ function is found in the base +*stats*+ package distributed with R. The +*[red]#levene.test()#*+ function is found in both the +*pda*+ and +*car*+ packages. The version in +*car*+ is more flexible and is preferred. Because the +*pda*+ package has also been loaded, you will have to force R to use the +*car*+ version by including +*[red]#car::#*+ before +*[red]#levene.test()#*+. NOTE: If two packages have functions with the exact same name you can force R to use the function in one particular package by typing that package name, two colons, and the name of the function. If you do not force R to use the function from one particular package it will use the function in the package that was most recently loaded. Both of these functions can take the linear model formula as their first arguments. For example (remember we must force R to use the +*car*+ version), ---- > bartlett.test(catch ~ fbag_limit, data = d) Bartlett test of homogeneity of variances data: catch by fbag_limit Bartlett's K-squared = 1.0397, df = 2, p-value = 0.5946 > car::levene.test(catch ~ fbag_limit, data = d) Levene's Test for Homogeneity of Variance Df F value Pr(>F) group 2 0.491 0.6196 19 ---- However, as a matter of simplicity, +*[red]#levene.test()#*+ can also take the linear model object as it's first argument, ---- > car::levene.test(lm1) Levene's Test for Homogeneity of Variance Df F value Pr(>F) group 2 0.491 0.6196 19 ---- The boxplot of the residuals, constructed with +*[red]#residual.plot()#*+ from the +*NCStats*+ package, can be used to produce a general graphic for visualizing the variability among groups. The boxplot is constructed by submitting the linear model object to +*[red]#residual.plot()#*+ as follows, ---- > residual.plot(lm1) ---- image::Box3_8a-017.png[] == Outlier Detection The authors of Chapter 3 mentioned that outliers can be problematic in linear modeling. They mention that individuals with residuals more than three standard deviations from the mean may be considered to be "outliers." A Studentized residual is essentially a residual that has been standardized to approximately follow a t-distribution. Thus, any individual with an absolute value studentized residual greater than three could be considered an outlier with the author's suggestion. Studentized residuals for each individual can be computed in R by submitting the linear model object to +*[red]#studres()#*+ as follows, ---- > studres(lm1) 1 2 3 4 5 6 7 8 9 10 11 -0.997346761 -0.781008515 0.006144826 -0.568511579 -0.531238189 -2.374623760 -0.328156405 -0.576418247 -0.836878964 -0.232385953 -0.196550543 12 13 14 15 16 17 18 19 20 21 22 -0.067601593 1.786918177 0.153721287 1.028680075 2.240563324 0.512647324 0.178649043 -0.650066160 1.410748935 1.758219998 -0.687102631 ---- As a more objective measure the +*car*+ package provides +*[red]#outlier.test()#*+ which will extract the largest residual and provide a p-value that uses the Bonferroni method to correct for inflated Type-I errors due to multiple comparisons. The null hypothesis for this test is that the individual is NOT an outlier. The +*[red]#outlier.test()#*+ function requires the linear model object as such, ---- > outlier.test(lm1) max|rstudent| = 2.374624, degrees of freedom = 18, unadjusted p = 0.02889068, Bonferroni p = 0.6355949 Observation: 6 ---- This result suggests that the most probable outlier is individual six and it is likely not an outlier (i.e., large Bonferroni p-value).