> library(car) # Anova, levene.test, outlier.test
> library(pda) # lsmean
> library(NCStats) # fit.plot
> library(nortest) # ad.test, cvm.test
> options(contrasts=c("contr.sum","contr.poly"))
> library(car) # Anova, levene.test, outlier.test
> library(pda) # lsmean
> library(NCStats) # fit.plot
> library(nortest) # ad.test, cvm.test
> options(contrasts=c("contr.sum","contr.poly"))
The working directory is set, the box3_8.txt data file is read, and the structure of the data frame is observed with,
> setwd("c://aaaWork//web//fishR//bookex//AIFFD//Box3_8")
> d <- read.table("box3_8.txt", header = TRUE)
> str(d)
'data.frame': 22 obs. of 4 variables:
$ lake : Factor w/ 22 levels "Bass","Black",..: 21 13 16 1 15 20 14 12 19 10 ...
$ region : Factor w/ 2 levels "North","South": 1 1 1 1 1 1 1 1 1 1 ...
$ bag_limit: int 1 1 1 2 2 2 2 5 5 5 ...
$ catch : num 2.21 2.32 2.74 2.23 2.25 1.4 2.36 1.78 1.64 1.97 ...
A look at the structure of the data frame shows that bag_limit is an "integer" type variable. To properly perform the ANOVA this variable must be converted to a factor variable with,
> d$fbag_limit <- factor(d$bag_limit)
The one-way ANOVA can then be fit using lm() (for "linear model") with,
> lm1 <- lm(catch ~ fbag_limit, data = d)
and the anova table of type-I (sequential) SS is obtained with,
> anova(lm1)
Analysis of Variance Table
Response: catch
Df Sum Sq Mean Sq F value Pr(>F)
fbag_limit 2 1.6232 0.8116 2.426 0.1153
Residuals 19 6.3561 0.3345
The type-III SS are computed with (see discussion about SS in the preliminaries vignette),
> Anova(lm1, type = "III")
Anova Table (Type III tests)
Response: catch
Sum Sq Df F value Pr(>F)
(Intercept) 129.364 1 386.701 4.33e-14
fbag_limit 1.623 2 2.426 0.1153
Residuals 6.356 19
The least-squares means are computed with lsmean() from the pda package (see discussion about least-squares means in the preliminaries vignette) as follows,
> lsmean(lm1) fbag_limit pred se 1 1 2.736667 0.2361261 2 2 2.540000 0.2186103 5 5 2.100000 0.1927962
The fit.plot() function from the NCStats package can be used to visually observe the mean (with default 95% CIs) catch rates of the different bag limit groups.
> fit.plot(lm1, xlab = "Walleye Bag Limit", ylab = "Walleye Catch Rate", main = "")
The same model fits summarized with type-II SS (see discussion about SS in the preliminaries vignette)) are obtained with,
> Anova(lm1, type = "II")
Anova Table (Type II tests)
Response: catch
Sum Sq Df F value Pr(>F)
fbag_limit 1.6232 2 2.426 0.1153
Residuals 6.3561 19
The conclusions from these summaries are identical to the conclusions from above.
R does not have one function that prints out the four tests of normality as shown in Box 3.9. However, all four tests of normality shown in Box 3.9 can be accomplished with the following functions,
ad.test(): performs Anderson-Darling test.
cvm.test(): performs Cramer-von Mises test.
ks.test(): performs Kolmogorov-Smirnov test.
shapiro.test(): performs Shapiro-Wilk test.
The last two functions listed above are available in the base stats package distributed with R. However, the first two functions require the nortest package.
Each of these functions requires the residuals from the linear model fit as the first argument. These residuals can be extracted by appending $residuals to the linear model object name. For example,
> lm1$residuals
1 2 3 4 5 6 7 8 9 10 11
-0.526666667 -0.416666667 0.003333333 -0.310000000 -0.290000000 -1.140000000 -0.180000000 -0.320000000 -0.460000000 -0.130000000 -0.110000000
12 13 14 15 16 17 18 19 20 21 22
-0.036666667 0.893333333 0.083333333 0.550000000 1.090000000 0.280000000 0.100000000 -0.360000000 0.750000000 0.910000000 -0.380000000
With these residuals, the Anderson-Darling, Cramer-von Mises, and Shaprio-Wilk tests can be conducted with,
> ad.test(lm1$residuals)
Anderson-Darling normality test
data: lm1$residuals
A = 0.7002, p-value = 0.05793
> cvm.test(lm1$residuals)
Cramer-von Mises normality test
data: lm1$residuals
W = 0.1207, p-value = 0.05387
> shapiro.test(lm1$residuals)
Shapiro-Wilk normality test
data: lm1$residuals
W = 0.9336, p-value = 0.1460
|
The test statistics for the Anderson-Darling and Cramer-von Mises tests are the same in SAS and R but the p-value are slightly different. |
The ks.test() function for performing the Kolmogorov-Smirnov test is a more general function that can test whether the residuals follow any possible continuous distribution. To use this function to simply test one set of data for normality you must tell the function, as the second argument to the function, to test against a cumulative normal distribution. In R, the pnorm(): function contains the cumulative normal distribution. Thus, a Kolmogorov-Smirnov test of normality is conducted with,
> ks.test(lm1$residuals, "pnorm")
One-sample Kolmogorov-Smirnov test
data: lm1$residuals
D = 0.2538, p-value = 0.09785
alternative hypothesis: two-sided
|
The results for the Kolmogorov-Smirnov test are not the same in SAS and R. |
The Q-Q normal probability plot can be constructed by submitting the linear model residuals to qqnorm(). Personally, I also like to see the histogram of residuals constructed by submittnig the linear model residuals to hist(). The code below will put these two plots side-by-side in one plot for simplicity,
> par(mfrow=c(1,2)) # make a one row, two column "grid" for the plots > hist(lm1$residuals) > qqnorm(lm1$residuals)
The authors of Chapter 3 mention two tests for equal variances without showing how to perform these tests. In R, these tests are performed with,
bartlett.test(): performs Bartlett’s test.
levene.test(): performs Levene’s test.
The bartlett.test() function is found in the base stats package distributed with R. The levene.test() function is found in both the pda and car packages. The version in car is more flexible and is preferred. Because the pda package has also been loaded, you will have to force R to use the car version by including car:: before levene.test().
|
If two packages have functions with the exact same name you can force R to use the function in one particular package by typing that package name, two colons, and the name of the function. If you do not force R to use the function from one particular package it will use the function in the package that was most recently loaded. |
Both of these functions can take the linear model formula as their first arguments. For example (remember we must force R to use the car version),
> bartlett.test(catch ~ fbag_limit, data = d)
Bartlett test of homogeneity of variances
data: catch by fbag_limit
Bartlett's K-squared = 1.0397, df = 2, p-value = 0.5946
> car::levene.test(catch ~ fbag_limit, data = d)
Levene's Test for Homogeneity of Variance
Df F value Pr(>F)
group 2 0.491 0.6196
19
However, as a matter of simplicity, levene.test() can also take the linear model object as it’s first argument,
> car::levene.test(lm1)
Levene's Test for Homogeneity of Variance
Df F value Pr(>F)
group 2 0.491 0.6196
19
The boxplot of the residuals, constructed with residual.plot() from the NCStats package, can be used to produce a general graphic for visualizing the variability among groups. The boxplot is constructed by submitting the linear model object to residual.plot() as follows,
> residual.plot(lm1)
The authors of Chapter 3 mentioned that outliers can be problematic in linear modeling. They mention that individuals with residuals more than three standard deviations from the mean may be considered to be "outliers." A Studentized residual is essentially a residual that has been standardized to approximately follow a t-distribution. Thus, any individual with an absolute value studentized residual greater than three could be considered an outlier with the author’s suggestion. Studentized residuals for each individual can be computed in R by submitting the linear model object to studres() as follows,
> studres(lm1)
1 2 3 4 5 6 7 8 9 10 11
-0.997346761 -0.781008515 0.006144826 -0.568511579 -0.531238189 -2.374623760 -0.328156405 -0.576418247 -0.836878964 -0.232385953 -0.196550543
12 13 14 15 16 17 18 19 20 21 22
-0.067601593 1.786918177 0.153721287 1.028680075 2.240563324 0.512647324 0.178649043 -0.650066160 1.410748935 1.758219998 -0.687102631
As a more objective measure the car package provides outlier.test() which will extract the largest residual and provide a p-value that uses the Bonferroni method to correct for inflated Type-I errors due to multiple comparisons. The null hypothesis for this test is that the individual is NOT an outlier. The outlier.test() function requires the linear model object as such,
> outlier.test(lm1) max|rstudent| = 2.374624, degrees of freedom = 18, unadjusted p = 0.02889068, Bonferroni p = 0.6355949 Observation: 6
This result suggests that the most probable outlier is individual six and it is likely not an outlier (i.e., large Bonferroni p-value).