Required Packages and Setting Options

> library(car)      # Anova, levene.test, outlier.test
> library(pda)      # lsmean
> library(NCStats)  # fit.plot
> library(nortest)  # ad.test, cvm.test
> options(contrasts=c("contr.sum","contr.poly"))

Preparing Data

The working directory is set, the box3_8.txt data file is read, and the structure of the data frame is observed with,

> setwd("c://aaaWork//web//fishR//bookex//AIFFD//Box3_8")
> d <- read.table("box3_8.txt", header = TRUE)
> str(d)
'data.frame':   22 obs. of  4 variables:
 $ lake     : Factor w/ 22 levels "Bass","Black",..: 21 13 16 1 15 20 14 12 19 10 ...
 $ region   : Factor w/ 2 levels "North","South": 1 1 1 1 1 1 1 1 1 1 ...
 $ bag_limit: int  1 1 1 2 2 2 2 5 5 5 ...
 $ catch    : num  2.21 2.32 2.74 2.23 2.25 1.4 2.36 1.78 1.64 1.97 ...

A look at the structure of the data frame shows that bag_limit is an "integer" type variable. To properly perform the ANOVA this variable must be converted to a factor variable with,

> d$fbag_limit <- factor(d$bag_limit)

ANOVA Results I

The one-way ANOVA can then be fit using lm() (for "linear model") with,

> lm1 <- lm(catch ~ fbag_limit, data = d)

and the anova table of type-I (sequential) SS is obtained with,

> anova(lm1)
Analysis of Variance Table

Response: catch
           Df Sum Sq Mean Sq F value Pr(>F)
fbag_limit  2 1.6232  0.8116   2.426 0.1153
Residuals  19 6.3561  0.3345

The type-III SS are computed with (see discussion about SS in the preliminaries vignette),

> Anova(lm1, type = "III")
Anova Table (Type III tests)

Response: catch
             Sum Sq Df F value   Pr(>F)
(Intercept) 129.364  1 386.701 4.33e-14
fbag_limit    1.623  2   2.426   0.1153
Residuals     6.356 19

Least-Squares Means

The least-squares means are computed with lsmean() from the pda package (see discussion about least-squares means in the preliminaries vignette) as follows,

> lsmean(lm1)
  fbag_limit     pred        se
1          1 2.736667 0.2361261
2          2 2.540000 0.2186103
5          5 2.100000 0.1927962

A Means Plot

The fit.plot() function from the NCStats package can be used to visually observe the mean (with default 95% CIs) catch rates of the different bag limit groups.

> fit.plot(lm1, xlab = "Walleye Bag Limit", ylab = "Walleye Catch Rate", main = "")
Box3_8a-009.png

ANOVA Results II

The same model fits summarized with type-II SS (see discussion about SS in the preliminaries vignette)) are obtained with,

> Anova(lm1, type = "II")
Anova Table (Type II tests)

Response: catch
           Sum Sq Df F value Pr(>F)
fbag_limit 1.6232  2   2.426 0.1153
Residuals  6.3561 19

The conclusions from these summaries are identical to the conclusions from above.

Testing the Normality Assumption

R does not have one function that prints out the four tests of normality as shown in Box 3.9. However, all four tests of normality shown in Box 3.9 can be accomplished with the following functions,

The last two functions listed above are available in the base stats package distributed with R. However, the first two functions require the nortest package.

Each of these functions requires the residuals from the linear model fit as the first argument. These residuals can be extracted by appending $residuals to the linear model object name. For example,

> lm1$residuals
           1            2            3            4            5            6            7            8            9           10           11
-0.526666667 -0.416666667  0.003333333 -0.310000000 -0.290000000 -1.140000000 -0.180000000 -0.320000000 -0.460000000 -0.130000000 -0.110000000
          12           13           14           15           16           17           18           19           20           21           22
-0.036666667  0.893333333  0.083333333  0.550000000  1.090000000  0.280000000  0.100000000 -0.360000000  0.750000000  0.910000000 -0.380000000

With these residuals, the Anderson-Darling, Cramer-von Mises, and Shaprio-Wilk tests can be conducted with,

> ad.test(lm1$residuals)
        Anderson-Darling normality test

data:  lm1$residuals
A = 0.7002, p-value = 0.05793
> cvm.test(lm1$residuals)
        Cramer-von Mises normality test

data:  lm1$residuals
W = 0.1207, p-value = 0.05387
> shapiro.test(lm1$residuals)
        Shapiro-Wilk normality test

data:  lm1$residuals
W = 0.9336, p-value = 0.1460
Warning The test statistics for the Anderson-Darling and Cramer-von Mises tests are the same in SAS and R but the p-value are slightly different.

The ks.test() function for performing the Kolmogorov-Smirnov test is a more general function that can test whether the residuals follow any possible continuous distribution. To use this function to simply test one set of data for normality you must tell the function, as the second argument to the function, to test against a cumulative normal distribution. In R, the pnorm(): function contains the cumulative normal distribution. Thus, a Kolmogorov-Smirnov test of normality is conducted with,

> ks.test(lm1$residuals, "pnorm")
        One-sample Kolmogorov-Smirnov test

data:  lm1$residuals
D = 0.2538, p-value = 0.09785
alternative hypothesis: two-sided
Warning The results for the Kolmogorov-Smirnov test are not the same in SAS and R.

The Q-Q normal probability plot can be constructed by submitting the linear model residuals to qqnorm(). Personally, I also like to see the histogram of residuals constructed by submittnig the linear model residuals to hist(). The code below will put these two plots side-by-side in one plot for simplicity,

> par(mfrow=c(1,2))      # make a one row, two column "grid" for the plots
> hist(lm1$residuals)
> qqnorm(lm1$residuals)
Box3_8a-014.png

Testing the Equal Variances Assumption

The authors of Chapter 3 mention two tests for equal variances without showing how to perform these tests. In R, these tests are performed with,

The bartlett.test() function is found in the base stats package distributed with R. The levene.test() function is found in both the pda and car packages. The version in car is more flexible and is preferred. Because the pda package has also been loaded, you will have to force R to use the car version by including car:: before levene.test().

Note If two packages have functions with the exact same name you can force R to use the function in one particular package by typing that package name, two colons, and the name of the function. If you do not force R to use the function from one particular package it will use the function in the package that was most recently loaded.

Both of these functions can take the linear model formula as their first arguments. For example (remember we must force R to use the car version),

> bartlett.test(catch ~ fbag_limit, data = d)
        Bartlett test of homogeneity of variances

data:  catch by fbag_limit
Bartlett's K-squared = 1.0397, df = 2, p-value = 0.5946
> car::levene.test(catch ~ fbag_limit, data = d)
Levene's Test for Homogeneity of Variance
      Df F value Pr(>F)
group  2   0.491 0.6196
      19

However, as a matter of simplicity, levene.test() can also take the linear model object as it’s first argument,

> car::levene.test(lm1)
Levene's Test for Homogeneity of Variance
      Df F value Pr(>F)
group  2   0.491 0.6196
      19

The boxplot of the residuals, constructed with residual.plot() from the NCStats package, can be used to produce a general graphic for visualizing the variability among groups. The boxplot is constructed by submitting the linear model object to residual.plot() as follows,

> residual.plot(lm1)
Box3_8a-017.png

Outlier Detection

The authors of Chapter 3 mentioned that outliers can be problematic in linear modeling. They mention that individuals with residuals more than three standard deviations from the mean may be considered to be "outliers." A Studentized residual is essentially a residual that has been standardized to approximately follow a t-distribution. Thus, any individual with an absolute value studentized residual greater than three could be considered an outlier with the author’s suggestion. Studentized residuals for each individual can be computed in R by submitting the linear model object to studres() as follows,

> studres(lm1)
           1            2            3            4            5            6            7            8            9           10           11
-0.997346761 -0.781008515  0.006144826 -0.568511579 -0.531238189 -2.374623760 -0.328156405 -0.576418247 -0.836878964 -0.232385953 -0.196550543
          12           13           14           15           16           17           18           19           20           21           22
-0.067601593  1.786918177  0.153721287  1.028680075  2.240563324  0.512647324  0.178649043 -0.650066160  1.410748935  1.758219998 -0.687102631

As a more objective measure the car package provides outlier.test() which will extract the largest residual and provide a p-value that uses the Bonferroni method to correct for inflated Type-I errors due to multiple comparisons. The null hypothesis for this test is that the individual is NOT an outlier. The outlier.test() function requires the linear model object as such,

> outlier.test(lm1)
max|rstudent| = 2.374624, degrees of freedom = 18,
unadjusted p = 0.02889068, Bonferroni p = 0.6355949

Observation: 6

This result suggests that the most probable outlier is individual six and it is likely not an outlier (i.e., large Bonferroni p-value).