Skip to contents

Abstract

The R package visStatistics provides means to quickly visualise and analyse raw data by selecting, based on a decision tree, the statistical hypothesis test with the highest statistical power between the dependent variable (response) named varsample and the independent variable (feature) named varfactor in a data.frame named dataframe. A minimal function call has the structure:

visstat(dataframe,varsample,varfactor)

The data in the provided dataframe must be structured by column, where varsample and varfactor are the character strings of the column names of the dependent (response) and independent (feature) variables, respectively.

The choice of statistical tests performed by the core function visstat() depends on wether the data are numerical or categorical, the number of levels of categorical data and the data distributions (normal versus non-normal).

Data of class "numeric" or "integer" will be referred in the reminder of this vignette as “numerical”, data of class "factor" will be referred in the remainder of this vignette as “categorical”.

The function returns the corresponding test statistics, including any post-hoc-analysis, and generates a graph showing the key statistics of the underlying test.

visStatistics provides a fully automated workflow. It has been successfully used for the unbiased analysis of raw medical data.

The remainder of this vignette focuses on the the algorithm underlying the decision tree of vistat(), while a call to ?visstat documents all parameter settings of the function.

All implemented statistical tests are called with their default parameter sets, except for conf.level, which can be adjusted in the visstat() function call. As a consequence, no paired tests are available.

Comparing central tendencies

If the feature consists of data of class "factor" with two or more levels and the response consists of data of class "numeric" or "integer" (both of mode "numeric"), tests are applied to compare the central tendencies.

Two sample tests: Welch’s t-test or Wilcoxon rank sum test

If the feature has exactly two levels,Welch’s t-test or its non-parametric alternative, the Wilcoxon rank sum test, is performed. Welch’s t-test tests the null hypothesis that the two samples have the same mean. While Student’s t-test requires homoscedasticity, Welch’s t-test does not. It loses little robustness compared to Student’s t-test when assumptions of Student’s t-test are met (Moser and Stevens 1992; Delacre:2017iv?). Therefore the implementation of Studen’t-test has been omitted. The null hypothesis of the Wilcoxon rank sum test is that both samples (levels) are drawn from the same population or have identical distributions.

The test choice follows the algorithm below:

  • If the sample size for both levels is greater than 30, always perform Welch’s t-test (t.test()) (@ Rasch, Kubinger, and Moder 2011; Lumley et al. 2002).

  • If the sample size of at least one of the levels is smaller than 30, first check for normality of both levels with the Shapiro-Wilk normality test (shapiro.test()):

  • If the p-values of the shapiro.test() of both levels are greater than the error probability \(\alpha = 1-\)conf.level, perform the Welchs’ t-tes (t.test()).

  • If the p-value of at least one of the levels in the shapiro.test() is smaller than the error probability \(\alpha=1-\)conf.level, a Wilcoxon rank sum test (wilcox.test()) is executed.

The graphical representation consists of box plots overlaid with jitter plots showing each data point. In the case of Welch’s t-test, theconf.leve\(\cdot 100 \%\) - confidence intervals are also shown. The test statistics of the chosen test as well as the summary statistics of the generated box plots are returned as a list.

Examples

Welch’s t-test

As an example we use the motor trend car road test data set (mtcars), which consists of 32 observations. In the example below mpg denotes miles per US gallon, am the transmission type ((0 = automatic, 1 = manual)).

mtcars$am <- as.factor(mtcars$am)
t_test_statistics <- visstat(mtcars, "mpg", "am")

# t_test_statistics  # Uncomment this line to print out the test statistics

Increasing the confidence level conf.level from the default 0.95 to 0.99 results in wider confidence intervals:

mtcars$am <- as.factor(mtcars$am)
t_test_statistics_99 <- visstat(mtcars, "mpg", "am", conf.level = 0.99)

Wilcoxon rank sum test
grades_gender <- data.frame(
sex = as.factor(c(rep("girl", 21), rep("boy", 23))),
grade = c(
19.3, 18.1, 15.2, 18.3, 7.9, 6.2, 19.4,
20.3, 9.3, 11.3, 18.2, 17.5, 10.2, 20.1, 13.3, 17.2, 15.1, 16.2, 17.0,
16.5, 5.1, 15.3, 17.1, 14.8, 15.4, 14.4, 7.5, 15.5, 6.0, 17.4,
7.3, 14.3, 13.5, 8.0, 19.5, 13.4, 17.9, 17.7, 16.4, 15.6, 17.3, 19.9, 4.4, 2.1
)
)

wilcoxon_statistics <- visstat(grades_gender, "grade", "sex")

One-way test, ANOVA or Kruskal-Wallis test

If the feature consists of data of class "factor" with more than two levels and the response is of mode "numeric", visstat() performs an analysis of variance (ANOVA),if both of the following null hypotheses cannot be rejected on the chosen error of probability of error \(\alpha=1-\)conf.level

  1. Normality of the standardised residuals and

  2. homoscedasticity.

If only the first condition for the normality of the residuals is met, visstat() performs a one-way test (see oneway.test()). If the normality of the residuals cannot be assumed, a Kruskal-Wallis test (kruskal.test()) is used. These assumptions are tested by the visAnovaAssumptions() function.

Checking the ANOVA assumptions

Residual analysis

The function visAnovaAssumptions() checks the standardised residuals of the ANOVA fit for normality using both the Shapiro-Wilk-test shapiro.test() and the Anderson-Darling test ad.test(). It plots the standardised residuals against the fitted means of the linear model for each level of the feature varfactorand generates the Q-Q plot of the standardised residuals.

Homoscedasticity: homogeneity of variances in each level: Bartlett test

Both aov() and oneway.test() test whether two or more samples from normal distributions have the same mean. While aov() requires homogeneity of variances in each level (group), the oneway.test() does not require that the variances in each level are necessarily equal. Homoscedasticity is assessed using the Bartlett test, see bartlett.test(), under the null hypothesis that the variances in each of the levels are equal.

One-way test and ANOVA

Depending on the p-value of the bartlett.test(), the corresponding test is shown in the figure title:

  • If the p-value of the bartlett.test() is greater than 1-conf.level, we assume homogeneity of variances in each level (group) and the p-values of aov() is displayed.

  • Otherwise homoscedasticity cannot be assumed and the p-value of oneway.test() is reported.

Post-hoc analysis: Tukey`s honestly significant differences (HSD) and Sidak corrected confidence intervals

Simple multiple comparisons of the means for the factor levels in an analysis of variance inflate the probability of declaring a significant difference when, in fact, there is none . The family-wise error rate (also called the probability of a type I error) is the probability of at least one false positive comparison, in which the null hypothesis is falsely rejected, when multiple comparisons are performed.

Tukey`s honestly significant differences (HSD)

visstat() reduces this probability of a type I error by using Tukey’s honestly significant differences (HSD, see TukeyHSD()). It creates a set of confidence intervals on the differences between the means per factor level with the specified family-wise probability conf. If a confidence interval does not include the zero, then there is a significant difference between the pair. The set of confidence intervals for all pairwise comparisons is returned along with the Tukey HSD adjusted p-values.

In the graphical representation of One-way test and ANOVA, the green letters between two levels differ only, if the Tukey’s HSD corrected p-value for these two levels is smaller than \(\alpha=1-\)conf.int.

Sidak corrected confidence intervals

Tukey`s HSD procedure is based on pairwise comparisons of the differences between the means at each factor level and produces a set of corresponding confidence intervals. The Sidak procedure, on the other hand, addresses the problems of a type I error by lowering the acceptable probability of a type I error for all comparisons of the levels of the independent, categorical variable.

The Sidak corrected acceptable probability of error (Šidák 1967) is defined as \(\alpha_{Sidak}=1\)-conf.int\(^{1/M}\), where \(M=\frac{n\cdot (n-1)}{2}\) is the number of pairwise comparisons of the \(n\) levels of the categorical variable.

In the graphical display of One-way test and ANOVA, visstat() displays both the conf.level \(\cdot\; 100 \%\) confidence intervals alongside the larger, Sidak-corrected \((1-\alpha_{Sidak})\cdot 100\;\%\) confidence intervals.

Limitations

Note that the current structure of visstat() does not allow the study of interactions between the different levels of an independent variable.

Kruskal-Wallis test

If the p-value of the standardised residuals computed by shapiro.test() is smaller than the error probability 1-conf.level, visstat() chooses a non-parametric alternative, the Kruskal-Wallis rank sum test. kruskal.test() tests the null that the medians are equal at each group level. As post-hoc-analysis the pairwise Wilcoxon rank sum test pairwise.wilcox.test() is used, applying the default Holm method for multiple comparisons(Holm 1979). If the Holm-adjusted p-value for a pair is smaller than 1-confint, the green letters under the corresponding two box plots will differ. Otherwise the graphical representation of the Kruskal-Wallis test is similar to the Wilcoxon rank sum test described above. A list with the test statistics of the Kruskal-Wallis rank sum test and the p-values of the pairwise comparisons adjusted by the Holm method is returned.

Examples

One-way test:

The npk dataset reports the yield of peas in pounds/block on six blocks, where the application of nitrogen (N), phosphate (P) or potassium (K) fertilisers was varied. Either no, one, two or three different fertilisers were applied on the blocks.

oneway_npk <- visstat(npk, "yield", "block")

We can assume that the residuals are normally distributed based on the scatterplots of the standardised residuals, the normal quantile-quantile plot (Q-Q plot), and the p-values of both the Shapiro-Wilk test and the Anderson-Darling test. But at the given confidence level the homogeneity of variances cannot be assumed (\(p< \alpha\) as calculated with the bartlett.test()), and the p-value of the oneway.test() is displayed.

Post-hoc analysis with TukeyHSD() shows no significant difference between the yield between the different blocks (all green letters are equal).

ANOVA

The InsectSprays data gives the counts of insects in agricultural experimental units treated with six different insecticides. To stabilise the variance in counts, we transform the count data of the InsectSprays data set by the square root.

insect_sprays_tr <- InsectSprays
insect_sprays_tr$count_sqrt <- sqrt(InsectSprays$count)
visstat(insect_sprays_tr, "count_sqrt", "spray")

After the transformation, the homogeneity of variances can be assumed (\(p> \alpha\) as calculated with the bartlett.test()), and the p-value of the aov() is displayed.

Kruskal-Wallis rank sum test

The iris data set gives the measurement of the petal width in cm for three different iris species.

visstat(iris, "Petal.Width", "Species")

In the iris data example, the graphical analysis of the scatter plots of the standardised residuals as well as the Q-Q plot suggest that the residuals are not normally distributed. This visual inspection is confirmed by the very small p-values of the implemented tests of normality, the Shapiro-Wilk test and the Anderson-Darling test. If both p-values of the Shapiro-Wilk test and the Anderson-Darling test are smaller than \(\alpha=1-\)conf.int (as in the example above ), visstat switches to the non-parametric alternative kruskal.test(). Post-hoc analysis with pairwise.wilcox.test() reveals significant differences between the petal width of all three species (all green letters differ).

Linear Regression

If the feature varfactor and the response varsample have only one level of type numerical or integer, visstat() performs a simple linear regression.

Residual analysis

visstat() checks the normal distribution of the standardised residuals derived from lm() both graphically and with the Shapiro-Wilk and Anderson test (analogue to section Residual analysis). If the p-values of the null that the standardised residuals are normally distributed of both Shapiro-Wilk and Anderson test are smaller than 1-conf.int, the title of the residual plot will display the message “Requirement of normally distributed residuals not met”.

Regardless of the result of the residual analysis, visstat() performs the regression itself in the next step. The title of the graphical output indicates the chosen confidence level conf.level,the regression parameter with their confidence intervals and p-values, and the adjusted \(R^2\). The graph shows the raw data, the regression line and both the confidence and prediction bands corresponding to the chosen conf.level. visstat() returns a list with the test statistics of the linear regression, the p-values of the normality tests of the standardised residuals and the pointwise estimates of the confidence and prediction bands.

Examples

Data set: cars

The cars data set reports the speed of cars in mph and the distance (dist) in ft taken to stop.

linreg_cars <- visstat(cars, "dist", "speed")

Increasing the confidence level conf.level from the default 0.95 to 0.99 results in wider confidence and prediction bands:

linreg_cars <- visstat(cars, "dist", "speed", conf.level = 0.99)

p-values greater than conf.level in both Anderson-Darling normality test and the Shapiro-Wilk test of the standardised residuals indicate that the normality assumption of the residuals underlying the linear regression is met.

Data set: trees

The trees data set provides measurements of the diameter (called “Girth”) in inches and height in feet of black cherry trees.

linreg_cars <- visstat(trees, "Height", "Girth", conf.level = 0.9)

Both the graphical analysis of the standardised residuals and p-values less than conf.level in the Anderson-Darling normality test and the Shapiro-Wilk test of the standardised residuals suggest that the condition of normally distributed residuals of the regression model is not met. Furthermore the linear regression model explains only 24% of the total variance of the dependent variables “Height” of the cherry trees. The user might consider other regression models. These further tests are not provided by visstat().

\({\chi}^2\)- and Fisher Test

If both the feature varfactor and the response are varsample are categorical of type factor, visstat tests the null hypothesis, that the feature and the response are independent of each other.

Categorical data are usually presented as multidimensional arrays, called contingency tables. The values in each cell of the contingency table are the observed frequencies for each unique combination of feature and response. Based on the observed frequencies, visstat() calculates the expected frequencies under the null hypothesis.

If the expected frequencies are large, where large is defined as at least 80% of the expected frequencies being greater than 5 and none of the expected frequencies being less than 1, the chisqu.test() is performed, otherwise the fisher.test() [@Cochran]. In the case of 2-by-2 contingency tables, the continuity correction is applied to the thechisqu.test()`.

visstat() prints a grouped column plot with the p-value of the corresponding test in the title and a mosaic plot with Pearson’s residuals (for details see documentation of function mosaic() in the vcd package ) are generated.

Converting a contingency tables to a data.frame.

visstat() needs a data.frame with a column structure as input. Contingency tables can be transformed into this structure with the helper function counts_to_cases().

Examples based on data set: HairEyeColor

Creating a data.frame from a contingency table

counts_to_cases() transforms the contingency table HairEyeColor into data.frame named HairEyeColorDataFrame.

HairEyeColorDataFrame <- counts_to_cases(as.data.frame(HairEyeColor))

Pearson’s Chi-squared test —

hair_eye_color_df <- counts_to_cases(as.data.frame(HairEyeColor))
visstat(hair_eye_color_df, "Hair", "Eye")

Pearson’s Chi-squared test with Yate’s continuity correction

For 2 by 2 contingency tables, vistat() applies the continuity correction to the test statistics of the Chi-squared test.

In the following example, we select only participants with black or brown hair and brown or blue eyes resulting in a 2 by 2 contingency table.

hair_black_brown_eyes_brown_blue <- HairEyeColor[1:2, 1:2, ]
#Transform to data frame
hair_black_brown_eyes_brown_blue_df<- counts_to_cases(as.data.frame(hair_black_brown_eyes_brown_blue))
#Chi-squared test
visstat(hair_black_brown_eyes_brown_blue_df, "Hair", "Eye")

Fisher’s exact test and mosaic plot with Pearson residuals

Again, we cut a 2 x2 contingency table from the full data set, this time only keeping only male participants with black or brown hair and hazel or green eyes. Pearson’s Chi-squared on this 2 x2 contingency table would give to an expected value less than 5 in one of the four cells (corresponding to 25% of all cells), violating the requirement of the Chi-squared test that the expected value should be at least 5 for the majority (80%) of the cells (Cochran 1954). Therefore the Fisher exact test is chosen.

hair_eye_color_male <- HairEyeColor[, , 1]
# Slice out a 2 by 2 contingency table
black_brown_hazel_green_male <- hair_eye_color_male[1:2, 3:4]
#Transform to data frame
black_brown_hazel_green_male <- counts_to_cases(as.data.frame(black_brown_hazel_green_male))
# Fisher test
fisher_stats <- visstat(black_brown_hazel_green_male, "Hair", "Eye")

Saving the graphical output

The generated graphics can be saved in the file formats supported by Cairo(): “png”, “jpeg”, “pdf”, “svg”, “ps” and “tiff”. In the following example we store the graphics files in the output format “png” to the plotDirectory tempdir(). The naming convention used for the graphics file reflects the chosen statistical test and the variable names.

visstat(black_brown_hazel_green_male, "Hair", "Eye",
graphicsoutput = "png", plotDirectory =

tempdir()
)

Remove the graphical output from plotDirectory:

file.remove(file.path(tempdir(), "chi_squared_or_fisher_Hair_Eye.png"))
#> [1] TRUE
file.remove(file.path(tempdir(), "mosaic_complete_Hair_Eye.png"))
#> [1] TRUE

Overview of implemented tests

t.test(), wilcox.test(), aov(), kruskal.test(), lm(),fisher.test(), chisqu.test()

Implemented tests to check the normal distribution of standardised residuals

shapiro.test() and ad.test()

Implemented post-hoc tests

TukeyHSD() for aov() and pairwise.wilcox.test() for kruskal.test().

Bibliography

Cochran, William G. and. 1954. The combination of estimates from different experiments.” Scandinavian Journal of Statistics 10 (1): 101–20. https://doi.org/10.2307/3001666.
Holm, S. 1979. A Simple Sequentially Rejective Multiple Test Procedure.” Scandinavian Journal of Statistics 6: 65–70.
Lumley, Thomas, Paula Diehr, Scott Emerson, and Lu Chen. 2002. The importance of the normality assumption in large public health data sets. Annual Review of Public Health 23: 151–69. https://doi.org/10.1146/annurev.publhealth.23.100901.140546.
Moser, B K, and GR Stevens. 1992. Homogeneity of variance in the two-sample means test.” The American Statistician, February, 19–21. https://doi.org/10.1080/00031305.1992.10475839.
Rasch, Dieter, Klaus D Kubinger, and Karl Moder. 2011. The two-sample t test: pre-testing its assumptions does not pay off.” Statistical Papers 52 (1): 219–31. https://doi.org/10.1007/s00362-009-0224-x.
Šidák, Zbyněk. 1967. Rectangular Confidence Regions for the Means of Multivariate Normal Distributions.” Journal of the American Statistical Association 62 (318): 626–33. https://doi.org/10.1080/01621459.1967.10482935.