Abstract
The R package visStatistics
provides means to quickly
visualise and analyse raw data by selecting, based on a decision tree,
the statistical hypothesis test with the highest statistical power
between the dependent variable (response) named varsample
and the independent variable (feature) named varfactor
in a
data.frame
named dataframe
. A minimal function
call has the structure:
visstat(dataframe,varsample,varfactor)
The data in the provided dataframe
must be structured by
column, where varsample
and varfactor
are the
character strings of the column names of the dependent (response) and
independent (feature) variables, respectively.
The choice of statistical tests performed by the core function
visstat()
depends on wether the data are numerical or
categorical, the number of levels of categorical data and the data
distributions (normal versus non-normal).
Data of class
"numeric"
or
"integer"
will be referred in the reminder of this vignette
as “numerical”, data of class
"factor"
will be
referred in the remainder of this vignette as “categorical”.
The function returns the corresponding test statistics, including any post-hoc-analysis, and generates a graph showing the key statistics of the underlying test.
visStatistics
provides a fully automated workflow. It
has been successfully used for the unbiased analysis of raw medical
data.
The remainder of this vignette focuses on the the algorithm
underlying the decision tree of vistat()
, while a call to
?visstat
documents all parameter settings of the
function.
All implemented statistical tests are called with their default
parameter sets, except for conf.level
, which can be
adjusted in the visstat()
function call. As a consequence,
no paired tests are available.
Comparing central tendencies
If the feature consists of data of class
"factor"
with two or more levels and the response consists
of data of class
"numeric"
or
"integer"
(both of mode "numeric"
), tests are
applied to compare the central tendencies.
Two sample tests: Welch’s t-test or Wilcoxon rank sum test
If the feature has exactly two levels,Welch’s t-test or its non-parametric alternative, the Wilcoxon rank sum test, is performed. Welch’s t-test tests the null hypothesis that the two samples have the same mean. While Student’s t-test requires homoscedasticity, Welch’s t-test does not. It loses little robustness compared to Student’s t-test when assumptions of Student’s t-test are met (Moser and Stevens 1992; Delacre:2017iv?). Therefore the implementation of Studen’t-test has been omitted. The null hypothesis of the Wilcoxon rank sum test is that both samples (levels) are drawn from the same population or have identical distributions.
The test choice follows the algorithm below:
If the sample size for both levels is greater than 30, always perform Welch’s t-test (
t.test()
) (@ Rasch, Kubinger, and Moder 2011; Lumley et al. 2002).If the sample size of at least one of the levels is smaller than 30, first check for normality of both levels with the Shapiro-Wilk normality test (
shapiro.test()
):If the p-values of the
shapiro.test()
of both levels are greater than the error probability \(\alpha = 1-\)conf.level
, perform the Welchs’ t-tes (t.test()
).If the p-value of at least one of the levels in the
shapiro.test()
is smaller than the error probability \(\alpha=1-\)conf.level
, a Wilcoxon rank sum test (wilcox.test()
) is executed.
The graphical representation consists of box plots overlaid with
jitter plots showing each data point. In the case of Welch’s t-test,
theconf.leve
\(\cdot 100
\%\) - confidence intervals are also shown. The test statistics
of the chosen test as well as the summary statistics of the generated
box plots are returned as a list.
Examples
Welch’s t-test
As an example we use the motor trend car road test data set
(mtcars)
, which consists of 32 observations. In the example
below mpg
denotes miles per US gallon, am
the
transmission type ((0 = automatic, 1 = manual)).
# t_test_statistics # Uncomment this line to print out the test statistics
Increasing the confidence level conf.level
from the
default 0.95 to 0.99 results in wider confidence intervals:
mtcars$am <- as.factor(mtcars$am)
t_test_statistics_99 <- visstat(mtcars, "mpg", "am", conf.level = 0.99)
Wilcoxon rank sum test
grades_gender <- data.frame(
sex = as.factor(c(rep("girl", 21), rep("boy", 23))),
grade = c(
19.3, 18.1, 15.2, 18.3, 7.9, 6.2, 19.4,
20.3, 9.3, 11.3, 18.2, 17.5, 10.2, 20.1, 13.3, 17.2, 15.1, 16.2, 17.0,
16.5, 5.1, 15.3, 17.1, 14.8, 15.4, 14.4, 7.5, 15.5, 6.0, 17.4,
7.3, 14.3, 13.5, 8.0, 19.5, 13.4, 17.9, 17.7, 16.4, 15.6, 17.3, 19.9, 4.4, 2.1
)
)
wilcoxon_statistics <- visstat(grades_gender, "grade", "sex")
One-way test, ANOVA or Kruskal-Wallis test
If the feature consists of data of class
"factor"
with more than two levels and the response is of
mode
"numeric"
, visstat()
performs an analysis of variance (ANOVA),if both of the following null
hypotheses cannot be rejected on the chosen error of probability of
error \(\alpha=1-\)conf.level
Normality of the standardised residuals and
homoscedasticity.
If only the first condition for the normality of the residuals is
met, visstat()
performs a one-way test (see
oneway.test()
). If the normality of the residuals cannot be
assumed, a Kruskal-Wallis test (kruskal.test()
) is used.
These assumptions are tested by the visAnovaAssumptions()
function.
Checking the ANOVA assumptions
Residual analysis
The function visAnovaAssumptions()
checks the
standardised residuals of the ANOVA fit for normality using both the
Shapiro-Wilk-test shapiro.test()
and the Anderson-Darling
test ad.test()
. It plots the standardised residuals against
the fitted means of the linear model for each level of the feature
varfactor
and generates the Q-Q plot of the standardised
residuals.
Homoscedasticity: homogeneity of variances in each level: Bartlett test
Both aov()
and oneway.test()
test whether
two or more samples from normal distributions have the same mean. While
aov()
requires homogeneity of variances in each level
(group), the oneway.test()
does not require that the
variances in each level are necessarily equal. Homoscedasticity is
assessed using the Bartlett test, see bartlett.test()
,
under the null hypothesis that the variances in each of the levels are
equal.
One-way test and ANOVA
Depending on the p-value of the bartlett.test()
, the
corresponding test is shown in the figure title:
If the p-value of the
bartlett.test()
is greater than1-conf.level
, we assume homogeneity of variances in each level (group) and the p-values ofaov()
is displayed.Otherwise homoscedasticity cannot be assumed and the p-value of
oneway.test()
is reported.
Post-hoc analysis: Tukey`s honestly significant differences (HSD) and Sidak corrected confidence intervals
Simple multiple comparisons of the means for the factor levels in an analysis of variance inflate the probability of declaring a significant difference when, in fact, there is none . The family-wise error rate (also called the probability of a type I error) is the probability of at least one false positive comparison, in which the null hypothesis is falsely rejected, when multiple comparisons are performed.
Tukey`s honestly significant differences (HSD)
visstat()
reduces this probability of a type I error by
using Tukey’s honestly significant differences (HSD, see
TukeyHSD()
). It creates a set of confidence intervals on
the differences between the means per factor level with the specified
family-wise probability conf
. If a confidence interval does
not include the zero, then there is a significant difference between the
pair. The set of confidence intervals for all pairwise comparisons is
returned along with the Tukey HSD adjusted p-values.
In the graphical representation of One-way test and ANOVA, the green
letters between two levels differ only, if the Tukey’s HSD corrected
p-value for these two levels is smaller than \(\alpha=1-\)conf.int
.
Sidak corrected confidence intervals
Tukey`s HSD procedure is based on pairwise comparisons of the differences between the means at each factor level and produces a set of corresponding confidence intervals. The Sidak procedure, on the other hand, addresses the problems of a type I error by lowering the acceptable probability of a type I error for all comparisons of the levels of the independent, categorical variable.
The Sidak corrected acceptable probability of error (Šidák 1967) is defined as \(\alpha_{Sidak}=1\)-conf.int
\(^{1/M}\), where \(M=\frac{n\cdot (n-1)}{2}\) is the number of
pairwise comparisons of the \(n\)
levels of the categorical variable.
In the graphical display of One-way test and ANOVA,
visstat()
displays both the conf.level
\(\cdot\; 100 \%\) confidence intervals
alongside the larger, Sidak-corrected \((1-\alpha_{Sidak})\cdot 100\;\%\)
confidence intervals.
Limitations
Note that the current structure of visstat()
does not
allow the study of interactions between the different levels of an
independent variable.
Kruskal-Wallis test
If the p-value of the standardised residuals computed by
shapiro.test()
is smaller than the error probability
1-conf.level
, visstat()
chooses a
non-parametric alternative, the Kruskal-Wallis rank sum test.
kruskal.test()
tests the null that the medians are equal at
each group level. As post-hoc-analysis the pairwise Wilcoxon rank sum
test pairwise.wilcox.test()
is used, applying the default
Holm method for multiple comparisons(Holm
1979). If the Holm-adjusted p-value for a pair is smaller than
1-confint
, the green letters under the corresponding two
box plots will differ. Otherwise the graphical representation of the
Kruskal-Wallis test is similar to the Wilcoxon rank sum test described
above. A list with the test statistics of the Kruskal-Wallis rank sum
test and the p-values of the pairwise comparisons adjusted by the Holm
method is returned.
Examples
One-way test:
The npk
dataset reports the yield of peas in
pounds/block on six blocks, where the application of nitrogen (N),
phosphate (P) or potassium (K) fertilisers was varied. Either no, one,
two or three different fertilisers were applied on the blocks.
oneway_npk <- visstat(npk, "yield", "block")
We can assume that the residuals are normally distributed based on
the scatterplots of the standardised residuals, the normal
quantile-quantile plot (Q-Q plot), and the p-values of both the
Shapiro-Wilk test and the Anderson-Darling test. But at the given
confidence level the homogeneity of variances cannot be assumed (\(p< \alpha\) as calculated with the
bartlett.test()
), and the p-value of the
oneway.test()
is displayed.
Post-hoc analysis with TukeyHSD()
shows no significant
difference between the yield between the different blocks (all green
letters are equal).
ANOVA
The InsectSprays data gives the counts of insects in agricultural
experimental units treated with six different insecticides. To stabilise
the variance in counts, we transform the count data of the
InsectSprays
data set by the square root.
insect_sprays_tr <- InsectSprays
insect_sprays_tr$count_sqrt <- sqrt(InsectSprays$count)
visstat(insect_sprays_tr, "count_sqrt", "spray")
After the transformation, the homogeneity of variances can be assumed
(\(p> \alpha\) as calculated with
the
bartlett.test()
), and the p-value of the
aov()
is displayed.
Kruskal-Wallis rank sum test
The iris data set gives the measurement of the petal width in cm for three different iris species.
visstat(iris, "Petal.Width", "Species")
In the iris data example, the graphical analysis of the scatter plots
of the standardised residuals as well as the Q-Q plot suggest that the
residuals are not normally distributed. This visual inspection is
confirmed by the very small p-values of the implemented tests of
normality, the Shapiro-Wilk test and the Anderson-Darling test. If
both p-values of the Shapiro-Wilk test and the
Anderson-Darling test are smaller than \(\alpha=1-\)conf.int
(as in the
example above ), visstat
switches to the non-parametric
alternative kruskal.test()
. Post-hoc analysis with
pairwise.wilcox.test()
reveals significant differences
between the petal width of all three species (all green letters
differ).
Linear Regression
If the feature varfactor
and the response
varsample
have only one level of type
numerical
or integer
, visstat()
performs a simple linear regression.
Residual analysis
visstat()
checks the normal distribution of the
standardised residuals derived from lm()
both graphically
and with the Shapiro-Wilk and Anderson test (analogue to section Residual analysis). If the p-values of the
null that the standardised residuals are normally distributed of both
Shapiro-Wilk and Anderson test are smaller than 1-conf.int
,
the title of the residual plot will display the message “Requirement of
normally distributed residuals not met”.
Regardless of the result of the residual analysis,
visstat()
performs the regression itself in the next step.
The title of the graphical output indicates the chosen confidence level
conf.level
,the regression parameter with their confidence
intervals and p-values, and the adjusted \(R^2\). The graph shows the raw data, the
regression line and both the confidence and prediction bands
corresponding to the chosen conf.level. visstat()
returns a
list with the test statistics of the linear regression, the p-values of
the normality tests of the standardised residuals and the pointwise
estimates of the confidence and prediction bands.
Examples
Data set: cars
The cars data set reports the speed of cars in mph and the distance (dist) in ft taken to stop.
linreg_cars <- visstat(cars, "dist", "speed")
Increasing the confidence level conf.level
from the
default 0.95 to 0.99 results in wider confidence and prediction
bands:
linreg_cars <- visstat(cars, "dist", "speed", conf.level = 0.99)
p-values greater than conf.level
in both
Anderson-Darling normality test and the Shapiro-Wilk test of the
standardised residuals indicate that the normality assumption of the
residuals underlying the linear regression is met.
Data set: trees
The trees data set provides measurements of the diameter (called “Girth”) in inches and height in feet of black cherry trees.
linreg_cars <- visstat(trees, "Height", "Girth", conf.level = 0.9)
Both the graphical analysis of the standardised residuals and
p-values less than conf.level
in the Anderson-Darling
normality test and the Shapiro-Wilk test of the standardised residuals
suggest that the condition of normally distributed residuals of the
regression model is not met. Furthermore the linear regression model
explains only 24% of the total variance of the dependent variables
“Height” of the cherry trees. The user might consider other regression
models. These further tests are not provided by
visstat()
.
\({\chi}^2\)- and Fisher Test
If both the feature varfactor
and the response are
varsample
are categorical of type factor
,
visstat
tests the null hypothesis, that the feature and the
response are independent of each other.
Categorical data are usually presented as multidimensional arrays,
called contingency tables. The values in each cell of the contingency
table are the observed frequencies for each unique combination of
feature and response. Based on the observed frequencies,
visstat()
calculates the expected frequencies under the
null hypothesis.
If the expected frequencies are large, where large is defined as at
least 80% of the expected frequencies being greater than 5 and none of
the expected frequencies being less than 1, the
chisqu.test()
is performed, otherwise the
fisher.test() [@Cochran]. In the case of 2-by-2 contingency tables, the continuity correction is applied to the the
chisqu.test()`.
visstat()
prints a grouped column plot with the p-value
of the corresponding test in the title and a mosaic plot with Pearson’s
residuals (for details see documentation of function
mosaic()
in the vcd
package ) are
generated.
Converting a contingency tables to a data.frame
.
visstat()
needs a data.frame
with a column
structure as input. Contingency tables can be transformed into this
structure with the helper function counts_to_cases()
.
Examples based on data set: HairEyeColor
Creating a data.frame
from a contingency table
counts_to_cases()
transforms the contingency table
HairEyeColor
into data.frame
named
HairEyeColorDataFrame
.
HairEyeColorDataFrame <- counts_to_cases(as.data.frame(HairEyeColor))
Pearson’s Chi-squared test —
hair_eye_color_df <- counts_to_cases(as.data.frame(HairEyeColor))
visstat(hair_eye_color_df, "Hair", "Eye")
Pearson’s Chi-squared test with Yate’s continuity correction
For 2 by 2 contingency tables, vistat()
applies the
continuity correction to the test statistics of the Chi-squared
test.
In the following example, we select only participants with black or brown hair and brown or blue eyes resulting in a 2 by 2 contingency table.
hair_black_brown_eyes_brown_blue <- HairEyeColor[1:2, 1:2, ]
#Transform to data frame
hair_black_brown_eyes_brown_blue_df<- counts_to_cases(as.data.frame(hair_black_brown_eyes_brown_blue))
#Chi-squared test
visstat(hair_black_brown_eyes_brown_blue_df, "Hair", "Eye")
Fisher’s exact test and mosaic plot with Pearson residuals
Again, we cut a 2 x2 contingency table from the full data set, this time only keeping only male participants with black or brown hair and hazel or green eyes. Pearson’s Chi-squared on this 2 x2 contingency table would give to an expected value less than 5 in one of the four cells (corresponding to 25% of all cells), violating the requirement of the Chi-squared test that the expected value should be at least 5 for the majority (80%) of the cells (Cochran 1954). Therefore the Fisher exact test is chosen.
hair_eye_color_male <- HairEyeColor[, , 1]
# Slice out a 2 by 2 contingency table
black_brown_hazel_green_male <- hair_eye_color_male[1:2, 3:4]
#Transform to data frame
black_brown_hazel_green_male <- counts_to_cases(as.data.frame(black_brown_hazel_green_male))
# Fisher test
fisher_stats <- visstat(black_brown_hazel_green_male, "Hair", "Eye")
Saving the graphical output
The generated graphics can be saved in the file formats supported by
Cairo():
“png”, “jpeg”, “pdf”, “svg”, “ps” and “tiff”. In
the following example we store the graphics files in the output format
“png” to the plotDirectory
tempdir()
. The
naming convention used for the graphics file reflects the chosen
statistical test and the variable names.
visstat(black_brown_hazel_green_male, "Hair", "Eye",
graphicsoutput = "png", plotDirectory =
tempdir()
)
Remove the graphical output from plotDirectory
:
file.remove(file.path(tempdir(), "chi_squared_or_fisher_Hair_Eye.png"))
#> [1] TRUE
file.remove(file.path(tempdir(), "mosaic_complete_Hair_Eye.png"))
#> [1] TRUE
Overview of implemented tests
t.test()
, wilcox.test()
,
aov()
, kruskal.test()
,
lm()
,fisher.test()
,
chisqu.test()
Implemented tests to check the normal distribution of standardised residuals
shapiro.test()
and ad.test()
Implemented post-hoc tests
TukeyHSD()
for aov()
and
pairwise.wilcox.test()
for kruskal.test()
.