Skip to contents

Abstract

The package visStatistics enables rapid visualization and statistical analysis of raw data by automatically selecting the hypothesis test with the highest statistical power to evaluate the relationship between a response (varsample) and a feature (varfactor) within a data.frame. Its core function, visstat(), generates a graph showing key statistics from the selected test and returns the corresponding test results, including any post-hoc analysis.

A minimal function call has the form visstat(dataframe, varsample, varfactor).

The input data.frame must be column-based, and the arguments varsample and varfactor must be character strings naming columns of the data.frame. The significance level α\alpha, used throughout for hypothesis testing, is defined as 1 - conf.level, where conf.level is a user-controllable argument (defaulting to 0.95). Data of class "numeric" or "integer" are referred to in the remainder of this vignette as numerical, while data of class "factor" are referred to as categorical.

The choice of statistical tests performed by the function visstat() depends on whether the data are numerical or categorical, the number of levels in the categorical variable, and the distribution of the data. The function prioritizes interpretable visual output and tests that remain valid under the assumptions evaluated applying the following decision logic:

  1. When the response is numerical and the predictor is categorical, test of central tendencies are performed. If the categorical predictor has two levels, Welch’s t-test (t.test()) is used, if both groups have more than 30 observations, relying on the central limit theorem to justify its application regardless of underlying normality (Lumley et al. 2002). For smaller samples, group-wise normality is assessed using the Shapiro-Wilk test (shapiro.test()). If both groups return p-values greater than α\alpha, Welch’s t-test is applied; otherwise, the Wilcoxon rank-sum test (wilcox.test()) is used. For predictors with more than two levels, an ANOVA model (aov()) is initially fitted. Residual normality is evaluated using both the Shapiro-Wilk test (shapiro.test()) and Anderson-Darling test (ad.test()), and normality is assumed if either test yields p>αp > \alpha. If the residuals are considered normal, Bartlett’s test (bartlett.test()) is used to assess homoscedasticity. When variances are homogeneous (p>αp > \alpha), ANOVA is applied with Tukey’s HSD (TukeyHSD()) for post-hoc comparison. If variances differ significantly (pαp \le \alpha), Welch’s one-way test (oneway.test()) is used, also followed by Tukey’s HSD. If residuals are not normally distributed according to both tests (pαp \le \alpha), the Kruskal-Wallis test (kruskal.test()) is selected, followed by pairwise Wilcoxon tests (pairwise.wilcox.test()).

  2. When both the response and predictor are numerical, a simple linear regression model (lm()) is fitted and analysed in detail, including residual diagnostics, formal tests, and the plotting of fitted values with confidence bands. Note that only one explanatory variable is allowed, as the function is designed for two-dimensional visualisation.

  3. In the case of two categorical variables, visstat() tests the null hypothesis that the predictor and response variables are independent using either chisq.test() or fisher.test(). The choice of test is based on Cochran’s rule (Cochran 1954), which advises that the χ2\chi^2 approximation is reliable only if no expected cell count is zero and no more than 20 percent of cells have expected counts below 5.

Note: Except for the user-adjustable conf.level parameter, all statistical tests are applied using their default settings from the corresponding base R functions (e.g., t.test()). As a consequence, paired tests are not currently supported. Furthermore, since the main purpose of this package is to visualize statistical test results, only simple linear regression is implemented.

Decision logic for selecting tests of central tendency

If the feature consists of classfactor” with two or more levels and the response consists is of classnumeric” or “integer” (both having mode “numeric”), statistical tests are applied to compare the central tendencies across groups. This section describes the conditions under which parametric and non-parametric tests are chosen, based on the response type, the number of factor levels, and the underlying distributional assumptions. A graphical overview of the decision logic used is provided in below figure.

knitr::include_graphics("../man/figures/decision_tree.png")
Decision tree used to select the appropriate statistical test for a categorical predictor and numerical response, based on the number of factor levels, normality and homoscedasticity.

Decision tree used to select the appropriate statistical test for a categorical predictor and numerical response, based on the number of factor levels, normality and homoscedasticity.

Two sample tests: Welch’s t-test or Wilcoxon rank sum test

When the feature variable has exactly two levels, Welch’s t-test is used to compare group means if the assumption of normality is satisfied. Otherwise, the Wilcoxon rank sum test is applied.

Welch’s t-test evaluates the null hypothesis that the two groups have equal means. Unlike Student’s t-test, which assumes equal variances (homoscedasticity), Welch’s test does not require this assumption. Welch’s test exhibits minimal loss of robustness even when Student’s t-test assumptions are satisfied (Moser and Stevens 1992; Delacre et al. 2017). Therefore, Student’s t-test is not implemented.

The Wilcoxon rank sum test, in contrast, is a non-parametric method that compares two distributions without requiring normality.

The function selects between these tests as follows. If both groups have more than 30 observations, Welch’s t-test (t.test()) is always performed. For large samples, the test remains robust even when normality is not strictly met (Rasch, Kubinger, and Moder 2011; Lumley et al. 2002).

If either group has fewer than 30 observations, the function applies the Shapiro-Wilk test (shapiro.test()) to each group. If both pp-values exceed the significance level α=1\alpha = 1 -conf.level, Welch’s t-test is applied. If at least one pp-value falls below α\alpha, the Wilcoxon rank sum test (wilcox.test()) is used instead.

The graphical output consists of box plots overlaid with jittered points to display individual observations. When Welch’s t-test is applied, the function includes confidence intervals based on the user-specified conf.level. The function returns a list containing the results of the applied test and the summary statistics used to construct the plot.

Examples

Welch’s t-test

As an example, we use the Motor Trend Car Road Tests data set (mtcars), which contains 32 observations. In the example below, mpg denotes miles per (US) gallon, and am represents the transmission type (0 = automatic, 1 = manual).

mtcars$am <- as.factor(mtcars$am)
t_test_statistics <- visstat(mtcars, "mpg", "am")

Increasing the confidence level conf.level from the default 0.95 to 0.99 results in wider confidence intervals, as a higher confidence level requires more conservative bounds to ensure that the interval includes the true parameter value with greater certainty.

mtcars$am <- as.factor(mtcars$am)
t_test_statistics_99 <- visstat(mtcars, "mpg", "am", conf.level = 0.99)

Wilcoxon rank sum test
grades_gender <- data.frame(
  sex = as.factor(c(rep("girl", 21), rep("boy", 23))),
  grade = c(
    19.3, 18.1, 15.2, 18.3, 7.9, 6.2, 19.4,
    20.3, 9.3, 11.3, 18.2, 17.5, 10.2, 20.1, 13.3, 17.2, 15.1, 16.2, 17.0,
    16.5, 5.1, 15.3, 17.1, 14.8, 15.4, 14.4, 7.5, 15.5, 6.0, 17.4,
    7.3, 14.3, 13.5, 8.0, 19.5, 13.4, 17.9, 17.7, 16.4, 15.6, 17.3, 19.9, 4.4, 2.1
  )
)

wilcoxon_statistics <- visstat(grades_gender, "grade", "sex")

One-way test, ANOVA or Kruskal-Wallis test

If the feature is of classfactor” with more than two levels and the response is of modenumeric”, visstat() initially attempts an analysis of variance (ANOVA). ANOVA is performed only if both of the following null hypotheses cannot be rejected at the specified conf.level:

  1. The standardized residuals follow a normal distribution, and
  2. The residuals exhibit homoscedasticity (equal variances across groups).

If only the assumption of normality is satisfied, visstat() applies Welch’s one-way test (oneway.test()). If the normality assumption is violated, a Kruskal-W allis test (kruskal.test()) is used instead.

These assumptions are tested internally using the visAnovaAssumptions() function.

Checking the ANOVA assumptions

Residual analysis

The visAnovaAssumptions() function assesses the normality of standardized residuals from the ANOVA fit using both the Shapiro-W ilk test (shapiro.test()) and the Anderson-D arling test (ad.test()). Normality is assumed if at least one of the two tests yields a pp-value greater than 11 -conf.level.

The function generates two diagnostic plots:
- a scatterplot of the standardized residuals against the fitted means of the linear model for each level of the feature (varfactor), and
- a Q-Q plot of the standardized residuals.

Homoscedasticity: homogeneity of variances in each level: Bartlett test

Both aov() and oneway.test() assess whether two or more samples drawn from normal distributions have the same mean. While aov() assumes homogeneity of variances across groups, oneway.test() does not require the variances to be equal.

Homoscedasticity is assessed using Bartletts test (bartlett.test()`), which tests the null hypothesis that the variances across all levels of the grouping variable are equal.

One-way test and ANOVA

Depending on the pp-value returned by bartlett.test(), the corresponding test is selected and its pp-value is displayed in the figure title:

  • If the pp-value of bartlett.test() is greater than 11 -conf.level, we assume homogeneity of variances across groups and report the pp-value from the ANOVA (aov()).
  • Otherwise, homoscedasticity cannot be assumed, and the function reports the pp-value from Welch’s one-way test (oneway.test()).
Post-hoc analysis: Tukey’s Honestly Significant Differences (HSD) and Sidak-corrected confidence intervals

Simple pairwise comparisons of group means following an analysis of variance increase the probability of incorrectly declaring a significant difference when, in fact, there is none .

This inflation of error is quantified by the family-wise error rate, which refers to the probability of making at least one Type I error, that is, falsely rejecting the null hypothesis across all pairwise comparisons.

To control it, visstat() performs post-hoc analysis using Tukey’s Honestly Significant Difference (HSD) test and displays Sidak-corrected confidence intervals.

The visstat() function controls the probability of a Type I error by applying Tukey`s Honestly Significant Differences procedure, as implemented in TukeyHSD().
Based on the specified confidence level (conf.level), it constructs a set of confidence intervals for all pairwise differences between factor level means. A significant difference between two means is indicated when the corresponding confidence interval does not include zero.
The function returns both the HSD-adjusted pp-values and the associated confidence intervals for all pairwise comparisons.

In the graphical output for the one-way test and ANOVA, green letters displayed below each group summarize the results of the Tukey HSD post-hoc test: Two groups are considered significantly different if they are assigned different letters, indicating a Tukey HSD-adjusted pp-value smaller than α=1\alpha = 1 -conf.level.

Tukey`s HSD procedure is based on pairwise comparisons of the differences between the means at each factor level and produces a set of corresponding confidence intervals. The Sidak procedure, on the other hand, addresses the problems of a type I error by lowering the acceptable probability of a type I error for all comparisons of the levels of the independent, categorical variable.

The Sidak corrected acceptable probability of error (Šidák 1967) is defined as αSidak=(1\alpha_{Sidak}=(1-conf.int)1/M)^{1/M}, where M=n(n1)2M=\frac{n\cdot (n-1)}{2} is the number of pairwise comparisons of the nn levels of the categorical variable.

In the graphical display of One-way test and ANOVA, visstat() displays both the conf.level 100%\cdot\; 100 \%-confidence intervals alongside the larger, Sidak-corrected (1αSidak)100%(1-\alpha_{Sidak})\cdot 100\;\%-confidence intervals.

Note that the current structure of visstat() does not allow the study of interactions between the different levels of an independent variable.

Kruskal-Wallis test

If the pp-value from the Shapiro-Wilk test (shapiro.test()) applied to the standardized residuals is smaller than the significance level 11 -conf.level, visstat() selects a non-parametric alternative: the Kruskal-Wallis rank sum test.

The function kruskal.test() tests the null hypothesis that the group medians are equal across all levels of the categorical feature.

Post-hoc analysis: pairwise.wilcox.test()

As post-hoc analysis following the Kruskal-Wallis test, visstat() applies the pairwise Wilcoxon rank sum test using pairwise.wilcox.test(), with Holm-s method as the default adjustment for multiple comparisons (Holm 1979).

If the Holm-adjusted pp-value for a given pair of groups is smaller than the significance level 11 -conf.level, the green letters displayed below the corresponding box plots will differ. Otherwise, the groups are considered not significantly different.

Apart from the multiple comparison adjustment, the graphical representation of the Kruskal-Wallis result is similar to that used for the Wilcoxon rank sum test.

The function returns a list containing the test statistic from the Kruskal-Wallis rank sum test, along with the Holm-adjusted pp-values for all pairwise comparisons.

Examples

One-way test:

The npk dataset reports the yield of peas (in pounds per block) from an agricultural experiment conducted on six blocks. In this experiment, the application of three different fertilisers - nitrogen (N), phosphate (P), and potassium (K) - was varied systematically. Each block received either none, one, two, or all three of the fertilisers,

oneway_npk <- visstat(npk, "yield", "block")

Normality of residuals is supported by graphical diagnostics (scatter plot of standardized residuals, Q-Q plot) and formal tests (Shapiro-Wilk and Anderson-Darling, both with p>αp > \alpha). However, homogeneity of variances is not supported at the given confidence level (p<αp < \alpha, bartlett.test()), so the pp-value from the variance-robust oneway.test() is reported. Post-hoc analysis with TukeyHSD() shows no significant yield differences between blocks, as all share the same group label (e.g., all green letters).

ANOVA

The InsectSprays dataset reports insect counts from agricultural experimental units treated with six different insecticides. To stabilise the variance in counts, we apply a square root transformation to the response variable.

insect_sprays_tr <- InsectSprays
insect_sprays_tr$count_sqrt <- sqrt(InsectSprays$count)
visstat(insect_sprays_tr, "count_sqrt", "spray")

After the transformation, the homogeneity of variances can be assumed (p>αp> \alpha as calculated with the bartlett.test()), and the pp-value of the aov() is displayed.

Kruskal-Wallis rank sum test

The iris dataset contains petal width measurements (in cm) for three different iris species.

visstat(iris, "Petal.Width", "Species")

In this example, scatter plots of the standardised residuals and the Q-Q plot suggest that the residuals are not normally distributed. This is confirmed by very small pp-values from both the Shapiro-Wilk and Anderson-Darling tests.

If both pp-values are below the significance level α=1\alpha = 1 -conf.level, visstat() switches to the non-parametric kruskal.test(). Post-hoc analysis using pairwise.wilcox.test() shows significant differences in petal width between all three species, as indicated by distinct group labels (all green letters differ).

Simple linear regression

If the feature varfactor and the response varsample have only one level of type numerical or integer, visstat() performs a simple linear regression. Note that multiple linear regression is not implemented, as the main focus of visstat() is the visualisation of statistical tests.

If the feature varfactor and the response varsample contain only one level each and both are of type numeric or integer, visstat() performs a simple linear regression. Note that multiple linear regression is not implemented, as the main focus of visstat() is on the visualisation of statistical tests.

Residual analysis

visstat() checks the normality of the standardised residuals from lm() both graphically and using the Shapiro-Wilk and Anderson-Darling tests. If the pp-values for the null hypothesis of normally distributed residuals from both tests are smaller than 11 -conf.int, the title of the residual plot will display the message: “Requirement of normally distributed residuals not met”.

Regardless of the result of the residual analysis, visstat() proceeds to perform the regression. The title of the graphical output indicates the chosen confidence level (conf.level), the estimated regression parameters with their confidence intervals and pp-values, and the adjusted R2R^2. The plot displays the raw data, the fitted regression line, and both the confidence and prediction bands corresponding to the specified conf.level.

visstat() returns a list containing the regression test statistics, the pp-values from the normality tests of the standardised residuals, and the pointwise estimates of the confidence and prediction bands.

Examples

Data set: cars

The cars dataset reports the speed of cars in miles per hour (speed) and the stopping distance in feet (dist).

linreg_cars <- visstat(cars, "dist", "speed")

Increasing the confidence level conf.level from the default 0.95 to 0.99 results in wider confidence and prediction bands:

linreg_cars <- visstat(cars, "dist", "speed", conf.level = 0.99)

pp-values greater than conf.level in both Anderson-Darling normality test and the Shapiro-Wilk test of the standardised residuals indicate that the normality assumption of the residuals underlying the linear regression is met.

linreg_trees <- visstat(trees, "Volume", "Girth", conf.level = 0.9)

Data set: trees

The trees dataset provides measurements of the diameter (Girth, in inches), Height (in feet), and Volume (in cubic feet) of black cherry trees. In this example, we choose Volume as the response and Girth as the feature.

linreg_cars <- visstat(trees, "Volume", "Girth", conf.level = 0.9)

Both the graphical analysis of the standardised residuals and pp-values greater than 11 -conf.level in the Anderson-Darling and Shapiro-Wilk tests suggest that the assumption of normally distributed residuals is met. Furthermore, the linear regression model explains 93% of the total variance in the response variable Volume.

χ2{\chi}^2- and Fisher Test

When both the feature varfactor and the response varsample are categorical variables (i.e., of type factor), visstat() tests the null hypothesis that the two variables are independent. Categorical data are typically represented as contingency tables, which cross-tabulate the observed frequencies for each combination of factor levels. Based on these observed frequencies, visstat() computes the expected frequencies under the null hypothesis of independence.

If the expected frequencies are sufficiently large - specifically, if at least 80% of the cells have expected counts greater than 5 and no expected count is zero - the function uses Pearson’s χ2{\chi}^2-test (chisq.test()). Otherwise, it switches to Fisher’s exact test (fisher.test()) (Cochran 1954). For 2-by-2 contingency tables, Yate’s a continuity correction (Yates 1934) is applied to Pearson’s χ2{\chi}^2-test.

For both χ2{\chi}^2- and Fisher Test ‘visstat()’ displays a grouped column plot that includes the respective test’s pp-value in the title, as well as a mosaic plot showing Pearson residuals (see the documentation of mosaic() in the vcd package for details).

All features of visstat() will be illustrated using the built-in data set HairEyeColor.

Data preparation: Transforming a contingency table to data.frame

visstat() requires input in the form of a column-based data.frame. Contingency tables can be converted to this format using the helper function counts_to_cases().

counts_to_cases() transforms the contingency table HairEyeColor into data.frame named HairEyeColorDataFrame.

HairEyeColorDataFrame <- counts_to_cases(as.data.frame(HairEyeColor))

Pearson’s χ2{\chi}^2-test

hair_eye_color_df <- counts_to_cases(as.data.frame(HairEyeColor))
visstat(hair_eye_color_df, "Hair", "Eye")

Pearson’s χ2{\chi}^2-test with Yate’s continuity correction

For 2-by-2 contingency tables, visstat() applies Yate’s continuity correction (Yates 1934) when computing the χ2\chi^2 test statistic. In the following example, we restrict the data to participants with either black or brown hair and either brown or blue eyes, resulting in a 2-by-2 contingency table.

hair_black_brown_eyes_brown_blue <- HairEyeColor[1:2, 1:2, ]
# Transform to data frame
hair_black_brown_eyes_brown_blue_df <- counts_to_cases(as.data.frame(hair_black_brown_eyes_brown_blue))
# Chi-squared test
visstat(hair_black_brown_eyes_brown_blue_df, "Hair", "Eye")

Fisher’s exact test and mosaic plot with Pearson residuals

Again, we extract a 2-by-2 contingency table from the full dataset, this time keeping only male participants with black or brown hair and hazel or green eyes.

Pearson’s χ2{\chi}^2 test applied to this table would yield an expected frequency less than 5 in one of the four cells (25% of all cells), which violates the requirement that at least 80% of the expected frequencies must be 5 or greater (Cochran 1954).

Therefore, visstat() automatically selects Fisher’s exact test instead.

hair_eye_color_male <- HairEyeColor[, , 1]
# Slice out a 2 by 2 contingency table
black_brown_hazel_green_male <- hair_eye_color_male[1:2, 3:4]
# Transform to data frame
black_brown_hazel_green_male <- counts_to_cases(as.data.frame(black_brown_hazel_green_male))
# Fisher test
fisher_stats <- visstat(black_brown_hazel_green_male, "Hair", "Eye")

Saving the graphical output

The generated graphics can be saved in any file format supported by Cairo(), including “png”, “jpeg”, “pdf”, “svg”, “ps”, and “tiff”. In the following example, we store the graphics in PNG format in the plotDirectory tempdir(). The file names reflect the statistical test used and the variable names involved.

visstat(black_brown_hazel_green_male, "Hair", "Eye",
  graphicsoutput = "png", plotDirectory =

    tempdir()
)

Remove the graphical output from plotDirectory:

file.remove(file.path(tempdir(), "chi_squared_or_fisher_Hair_Eye.png"))
#> [1] TRUE
file.remove(file.path(tempdir(), "mosaic_complete_Hair_Eye.png"))
#> [1] TRUE

Implemented tests

Numerical response ~ categorical feature

When the response is numerical and the feature is categorical, test of central tendencies are selected:

t.test(), wilcox.test(), aov(), oneway.test(),kruskal.test()

Normality assumption check

shapiro.test() and ad.test()

Homoscedasticity assumption check

bartlett.test()

Numerical response ~ numerical feature

When both the response and feature are numerical, a simple linear regression model is fitted:

lm()

Categorical response ~ categorical predictor

When both variables are categorical, visstat() tests the null hypothesis of independence using one of the following:

Bibliography

Cochran, William G. and. 1954. The combination of estimates from different experiments.” Scandinavian Journal of Statistics 10 (1): 101–20. https://doi.org/10.2307/3001666.
Delacre, M, D Lakens, C Leys International Review of Social, and 2017. 2017. Why Psychologists Should by Default Use Welch’s t-test Instead of Student’s t-test.” International Review of Social Psychology. https://doi.org/10.5334/irsp.82.
Holm, S. 1979. A Simple Sequentially Rejective Multiple Test Procedure.” Scandinavian Journal of Statistics 6: 65–70.
Lumley, Thomas, Paula Diehr, Scott Emerson, and Lu Chen. 2002. The importance of the normality assumption in large public health data sets. Annual Review of Public Health 23: 151–69. https://doi.org/10.1146/annurev.publhealth.23.100901.140546.
Moser, B K, and GR Stevens. 1992. Homogeneity of variance in the two-sample means test.” The American Statistician, February, 19–21. https://doi.org/10.1080/00031305.1992.10475839.
Rasch, Dieter, Klaus D Kubinger, and Karl Moder. 2011. The two-sample t test: pre-testing its assumptions does not pay off.” Statistical Papers 52 (1): 219–31. https://doi.org/10.1007/s00362-009-0224-x.
Šidák, Zbyněk. 1967. Rectangular Confidence Regions for the Means of Multivariate Normal Distributions.” Journal of the American Statistical Association 62 (318): 626–33. https://doi.org/10.1080/01621459.1967.10482935.
Yates, F. 1934. “Contingency Tables Involving Small Numbers and the Χ2 Test.” Supplement to the Journal of the Royal Statistical Society 1 (2): 217–35. https://doi.org/10.2307/2983604.