
Abstract

visStatistics automatically selects and visualises appropriate statistical tests between two column vectors of class "numeric", "integer", or "factor". No manual test selection is required: the function determines the appropriate method from the data types and distributional properties alone. The choice of test depends on the class, distributional assumptions, and sample size of the vectors.

The main function visstat() visualises the selected test with appropriate graphs (box plots, bar charts, regression lines with confidence bands, mosaic plots for Pearson’s χ² test, residual plots), annotated with the main test results, including visualisations of the assumption checks and post-hoc analyses.

This scripted workflow is well suited for browser-based applications, where users interact only through a web interface, while server-side R applications handle the data processing.

Other typical use cases include quick visualisations and guided statistical test selection, for example in statistical consulting or educational settings.

Introduction

Most routine data analyses reduce to a comparatively small set of inferential frameworks, including group comparisons, contingency-table analyses, and regression models (Sato et al. 2017; Chicco et al. 2025). visStatistics selects the “right test” out of these most commonly used tests through a reproducible decision framework based on the distributional properties and sample size of the input data. It visualises its decision with, where appropriate, an assumption-diagnostic plot and a descriptive plot annotated with the main test statistics, and returns an R object whose print() and summary() methods expose the complete test results. The scripted workflow is well suited for browser-based applications where sensitive data (such as highly confidential medical records) is stored securely on a server and cannot be accessed directly by users. This approach has already been applied successfully to develop a medical scoring tool (Bijlenga et al. 2017). A package with similar scope, compareGroups (Subirana et al. 2014), also automates test choice by evaluating normality per group. In contrast, visStatistics assesses the normality of the overall model residuals, adhering to the general linear model framework (Searle 1971). This avoids the loss of statistical power inherent in group-wise testing, which often leads to the unnecessary rejection of parametric methods.

Getting started

1. Install the latest development version from GitHub

# requires devtools; install it first if necessary:
# install.packages("devtools")
devtools::install_github("shhschilling/visStatistics")

2. Load the package

library(visStatistics)

3. Minimal function call

The function visstat() accepts input in three ways:

# Standardised form (recommended):
visstat(x, y)

# Formula interface:
visstat(y ~ x, data = dataframe)

# Backward-compatible form:
visstat(dataframe, "namey", "namex")

x and y must be vectors of class "numeric", "integer", or "factor".

In the formula interface, y ~ x specifies the relationship, where y is the response variable and x is the predictor, with both being column names in dataframe.

In the backward-compatible form, "namex" and "namey" must be character strings naming columns in dataframe, which must themselves be of class "numeric", "integer", or "factor". Note that in the backward-compatible form the second argument belongs to the response, the third to the predictor. This is equivalent to writing:

visstat(dataframe[["namex"]], dataframe[["namey"]])
Exemplary function calls:

# Standardised form
visstat(npk$block, npk$yield)
# Formula interface
visstat(yield ~ block, data = npk)
# Backward-compatible form
visstat(npk, "yield", "block")

Decision logic

Throughout the remainder, data of class "numeric" or "integer" are referred to by their common mode numeric, while data of class "factor" are referred to as categorical. The significance level α, used throughout for hypothesis testing, is defined as 1 - conf.level, where conf.level is a user-controllable argument (defaulting to 0.95).

The choice of statistical tests performed by the function visstat() depends on whether the data are numeric or categorical, the number of levels in the categorical variable, the distribution of the data, and the user-defined conf.level. The common mathematical framework underlying Student’s t-test, Fisher’s ANOVA, and simple linear regression is described in Appendix A.

The function prioritises interpretable visual output and tests that remain valid under the decision logic described below. The following graph gives an overview of all implemented tests based on the class of the input variables:


Overview of the implemented statistical tests based on the class of the variables.

A graphical summary of the decision logic used for numerical responses and categorical predictors resulting in comparisons of central tendencies is given in the figure below.


Decision tree used to select the appropriate statistical test for a categorical predictor and numeric response, based on the sample sizes, number of factor levels, normality, and homoscedasticity.

Numeric response and categorical predictor: Comparing central tendencies

When the response y is numeric and the predictor x is categorical, a statistical hypothesis test comparing central tendencies is selected. To check whether the assumptions of a general linear model (see Appendix A) are fulfilled, a linear model lm(y ~ x) is first fitted.

  • Normality testing of the residuals: The Shapiro–Wilk test (Shapiro and Wilk 1965) (shapiro.test()) is performed on internally studentized residuals computed by rstandard(), which scales the raw residuals with an estimate of their standard deviation that accounts for the leverage of each observation (Cook and Weisberg 1982). The Shapiro–Wilk (SW) test is the most powerful for detecting non-normality across most distributions, especially with smaller sample sizes (Razali and Wah 2011; Ghasemi and Zahediasl 2012).

  • Non-parametric tests: Only when the SW test rejects normality (p_SW ≤ α) are non-parametric tests selected: the Wilcoxon rank-sum test (wilcox.test()) for two groups, or the Kruskal–Wallis test (kruskal.test()) followed by Holm-adjusted pairwise Wilcoxon tests (pairwise.wilcox.test()) for more than two groups.

  • Parametric tests: When there is insufficient evidence against normality (p_SW > α) or when the sample sizes are large (more than 50 observations per group), parametric tests are selected. In large samples the central limit theorem ensures approximate normality of the sampling distribution of the mean (Rasch et al. 2011; Lumley et al. 2002; Kwak and Kim 2017).

    • Homoscedasticity: The Levene–Brown–Forsythe (L) test (implemented as levene.test()) (Brown and Forsythe 1974) tests for homogeneous variances.
  • Regardless of sample size, assumption diagnostics are always displayed. Throughout this vignette, linear model refers to the classical Gaussian linear model framework (Searle 1971) underlying t-tests, ANOVA, and linear regression. When normality of residuals is met but homoscedasticity is rejected, visstat() selects Welch methods and additionally displays group-wise normality diagnostics via vis_group_normality(): Welch’s variance estimates are group-specific, so the pooled lm() residuals are no longer the appropriate diagnostic under heteroscedasticity. These diagnostic plots enable users to visually assess whether the assumptions hold and to manually override the automated p-value-based test selection.
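
This branch of the decision logic can be sketched in a few lines of base R. This is a hypothetical, simplified re-implementation for illustration only: bartlett.test() stands in for the package’s standalone levene.test(), and the large-sample rule and all plotting are omitted.

```r
# Simplified sketch of the decision branch (illustration only)
fit  <- lm(Petal.Width ~ Species, data = iris)
p_sw <- shapiro.test(rstandard(fit))$p.value  # normality of studentised residuals
alpha <- 0.05                                 # alpha = 1 - conf.level

if (p_sw <= alpha) {
  # normality rejected -> non-parametric branch
  res <- kruskal.test(Petal.Width ~ Species, data = iris)
} else {
  # parametric branch: check homoscedasticity
  # (bartlett.test() as a stand-in for the package's levene.test())
  p_var <- bartlett.test(Petal.Width ~ Species, data = iris)$p.value
  res <- if (p_var > alpha) {
    summary(aov(Petal.Width ~ Species, data = iris))
  } else {
    oneway.test(Petal.Width ~ Species, data = iris)
  }
}
res
```

For the iris data the Shapiro–Wilk test rejects normality, so the sketch ends in kruskal.test(), matching the worked example later in this vignette.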

Both variables numeric: Simple linear regression (default) or Spearman rank correlation (correlation = TRUE)

Simple linear regression (lm())

By default, visstat() always fits a simple linear regression model (lm()) for two numeric variables, regardless of whether the general linear model assumptions of normality and homoscedasticity are met. Assumption diagnostics are always shown, enabling the user to assess whether the model is appropriate. Correlation analysis is never triggered automatically; it requires the explicit flag correlation = TRUE. Note that only one predictor variable is allowed, as the function is designed for two-dimensional visualisation.

In the linear regression branch, homoscedasticity is assessed using the Breusch–Pagan test bp.test() (see Appendix B), which evaluates whether the variance of the raw residuals depends on the predictor variable.

Spearman rank correlation (correlation = TRUE)

When correlation = TRUE is set, visstat() uses Spearman’s ρ (cor.test(..., method = "spearman")) to measure the monotone association between the two numeric variables. Switching to correlation requires an explicit user choice, because the decision between modelling a directional relationship (regression) and measuring a monotone association (correlation) cannot be derived from the data type alone. Spearman correlation operates on the ranks of the data rather than the original values, making it robust to outliers and non-normal distributions while detecting monotonic relationships. Because Spearman’s ρ is Pearson’s r applied to the ranks, it yields nearly identical results to Pearson correlation when the data are bivariate normal (the assumption of Pearson’s inference) but remains valid without any distributional assumptions. A separate Pearson branch is therefore not implemented. When regression assumptions are violated, Spearman rank correlation offers a robust, assumption-free alternative, at the cost of causal interpretability. For alternatives not implemented in this package, see Limitations.
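
The stated identity, Spearman’s ρ equals Pearson’s r computed on the ranks, can be verified directly in base R on simulated tie-free data:

```r
set.seed(1)
x <- rexp(30)                   # continuous, hence tie-free
y <- x^2 + rnorm(30, sd = 0.5)  # monotone but non-linear in x

rho_spearman <- cor(x, y, method = "spearman")
rho_on_ranks <- cor(rank(x), rank(y))  # Pearson's r applied to the ranks
all.equal(rho_spearman, rho_on_ranks)  # TRUE
```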

Both variables of class factor

Comparing proportions (Chi-squared or Fisher’s exact test)

When both variables are categorical (and not ordered), no direction is assumed; the order of variables in the function call does not affect the test statistic, but it does influence the graphical output. For consistency, we continue referring to the variables as predictor and response.

visstat() tests the null hypothesis that the variables are independent using either Pearson’s χ² test (chisq.test()) or Fisher’s exact test (fisher.test()), depending on expected cell counts. The choice of test is based on Cochran’s rule (Cochran 1954), which advises that the χ² approximation is reliable only if no expected cell count is less than 1 and no more than 20 percent of cells have expected counts below 5.
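
Cochran’s rule can be expressed as a small helper; cochran_ok() is a hypothetical illustration, not a function exported by the package:

```r
# Decide between chisq.test() and fisher.test() via Cochran's rule
cochran_ok <- function(tab) {
  expected <- suppressWarnings(chisq.test(tab))$expected  # counts under independence
  all(expected >= 1) && mean(expected < 5) <= 0.2
}

tab <- table(mtcars$am, mtcars$cyl)
chosen <- if (cochran_ok(tab)) chisq.test(tab) else fisher.test(tab)
chosen$method
```

For this 2 × 3 table half of the expected counts fall below 5, so Fisher’s exact test is selected.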

Response of class ordered, predictor of class factor

When the response variable is an ordered factor (e.g., Likert scale ratings) and the predictor is categorical, visstat() converts the ordered response to integer level codes and redirects the analysis to the non-parametric wilcox.test() (predictor with two levels) or kruskal.test() (predictor with more than two levels), as the numeric distances between ordered categories are not necessarily equal and the data cannot be assumed to follow a normal distribution.

Both variables of class factor and ordered: Wilcoxon/Kruskal–Wallis (default) or Kendall’s τ_b (correlation = TRUE)

When correlation = FALSE (default) and both variables are ordered factors, the response is converted to integer level codes and the analysis follows the standard non-parametric path (Wilcoxon or Kruskal–Wallis). When the research question concerns a monotone trend between both ordered variables rather than group differences, the user should consider switching to correlation = TRUE.

Kendall rank correlation (correlation = TRUE)

When correlation = TRUE is set, visstat() converts both ordered factors to integer level codes, which are used directly as ranks, and tests for a monotone association via Kendall’s τ_b rank correlation (Kendall 1945; Agresti 2010). Kendall’s τ_b is preferred over Spearman’s ρ for ordinal data with few levels, where ties are common: τ_b corrects for ties explicitly, while Spearman’s ρ uses only an approximate adjustment (Xu et al. 2013). The visualisation is a jittered rank–rank scatter plot that makes the monotone trend visible, annotated with τ_b and the p-value.
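
A minimal base-R sketch of this branch on simulated ordinal data (the variable names are illustrative; visstat() additionally draws the jittered rank–rank scatter plot):

```r
set.seed(42)
x_num <- sample(1:4, 40, replace = TRUE)                 # 4-level ordinal predictor
y_num <- pmin(4L, x_num + sample(0:1, 40, replace = TRUE))
x <- ordered(x_num, levels = 1:4)
y <- ordered(y_num, levels = 1:4)

# convert ordered factors to integer level codes, then test Kendall's tau
kt <- suppressWarnings(
  cor.test(as.integer(x), as.integer(y), method = "kendall")
)
kt$estimate  # tau estimate (tie-handling per cor.test())
```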

Assumption diagnostics: vis_lm_assumptions()

All tests within the linear model framework (see Appendix A) share the same set of assumptions:

  • Linearity: The expected value of the response is a linear function of the predictors; assessed by checking for systematic patterns in residual plots.

  • Error terms are independent and normally distributed with expected value 0 and constant error variance σ².

The function vis_lm_assumptions() provides a unified visual and statistical framework for validating the linear model assumptions. It fits a linear model using aov() and provides four standard diagnostic plots:

  1. Histogram and Normal Density: Displays the distribution of standardized residuals with a red normal density curve overlay to visually assess normality.

  2. Std. Residuals vs. Fitted: Shows the standardized residuals against the fitted values with a zero line to detect non-linearity or heteroscedasticity.

  3. Normal Q-Q Plot: A theoretical quantile-quantile plot using the standardized residuals to identify deviations from the Gaussian distribution.

  4. Leverage or Scale-Location: If correlation = FALSE (regression mode), a Standardized Residuals vs. Leverage plot; if correlation = TRUE (correlation mode), a Scale-Location plot.

Test for normality and homoscedasticity

The diagnostic plots are enhanced by p-values of tests for normality and homoscedasticity, shown in the title of the plot.

Normality Assessment: The function evaluates residual normality using both the Shapiro–Wilk test (shapiro.test()) and the Anderson–Darling test (ad.test()) (Gross and Ligges 2015). Anderson–Darling is particularly sensitive to tail deviations (Razali and Wah 2011; Yap and Sim 2011).

Homoscedasticity Assessment: Variance equality is tested using the Levene-Brown-Forsythe test (levene.test()) (Brown and Forsythe 1974) and Bartlett’s test (bartlett.test()) in the ANOVA path.

These tests compare the spread of data across different factor levels.

For simple linear regression, the Breusch-Pagan test (bp.test()) is implemented.

These three tests have standalone implementations in the package to avoid dependencies on external packages (see Appendix B).
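
A minimal sketch of the idea behind the (studentised) Breusch-Pagan test in base R; the package’s standalone bp.test() may differ in implementation details:

```r
fit <- lm(Volume ~ Girth, data = trees)
e2  <- residuals(fit)^2              # squared residuals
aux <- lm(e2 ~ Girth, data = trees)  # auxiliary regression on the predictor

bp_stat <- nrow(trees) * summary(aux)$r.squared  # LM statistic: n * R^2
bp_p    <- pchisq(bp_stat, df = 1, lower.tail = FALSE)
round(c(statistic = bp_stat, p.value = bp_p), 4)
```

A small p-value indicates that the residual variance depends on the predictor, i.e. heteroscedasticity.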

Implemented tests with examples

For all implemented tests, this section provides mathematical background and references. We report the definition of the test statistic, its distribution under the null hypothesis, and any relevant assumptions.

Numeric response and categorical predictor: Comparing central tendencies

Categorical predictor with two levels: Welch’s t-test and Wilcoxon rank-sum

Student’s t-test (t.test(var.equal = TRUE))

When var.equal = TRUE, R’s t.test() reports its method as "Two Sample t-test"; this is the same procedure that is commonly known as Student’s t-test, and visstat() displays the R wording verbatim in the plot title.

The test statistic for Student’s t-test is given by:

t = \frac{\bar{x}_1 - \bar{x}_2}{s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}},

where \bar{x}_1 and \bar{x}_2 are the sample means, n_1 and n_2 are the sample sizes, and s_p is the pooled standard deviation:

s_p = \sqrt{\frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}},

where s_1^2 and s_2^2 are the sample variances in the two groups. The test statistic follows a t-distribution with ν = n_1 + n_2 - 2 degrees of freedom.
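
The two formulas can be verified against t.test(var.equal = TRUE) using the mtcars data (also used in the examples below):

```r
x1 <- mtcars$mpg[mtcars$am == 0]  # automatic transmission
x2 <- mtcars$mpg[mtcars$am == 1]  # manual transmission
n1 <- length(x1); n2 <- length(x2)

# pooled standard deviation and t statistic as defined above
s_p <- sqrt(((n1 - 1) * var(x1) + (n2 - 1) * var(x2)) / (n1 + n2 - 2))
t_manual <- (mean(x1) - mean(x2)) / (s_p * sqrt(1 / n1 + 1 / n2))

t_r <- unname(t.test(x1, x2, var.equal = TRUE)$statistic)
all.equal(t_manual, t_r)  # TRUE
```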

Welch’s t-test (t.test())

Welch’s t-test relaxes the homoscedasticity assumption while maintaining the requirements for independent observations and normally distributed residuals. It evaluates the null hypothesis that the means of two groups are equal without assuming equal variances.

The test statistic is given by (Welch 1947; Satterthwaite 1946)

t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}},

where \bar{x}_1 and \bar{x}_2 are the sample means, s_1^2 and s_2^2 the sample variances, and n_1, n_2 the sample sizes in the two groups. The statistic follows a t-distribution with degrees of freedom approximated by the Welch–Satterthwaite equation:

\nu \approx \frac{ \left( \frac{s_1^2}{n_1} + \frac{s_2^2}{n_2} \right)^2 }{ \frac{(s_1^2 / n_1)^2}{n_1 - 1} + \frac{(s_2^2 / n_2)^2}{n_2 - 1} }.

The resulting p-value is computed from the t-distribution with ν degrees of freedom.
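
Both the statistic and the Welch–Satterthwaite degrees of freedom can likewise be reproduced manually (Welch’s test is R’s default for t.test()):

```r
x1 <- mtcars$mpg[mtcars$am == 0]
x2 <- mtcars$mpg[mtcars$am == 1]
v1 <- var(x1) / length(x1)  # s_1^2 / n_1
v2 <- var(x2) / length(x2)  # s_2^2 / n_2

t_w  <- (mean(x1) - mean(x2)) / sqrt(v1 + v2)
df_w <- (v1 + v2)^2 /
  (v1^2 / (length(x1) - 1) + v2^2 / (length(x2) - 1))

tt <- t.test(x1, x2)  # Welch by default (var.equal = FALSE)
all.equal(c(t_w, df_w), unname(c(tt$statistic, tt$parameter)))  # TRUE
```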

Wilcoxon rank-sum test (wilcox.test())

The two-sample Wilcoxon rank-sum test (also known as the Mann-Whitney test) is a non-parametric alternative that does not require the response variable to be approximately normally distributed within each group. It tests for a difference in location between two independent distributions (Mann and Whitney 1947). If the two groups have distributions that are sufficiently similar in shape and scale, the Wilcoxon rank-sum test can be interpreted as testing whether the medians of the two populations are equal (Hollander et al. 2014).

The two-level factor variable x defines two groups, with sample sizes n_1 and n_2. All N = n_1 + n_2 observations are pooled and assigned ranks from 1 to N. Let W_1 denote the sum of the ranks assigned to the group corresponding to the first level of x, containing n_1 observations:

W_1 = \sum_{i=1}^{n_1} R(x_{1,i}),

where R(x_{1,i}) is the rank of observation x_{1,i} in the pooled sample.

The test statistic W returned by wilcox.test() is then computed as

W=U1=W1n1(n1+1)2.W =U_{1}=W_{1} - \frac{n_1(n_1 + 1)}{2}. It corresponds to the Mann-Whitney (Mann and Whitney 1947) UU statistic of the first group.

If both groups contain fewer than 50 observations and the data contain no ties, the p-value is computed exactly. Otherwise, a normal approximation with continuity correction is used.
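
The relation W = U_1 = W_1 - n_1(n_1 + 1)/2 can be checked on a small tie-free example:

```r
x1 <- c(1.1, 2.3, 3.5, 4.2)       # first group (n1 = 4)
x2 <- c(2.0, 3.1, 5.4, 6.6, 7.0)  # second group (n2 = 5)
r  <- rank(c(x1, x2))             # ranks in the pooled sample
n1 <- length(x1)

W_manual <- sum(r[seq_len(n1)]) - n1 * (n1 + 1) / 2
W_r <- unname(wilcox.test(x1, x2)$statistic)  # exact test: no ties, small n
c(W_manual = W_manual, W_r = W_r)  # both 5
```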

The function returns a list containing the results of the applied test and the summary statistics used to construct the plot.

Examples

Welch’s t-test

The Motor Trend Car Road Tests dataset (mtcars) contains 32 observations, where mpg denotes miles per (US) gallon, and am represents the transmission type (0 = automatic, 1 = manual).

mtcars$am <- as.factor(mtcars$am)
t_test_statistics <- visstat(mtcars$am, mtcars$mpg)

Wilcoxon rank sum test

The Wilcoxon rank-sum test is illustrated by comparing the central tendencies of the grades of “boys” and “girls” in a class:

grades_gender <- data.frame(
  sex = as.factor(c(rep("girl", 21), rep("boy", 23))),
  grade = c(
    19.3, 18.1, 15.2, 18.3, 7.9, 6.2, 19.4,
    20.3, 9.3, 11.3, 18.2, 17.5, 10.2, 20.1, 13.3, 17.2, 15.1, 16.2, 17.0,
    16.5, 5.1, 15.3, 17.1, 14.8, 15.4, 14.4, 7.5, 15.5, 6.0, 17.4,
    7.3, 14.3, 13.5, 8.0, 19.5, 13.4, 17.9, 17.7, 16.4, 15.6, 17.3, 19.9, 4.4, 2.1
  )
)

wilcoxon_statistics <- visstat(grades_gender$sex, grades_gender$grade)

Wilcoxon with ordinal response
set.seed(123)

# Create predictor: Customer segment (2 groups)
segment <- factor(rep(c("Budget", "Premium"), each = 50))

# Create response: Likert scale ratings (1-5)
satisfaction_numeric <- c(
  sample(1:5, 50, replace = TRUE, prob = c(0.15, 0.25, 0.30, 0.20, 0.10)),  # Budget
  sample(1:5, 50, replace = TRUE, prob = c(0.05, 0.10, 0.20, 0.35, 0.30))   # Premium
)

# Create dataframe with ORDERED response
survey_data <- data.frame(
  segment = segment,
  satisfaction = ordered(satisfaction_numeric)  # Declare as ordered
)

# triggers a warning and uses the Wilcoxon test
wilcox_ordered <- visstat(survey_data, "satisfaction", "segment")
## Warning: Ordered response detected. Converting to integer level codes for
## non-parametric analysis.

Categorical predictor with more than two levels

Fisher’s one-way ANOVA (aov())

Fisher’s one-way ANOVA (aov()) tests the null hypothesis that the means of k groups are equal.

As a linear model, it assumes independent observations, normally distributed residuals, and homogeneous variances across groups. The test statistic is the ratio of the variance explained by differences among group means (between-group variance) to the unexplained variance within groups (Fisher and Yates 1990)

F = \frac{MS_{between}}{MS_{within}} = \frac{SS_{between}/(k-1)}{SS_{within}/(N-k)} = \frac{\sum_{i=1}^{k} n_i (\bar{x}_i - \bar{x})^2 / (k - 1)}{\sum_{i=1}^{k}\sum_{j=1}^{n_i}(x_{ij}-\bar{x}_i)^2 / (N - k)},

where:

  • MS_{between} and MS_{within} are the mean squares between and within groups, respectively,

  • SS_{between} is the sum of squares between groups (variance due to group differences) and SS_{within} the sum of squares within groups (error variance),

  • \bar{x}_i is the mean of group i, \bar{x} the overall mean, and x_{ij} observation j in group i,

  • n_i is the sample size in group i, k the number of groups, and N the total number of observations.

Under the null hypothesis, this statistic follows an F-distribution with (k - 1) and (N - k) degrees of freedom, F ~ F(k-1, N-k). The resulting p-value is computed from this distribution.
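
The F statistic can be reproduced step by step for the npk data used in the examples below and compared with aov():

```r
y <- npk$yield; g <- npk$block
k <- nlevels(g); N <- length(y)

# between- and within-group sums of squares as defined above
ss_between <- sum(tapply(y, g, function(v) length(v) * (mean(v) - mean(y))^2))
ss_within  <- sum(tapply(y, g, function(v) sum((v - mean(v))^2)))
F_manual   <- (ss_between / (k - 1)) / (ss_within / (N - k))

F_aov <- summary(aov(yield ~ block, data = npk))[[1]][1, "F value"]
all.equal(F_manual, F_aov)  # TRUE
```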

Welch’s heteroscedastic one-way ANOVA (oneway.test())

When only the assumptions of independent observations and normally distributed residuals are met, but homogeneous variances across groups cannot be assumed, Welch’s heteroscedastic one-way ANOVA (oneway.test()) (Welch 1951) provides an alternative to aov(). It compares group means using weights based on sample sizes and variances. The degrees of freedom are adjusted using a Satterthwaite-type approximation (Satterthwaite 1946), resulting in an F-statistic with non-integer degrees of freedom. The Welch F-statistic is calculated as (Welch 1951):

F_W = \frac{\sum_{i=1}^{k} w_i (\bar{y}_i - \bar{y}_w)^2 / (k-1)}{1 + \frac{2(k-2)}{k^2-1} \sum_{i=1}^{k} \frac{(1-w_i/w)^2}{n_i-1}},

where w_i = n_i / s_i^2 are the weights (inverse variances), w = \sum_{i=1}^{k} w_i, \bar{y}_w = \sum_{i=1}^{k} w_i \bar{y}_i / w is the weighted grand mean, k is the number of groups, n_i the sample size of group i, and s_i^2 the variance of group i.

Numerical relationships within the parametric tests defined by the decision logic above (including the identity t² = F in the equal-variance two-group case) are summarised in Appendix A.
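
The identity t² = F in the equal-variance two-group case can be checked directly:

```r
am <- factor(mtcars$am)  # two groups: automatic vs manual
t_sq <- unname(t.test(mtcars$mpg ~ am, var.equal = TRUE)$statistic)^2
F_2g <- summary(aov(mtcars$mpg ~ am))[[1]][1, "F value"]
all.equal(t_sq, F_2g)  # TRUE
```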

Kruskal–Wallis test (kruskal.test())

When the assumption of normality is not met, the Kruskal–Wallis test provides a non-parametric alternative. It compares group distributions based on ranked values and tests the null hypothesis that the groups come from the same population — specifically, that the distributions have the same location (Kruskal and Wallis 1952). If the group distributions are sufficiently similar in shape and scale, then the Kruskal–Wallis test can be interpreted as testing for equality of medians across groups (Hollander et al. 2014).

The test statistic is defined as:

H = \frac{12}{N(N+1)} \sum_{i=1}^{k} n_i \left(\bar{R}_i - \bar{R} \right)^2,

where n_i is the sample size in group i, k the number of groups, \bar{R}_i the average rank of group i, N the total sample size, and \bar{R} = (N+1)/2 the average of all ranks. Under the null hypothesis, H approximately follows a χ² distribution with k - 1 degrees of freedom.
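
The H statistic can be reproduced manually on tie-free simulated data; with ties, kruskal.test() additionally applies a tie correction not shown in the formula above:

```r
set.seed(1)
y <- rnorm(30)  # continuous, hence tie-free
g <- gl(3, 10)  # three groups of 10
r <- rank(y); N <- length(y)

R_bar_i <- tapply(r, g, mean)  # average rank per group
n_i     <- tabulate(g)
H_manual <- 12 / (N * (N + 1)) * sum(n_i * (R_bar_i - (N + 1) / 2)^2)

H_r <- unname(kruskal.test(y, g)$statistic)
all.equal(H_manual, H_r)  # TRUE
```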

Post-hoc analysis

ANOVA, Welch ANOVA, and Kruskal–Wallis are omnibus tests: a significant result tells us that some group differs, but not which. To identify the differing pairs we test all

M = \frac{n(n - 1)}{2}

pairwise comparisons among the n factor levels, defining a family of tests (Abdi 2007). Without correction, the family-wise error rate (the probability of at least one false rejection across the family) grows quickly with n. Because the three omnibus tests rest on different assumptions, visstat() pairs each with a different post-hoc procedure and returns the corresponding adjusted p-values for every pairwise comparison.
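
The growth of M and of the uncorrected family-wise error rate can be tabulated directly (the error-rate formula assumes M independent tests at α = 0.05):

```r
k    <- 3:6
M    <- choose(k, 2)      # k * (k - 1) / 2 pairwise comparisons
fwer <- 1 - (1 - 0.05)^M  # P(at least one false rejection), independence assumed
rbind(k = k, M = M, FWER = round(fwer, 3))
```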

Following aov(): TukeyHSD()

When the residuals are normal and variances are homogeneous, visstat() follows aov() with Tukey’s Honestly Significant Differences procedure. TukeyHSD controls the family-wise error rate via the studentised range distribution, which exploits the common residual variance shared across pairs (Hochberg and Tamhane 1987). visstat() returns the HSD-adjusted p-values for every pairwise mean comparison.

Following oneway.test(): games.howell()

When the residuals are normal but variances are heterogeneous, the common-variance assumption of TukeyHSD breaks down. visstat() therefore follows Welch’s heteroscedastic oneway.test() with the Games–Howell procedure, which uses separate variance estimates for each pair and Welch-adjusted degrees of freedom (Games and Howell 1976). The returned object contains the Games–Howell-adjusted p-values for every pairwise comparison.

Following kruskal.test(): pairwise.wilcox.test()

When normality is rejected, visstat() follows kruskal.test() with pairwise.wilcox.test(), which compares each pair of factor levels via the Wilcoxon rank-sum test on ranks rather than means. The resulting p-values are adjusted for multiplicity using Holm’s step-down method (Holm 1979): sorted ascending and tested against thresholds that loosen with rank.
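
Holm’s step-down adjustment can be reproduced with p.adjust() on a set of hypothetical sorted p-values:

```r
p_sorted <- c(0.001, 0.012, 0.034, 0.20)  # hypothetical, ascending
m <- length(p_sorted)

# multiply the i-th smallest p-value by (m - i + 1), enforce monotonicity
holm_manual <- pmin(1, cummax((m - seq_len(m) + 1) * p_sorted))

all.equal(holm_manual, p.adjust(p_sorted, method = "holm"))  # TRUE
```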

Graphical output

The graphical output for all tests based on a numeric response and a categorical predictor with more than two levels consists of two panels: the first focuses on the residual analysis, the second on the actual test chosen by the decision logic.

The residual panel addresses the assumption of normality, both graphically and through formal tests. It displays a scatter plot of the standardised residuals versus the predicted values, as well as a normal Q–Q plot comparing the sample quantiles to the theoretical quantiles. If the residuals are normally distributed, no more than 5% of the standardised residuals should exceed 2 in absolute value, and in the Q–Q plot the data points should approximately follow the red straight line.

The p-values of the formal tests for normality (Shapiro–Wilk and Anderson–Darling) as well as of the tests for homoscedasticity (Bartlett’s and Levene–Brown–Forsythe) are given in the title.

visstat() then illustrates, in the subsequent graph, either the kruskal.test(), the oneway.test(), or aov() result (see also Section “Decision logic”). In all three branches the result is shown as box plots with the number of observations per level above each box; the title gives the name of the test that was run and its p-value (and, for the parametric branches, the corresponding F statistic). When the largest group contains no more than 50 observations, individual data points are overlaid as a jittered strip chart, making the raw distribution visible.

In every branch, pairs of groups whose adjusted post-hoc p-value falls below α are marked with different green letters below the box plots; pairs sharing a letter are not significantly different. The letters are produced by multcompLetters() from the multcompView package (Graves et al. 2026).

Besides the graphical output, visstat() returns a list containing the relevant test statistics along with the corresponding post-hoc-adjusted p-values for all pairwise comparisons.

Examples

Fisher’s one-way ANOVA

The npk dataset reports the yield of peas (in pounds per plot) from an agricultural experiment conducted on six blocks. In this experiment, the application of three different fertilisers – nitrogen (N), phosphate (P), and potassium (K) – was varied systematically. Each plot received either none, one, two, or all three of the fertilisers.

anova_npk <- visstat(npk$block, npk$yield, conf.level = 0.95)

Normality of residuals is supported by graphical diagnostics (histogram, scatter plot of standardised residuals, Q-Q plot) and formal tests (Shapiro–Wilk and Anderson–Darling, both with p > α). Homogeneity of variances is not supported at the given confidence level by bartlett.test(), but it is by levene.test() (p > α). The decision logic is based solely on levene.test() and triggers aov(), followed by post-hoc analysis with TukeyHSD(). Note that the omnibus F-test reports p = 0.086, which is not significant at α = 0.05; therefore no pairwise differences are expected in the post-hoc analysis and all groups share the same green letter “a”.

Kruskal–Wallis rank sum test

The iris dataset contains petal width measurements (in cm) for three different iris species.

visstat(iris$Species, iris$Petal.Width)

In this example, scatter plots of the standardised residuals and the Q-Q plot suggest that the residuals are not normally distributed. This is confirmed by very small p-values from both the Shapiro–Wilk and Anderson-Darling tests.

Since the Shapiro–Wilk p-value is below α, visstat() switches to the non-parametric kruskal.test(). Post-hoc analysis using pairwise.wilcox.test() shows significant differences in petal width between all three species, as indicated by distinct group labels (all green letters differ).

Kruskal–Wallis with ordinal response

When the response is of class ordered and the predictor is a categorical factor with more than two levels, visstat() issues a warning, converts the ordered response to integer level codes and routes the analysis to kruskal.test() followed by pairwise.wilcox.test().

set.seed(123)

# Predictor: customer segment (3 groups)
segment <- factor(rep(c("Budget", "Standard", "Premium"), each = 50))

# Response: Likert scale ratings (1-5), with a deliberate trend across segments
comfort_numeric <- c(
  sample(1:5, 50, replace = TRUE, prob = c(0.30, 0.30, 0.20, 0.15, 0.05)),  # Budget
  sample(1:5, 50, replace = TRUE, prob = c(0.10, 0.20, 0.40, 0.20, 0.10)),  # Standard
  sample(1:5, 50, replace = TRUE, prob = c(0.05, 0.10, 0.20, 0.35, 0.30))   # Premium
)

# Dataframe with ORDERED response
survey_data_3 <- data.frame(
  segment = segment,
  comfort = ordered(comfort_numeric)  # Declare as ordered
)

# triggers a warning and uses the Kruskal–Wallis test
kruskal_ordered <- visstat(survey_data_3, "comfort", "segment")
## Warning: Ordered response detected. Converting to integer level codes for
## non-parametric analysis.

Both variables numeric

Simple linear regression (lm())

The regression plot displays the point estimate of the regression line

y = b_0 + b_1 \cdot x,

where y is the response variable, x the predictor variable, b_0 the intercept, and b_1 the slope of the regression line.

Residual analysis

visstat() checks the normality of the standardised residuals from lm() both with diagnostic plots and with the Shapiro–Wilk and Anderson–Darling tests (via vis_lm_assumptions()).

Note that, regardless of the result of the residual analysis, visstat() proceeds to perform the regression. The title of the graphical output indicates the chosen confidence level (conf.level), the estimated regression parameters with their confidence intervals and p-values, and R². The plot displays the raw data, the fitted regression line, and both the confidence and prediction bands corresponding to the specified conf.level.

visstat() returns a list containing the regression test statistics, the p-values from the normality tests of the standardised residuals, and the pointwise estimates of the confidence and prediction bands.

Examples

trees dataset

The trees dataset contains the diameter Girth (in inches) and the Volume (in cubic feet) of 31 black cherry trees.

linreg_trees <- visstat(trees$Girth, trees$Volume, conf.level = 0.9)

## Warning: Statistical assumptions violated:
## Homoscedasticity violated (Breusch-Pagan p = 0.0178 )
## Analysis proceeded but interpret results cautiously.
## RECOMMENDATION: Consider exploring alternatives outside visstat() such as data transformations, generalised linear models, or robust regression. For an assumption-free, non-causal alternative consider rerunning with correlation = TRUE.

p-values greater than $\alpha$ = 1 - conf.level in both the Anderson–Darling and Shapiro–Wilk normality tests of the standardised residuals indicate no significant deviation from normality, supporting the normality assumption underlying the linear regression.

Increasing the confidence level conf.level from 0.9 to 0.99 results in wider confidence intervals for the regression parameters, as well as wider confidence and prediction bands:

linreg_trees_99 <- visstat(trees$Girth, trees$Volume, conf.level = 0.99)

The plot() method for "visstat" objects allows displaying only the second generated plot (the assumption plot is unchanged):

plot(linreg_trees_99, which = 2)

Spearman rank correlation (cor.test(..., method = "spearman"))

For two numeric variables with correlation = TRUE, visstat() calls cor.test(x, y, method = "spearman") to measure the monotonic association between $x$ and $y$ using ranks, evaluating linear dependence on the ranked data rather than on the original scale:

$$\rho = r(\operatorname{rank}(x), \operatorname{rank}(y)),$$

where $r(u, v)$ denotes Pearson’s correlation coefficient:

$$r(u,v) = \frac{\sum_{i=1}^{n}(u_i-\bar u)(v_i-\bar v)}{\sqrt{\sum_{i=1}^{n}(u_i-\bar u)^2}\,\sqrt{\sum_{i=1}^{n}(v_i-\bar v)^2}}.$$

For inference, cor.test(..., method = "spearman") computes an exact p-value for small samples without ties by evaluating all $n!$ rank permutations. For larger samples, or when ties are present, it uses an approximation to the null distribution of the test statistic. No distributional assumption on the original data is required.
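The rank identity above can be checked directly in base R; a minimal sketch with made-up data (the variable names are illustrative, not part of the package):

```r
# Spearman's rho is Pearson's r applied to the ranks (illustrative data)
x <- c(3.1, 5.2, 1.8, 4.4, 2.9, 6.0)
y <- c(2.0, 4.8, 1.1, 3.9, 3.2, 5.5)

rho_builtin <- cor(x, y, method = "spearman")
rho_by_hand <- cor(rank(x), rank(y))   # Pearson's r on the ranks

all.equal(rho_builtin, rho_by_hand)    # TRUE
```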

Example

The airquality dataset contains daily air quality measurements in New York, May to September 1973 (Chambers 2018).

A default call to visstat() fits a simple linear regression model, but triggers the following warning:

result_ozone0 <- visstat(airquality$Wind, airquality$Ozone)
## Warning: Statistical assumptions violated:
## Normality of residuals violated (Shapiro-Wilk p = 0.00522 )
## Homoscedasticity violated (Breusch-Pagan p = 0.00595 )
## Analysis proceeded but interpret results cautiously.
## RECOMMENDATION: Consider exploring alternatives outside visstat() such as data transformations, generalised linear models, or robust regression. For an assumption-free, non-causal alternative consider rerunning with correlation = TRUE.

The warning indicates that both normality of residuals and homoscedasticity are violated (see Assumption diagnostics). It recommends exploring alternatives outside visstat(), such as data transformations, generalised linear models, or robust regression (see Limitations). For an assumption-free, non-causal alternative within visstat(), it suggests rerunning the identical call with the optional flag correlation = TRUE, which triggers a Spearman rank correlation:

result_ozone1 <- visstat(airquality$Wind, airquality$Ozone, correlation = TRUE)

Both variables categorical: Comparing proportions

Observed frequencies are arranged in a contingency table, where rows index the levels $i$ of the response variable and columns index the levels $j$ of the predictor variable. visstat() tests the null hypothesis that the two variables are independent.

Pearson’s $\chi^2$-test (chisq.test())

Let $O_{ij}$ and $E_{ij}$ denote the observed and expected frequencies in row $i$ and column $j$ of an $R \times C$ contingency table. The Pearson residual for each cell is defined as

$$r_{ij} = \frac{O_{ij} - E_{ij}}{\sqrt{E_{ij}}}, \quad i = 1, \ldots, R, \quad j = 1, \ldots, C.$$

The test statistic of Pearson’s $\chi^2$-test (Pearson 1900) is the sum of squared Pearson residuals:

$$\chi^2 = \sum_{i=1}^{R} \sum_{j=1}^{C} r_{ij}^2 = \sum_{i=1}^{R} \sum_{j=1}^{C} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}.$$

The test statistic is compared to the $\chi^2$-distribution with $(R - 1)(C - 1)$ degrees of freedom. The resulting p-value corresponds to the upper tail probability, that is, the probability of observing a value greater than or equal to the test statistic under the null hypothesis.

For general $R \times C$ tables, visstat() supplements the bar chart with a mosaic plot (Meyer et al. 2006, 2024), in which the area of each tile is proportional to the observed cell frequency and tiles are coloured by their Pearson residual $r_{ij}$: blue for positive residuals (observed exceeds expected) and red for negative ones (observed falls short of expected). This makes the cells driving the association immediately visible.
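The relation between the Pearson residuals and the test statistic can be verified against chisq.test(); a sketch with hypothetical counts:

```r
# Rebuild Pearson's chi-squared statistic from the residuals r_ij
# (counts are hypothetical)
tab <- as.table(rbind(c(20, 30, 10), c(15, 25, 40)))
fit <- chisq.test(tab)   # no continuity correction for tables larger than 2x2

E    <- outer(rowSums(tab), colSums(tab)) / sum(tab)  # expected counts
r_ij <- (tab - E) / sqrt(E)                           # Pearson residuals

all.equal(sum(r_ij^2), unname(fit$statistic))  # TRUE
```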

Pearson’s $\chi^2$-test with Yates’ continuity correction

The Pearson $\chi^2$ statistic in $2 \times 2$ contingency tables (which have only one degree of freedom) tends to overestimate the significance of the test.

To correct for this, Yates proposed subtracting 0.5 from each absolute difference between observed and expected counts (Yates 1934), resulting in a smaller test statistic:

$$\chi^2_{\text{Yates}} = \sum_{i=1}^{2} \sum_{j=1}^{2} \frac{(|O_{ij} - E_{ij}| - 0.5)^2}{E_{ij}}.$$

This reduced test statistic yields a larger p-value, thereby lowering the risk of a Type I error.

Yates’ continuity correction is applied by default to $2 \times 2$ contingency tables (one degree of freedom) by the underlying routine chisq.test().
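The effect of the correction can be seen by running both variants on the same hypothetical 2 x 2 table:

```r
# Yates' correction yields a smaller statistic and a larger p-value
tab2  <- matrix(c(12, 5, 7, 14), nrow = 2)  # hypothetical 2 x 2 counts
yates <- chisq.test(tab2)                   # correct = TRUE is the default
plain <- chisq.test(tab2, correct = FALSE)

yates$statistic < plain$statistic  # TRUE
yates$p.value   > plain$p.value    # TRUE
```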

Fisher’s exact test (fisher.test())

The $\chi^2$ approximation is considered reliable only if no expected cell count is less than 1 and no more than 20 percent of cells have expected counts below 5 (Cochran 1954). If this condition is not met, Fisher’s exact test (Fisher 1970) (fisher.test()) is applied instead, as it is a non-parametric method that does not rely on large-sample approximations. The test calculates an exact p-value for testing independence by conditioning on the observed margins: the row totals $R_i = \sum_{j=1}^C O_{ij}$ and the column totals $C_j = \sum_{i=1}^R O_{ij}$, which define the structure of the contingency table.

In the $2 \times 2$ case, the observed table can be written as:

$$\begin{array}{c|cc|c} & C_1 & C_2 & \text{Row sums} \\ \hline R_1 & a & b & a + b \\ R_2 & c & d & c + d \\ \hline \text{Column sums} & a + c & b + d & n \end{array}$$

Let $O = \begin{bmatrix} a & b \\ c & d \end{bmatrix}$ denote the above observed $2 \times 2$ contingency table. The exact probability of observing this table under the null hypothesis of independence, given the fixed margins, is given by the hypergeometric probability mass function (PMF)

$$\mathbb{P}(O \mid R_1, R_2, C_1, C_2) = \frac{\binom{a + b}{a} \binom{c + d}{c}}{\binom{n}{a + c}},$$

where $n = a + b + c + d$ is the total sample size.

The p-value is computed by summing the probabilities of all tables with the same margins whose probabilities under the null are less than or equal to that of the observed table.

For general $R \times C$ tables, fisher.test() generalises this approach using the multivariate hypergeometric distribution.
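For the 2 x 2 case, this construction can be reproduced with the hypergeometric PMF dhyper() and compared to fisher.test(); the counts are hypothetical, and the small tolerance factor guards against floating-point comparison of equal probabilities:

```r
# Fisher's exact two-sided p-value rebuilt from the hypergeometric PMF
a <- 3; b <- 1; c <- 1; d <- 3
n <- a + b + c + d

# All tables with the observed margins, indexed by the cell a
a_range <- max(0, a - d):min(a + b, a + c)
probs   <- dhyper(a_range, a + c, b + d, a + b)

# Sum the probabilities of tables no more probable than the observed one
p_manual <- sum(probs[probs <= dhyper(a, a + c, b + d, a + b) * (1 + 1e-7)])
p_fisher <- fisher.test(matrix(c(a, c, b, d), nrow = 2))$p.value

all.equal(p_manual, p_fisher)  # TRUE
```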

Test choice and graphical output

If the expected frequencies are sufficiently large, specifically if at least 80% of the cells have expected counts greater than 5 and no expected count is smaller than 1, the function uses Pearson’s $\chi^2$-test (chisq.test()).

Otherwise, it switches to Fisher’s exact test (fisher.test()) (Cochran 1954).

For 2-by-2 contingency tables, Yates’ continuity correction (Yates 1934) is always applied to Pearson’s $\chi^2$-test.

The graphical output depends on the selected test:

  • Pearson’s $\chi^2$ test (general $R \times C$ tables): a grouped column plot showing row percentages with the p-value in the title, followed by a mosaic plot with colour-coded Pearson residuals.
  • Pearson’s $\chi^2$ test with Yates’ continuity correction ($2 \times 2$ tables): a grouped column plot showing row percentages with the p-value in the title.
  • Fisher’s exact test: a grouped column plot showing absolute counts with $N$ labels above each bar and the p-value in the title.
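The decision rule above can be expressed as a small helper; cochran_ok() is a hypothetical name for illustration, not a function of the package:

```r
# Check Cochran's rule: all expected counts >= 1 and
# at least 80% of cells with expected counts >= 5
cochran_ok <- function(tab) {
  E <- outer(rowSums(tab), colSums(tab)) / sum(tab)  # expected counts
  all(E >= 1) && mean(E >= 5) >= 0.8
}

cochran_ok(as.table(rbind(c(20, 30), c(25, 15))))  # TRUE: chisq.test() applies
```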

Transforming a contingency table to a data frame

The following examples for tests of categorical predictor and response are all based on the HairEyeColor contingency table.

Contingency tables must be converted to the required column-based data frame using the helper function counts_to_cases(). The function transforms the contingency table HairEyeColor into a data frame named HairEyeColourDataFrame.

HairEyeColourDataFrame <- counts_to_cases(as.data.frame(HairEyeColor))

Examples

In all examples of this section, we will test the null hypothesis that hair colour (“Hair”) and eye colour (“Eye”) are independent of each other.

Pearson’s $\chi^2$-test (chisq.test())
hair_eye_colour_df <- counts_to_cases(as.data.frame(HairEyeColor))
visstat(hair_eye_colour_df$Eye, hair_eye_colour_df$Hair)

The graphical output shows that the null hypothesis of Pearson’s $\chi^2$ test – namely, that hair colour and eye colour are independent – must be rejected at the default significance level $\alpha = 0.05$ ($p = 2.33 \cdot 10^{-25} < \alpha$). The mosaic plot indicates that the strongest deviations are due to over-representation of individuals with black hair and brown eyes, and of those with blond hair and blue eyes. In contrast, individuals with blond hair and brown eyes are the most under-represented.

Pearson’s $\chi^2$-test with Yates’ continuity correction

In the following example, we restrict the data to participants with either black or brown hair and either brown or blue eyes, resulting in a 2-by-2 contingency table.

hair_black_brown_eyes_brown_blue <- HairEyeColor[1:2, 1:2, ]
# Transform to data frame
hair_black_brown_eyes_brown_blue_df <- counts_to_cases(as.data.frame(hair_black_brown_eyes_brown_blue))
# Chi-squared test with Yates' continuity correction

visstat(hair_black_brown_eyes_brown_blue_df$Eye, hair_black_brown_eyes_brown_blue_df$Hair)

Also in this reduced dataset, we reject the null hypothesis that the hair colours “brown” and “black” are independent of the eye colours “brown” and “blue”. Because this is a $2 \times 2$ table, Yates’ continuity correction is applied and no mosaic plot is produced. Note that the Yates-corrected p-value is slightly higher than the uncorrected Pearson p-value, reflecting the more conservative correction.

Fisher’s exact test (fisher.test())

Again, we extract a 2-by-2 contingency table from the full dataset, this time keeping only male participants with black or brown hair and hazel or green eyes.

Pearson’s $\chi^2$ test applied to this table would yield an expected frequency below 5 in one of the four cells (25% of all cells), violating the requirement that at least 80% of the expected frequencies be 5 or greater (Cochran 1954).

Therefore, visstat() automatically selects Fisher’s exact test instead.

hair_eye_colour_male <- HairEyeColor[, , 1]
# Slice out a 2 by 2 contingency table
black_brown_hazel_green_male <- hair_eye_colour_male[1:2, 3:4]
# Transform to data frame
black_brown_hazel_green_male <- counts_to_cases(as.data.frame(black_brown_hazel_green_male))
# Fisher test
fisher_stats <- visstat(black_brown_hazel_green_male$Eye, black_brown_hazel_green_male$Hair)

Response of class ordered, predictor of class factor

When the response is an ordered factor, visstat() converts it internally to integer level codes and redirects to the non-parametric path (see examples in Section Comparing central tendencies).

Both variables of class ordered factor: Wilcoxon/Kruskal–Wallis (default) or Kendall’s $\tau_b$ (correlation = TRUE)

When both the response and the predictor are ordered factors and correlation = TRUE is set, visstat() tests the null hypothesis of no monotone association via Kendall’s $\tau_b$ rank correlation (Kendall 1945; Agresti 2010). Without correlation = TRUE, both-ordered inputs follow the standard non-parametric path (Wilcoxon or Kruskal–Wallis).

Kendall’s $\tau_b$ (cor.test(..., method = "kendall"))

For two ordinal variables with $n$ joint observations, let $C$ denote the number of concordant pairs (those whose ranks agree in both variables) and $D$ the number of discordant pairs. Kendall’s $\tau_b$ is defined as

$$\tau_b = \frac{C - D}{\sqrt{(n_0 - n_1)(n_0 - n_2)}},$$

where $n_0 = n(n-1)/2$, $n_1 = \sum_i t_i(t_i-1)/2$ summed over groups of tied ranks in the response, and $n_2$ is the analogous quantity for the predictor. The denominator correction makes $\tau_b$ attain $\pm 1$ even with ties, which Spearman’s $\rho$ does not (Kendall 1945). With few ordered levels (e.g., five-point Likert items), ties are unavoidable; this is the principal reason to prefer $\tau_b$ over Spearman’s $\rho$ in this setting (Agresti 2010).

visstat() calls cor.test(as.numeric(y), as.numeric(x), method = "kendall", exact = FALSE) and reports $\tau_b$, the test statistic $z$, and the two-sided p-value.
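The tie-corrected definition can be checked against cor.test() by counting concordant and discordant pairs directly; the data below are made up for illustration:

```r
# Kendall's tau_b from concordant/discordant pair counts, with tie correction
x <- c(1, 2, 2, 3, 4, 5, 1, 3)
y <- c(2, 1, 3, 3, 5, 4, 2, 2)

idx <- combn(length(x), 2)                      # all pairs of observations
s   <- sign(x[idx[1, ]] - x[idx[2, ]]) * sign(y[idx[1, ]] - y[idx[2, ]])
C <- sum(s > 0); D <- sum(s < 0)                # concordant, discordant

n  <- length(x); n0 <- n * (n - 1) / 2
n1 <- sum(table(y) * (table(y) - 1) / 2)        # ties in the response
n2 <- sum(table(x) * (table(x) - 1) / 2)        # ties in the predictor
tau_b <- (C - D) / sqrt((n0 - n1) * (n0 - n2))

est <- cor.test(y, x, method = "kendall", exact = FALSE)$estimate
all.equal(tau_b, unname(est))  # TRUE
```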

Graphical output

One plot is produced: a jittered rank–rank scatter plot that visualises the monotone trend, with points colour-coded by the predictor level and annotated with $\tau_b$ and the p-value.

Example

We construct a hypothetical survey of 150 secondary-school students in which alcohol consumption frequency and academic performance are each recorded on a five-point ordinal scale. A negative monotone association is expected: students who consume alcohol more frequently tend to achieve lower academic performance.

set.seed(42)
n <- 150
# Latent scores with deliberate negative monotone association:
# higher alcohol consumption (xs) -> lower academic performance (ys)
xs <- sample(1:5, n, replace = TRUE)
ys <- pmin(5, pmax(1, (6 - xs) + sample(-1:1, n, replace = TRUE)))
likert_levels  <- c("never", "rarely", "sometimes", "often", "always")
likert_levels2 <- c("poor", "fair", "ok", "good", "great")
alcohol     <- ordered(likert_levels[xs],  levels = likert_levels)
performance <- ordered(likert_levels2[ys], levels = likert_levels2)
kendall_result <- visstat(performance, alcohol, correlation = TRUE)

Saving the graphical output

All generated graphics can be saved in any file format supported by Cairo() (Urbanek and Horner 2025), including “png”, “jpeg”, “pdf”, “svg”, “ps”, and “tiff”, in the user-specified plotDirectory.

If the optional argument plotName is not given, the naming of the output follows the pattern "testname_namey_namex.", where "testname" specifies the selected test and "namey" and "namex" are character strings naming the selected data vectors y and x, respectively. The suffix corresponding to the chosen graphicsoutput (e.g., "pdf", "png") is then concatenated to form the complete output file name.

In the following example, we store the graphics in png format in the plotDirectory tempdir() with the default naming convention:

# Graphical output written to plotDirectory: In this example
# a single bar chart showing absolute counts.
# Output file: chi_squared_or_fisher_Hair_Eye.png
save_fisher <- visstat(black_brown_hazel_green_male$Eye, black_brown_hazel_green_male$Hair,
                       graphicsoutput = "png", plotDirectory = tempdir())

The full file paths of the generated graphics are stored in the attribute "plot_paths" of the returned object of class "visstat".

paths <- attr(save_fisher, "plot_paths")
print(paths)
## [1] "/tmp/RtmptvQS1P/chi_squared_or_fisher_Hair_Eye.png"

Remove the graphical output from plotDirectory:

## [1] TRUE

When assumption plots (residual and Q-Q plots) are generated, the corresponding plot file carries the prefix "assumption_".

The visstat methods

Objects returned by visstat() are of class "visstat" and support the S3 methods print(), summary(), and plot().

print() displays the test used and the p-value; summary() provides the full contents of the returned object, including assumption tests and post-hoc comparisons.

iris_kruskal <- visstat(iris$Species, iris$Petal.Width)
print(iris_kruskal)
## Object of class 'visstat'
## 
## Available components:
## [1] "Kruskal Wallis rank sum test"                
## [2] "post-hoc by pairwise Wilcoxon rank sum test "
summary(iris_kruskal)
## Summary of visstat object
## 
## --- Named components ---
## [1] "Kruskal Wallis rank sum test"                
## [2] "post-hoc by pairwise Wilcoxon rank sum test "
## 
## --- Contents ---
## 
## $Kruskal Wallis rank sum test:
## 
##  Kruskal-Wallis rank sum test
## 
## data:  samples by fact
## Kruskal-Wallis chi-squared = 131.19, df = 2, p-value < 2.2e-16
## 
## 
## $post-hoc by pairwise Wilcoxon rank sum test :
## [[1]]
##                      diff lwr upr p adj
## versicolor-setosa      NA  NA  NA     0
## virginica-setosa       NA  NA  NA     0
## virginica-versicolor   NA  NA  NA     0

plot()

When visstat() is called with graphicsoutput specified, plot() lists the file paths of the stored graphics:

iris_kruskal_stored <- visstat(iris$Species, iris$Petal.Width,
                               graphicsoutput = "pdf",
                               plotName = "iris_kruskal",
                               plotDirectory = tempdir())
plot(iris_kruskal_stored)
## Plot [1] stored in /tmp/RtmptvQS1P/glm_assumptions_iris_kruskal.pdf
## Plot [2] stored in /tmp/RtmptvQS1P/iris_kruskal.pdf

When visstat() is called without graphicsoutput (the default interactive mode), the generated plots are captured internally. Calling plot() without which lists the available plots; calling plot() with which replays the selected plot in the interactive R session:

plot(iris_kruskal)
## Plot [1] captured. Use plot(obj, which = 1) to display.
## Plot [2] captured. Use plot(obj, which = 2) to display.
# Interactive only (not executed during vignette build):
plot(iris_kruskal, which = 2)

Limitations

Default settings

The main purpose of this package is the decision-logic-based automatic selection and visualisation of the “right” statistical test. Therefore, except for the user-adjustable conf.level parameter, all statistical tests are applied with the default settings of the corresponding base R functions. As a consequence, paired tests are currently not supported, and visstat() does not allow studying interaction terms between the different levels of an independent variable in an analysis of variance. Focusing on the graphical representation of tests, only simple linear regression is implemented, as multiple linear regression cannot be visualised in a single plot. When regression assumptions are violated, the package offers Spearman rank correlation (correlation = TRUE) as the sole alternative to linear regression. Alternative methods such as data transformation, generalised linear models or robust regression are not implemented: each requires user judgment (about the transformation family, the link function, or the estimator) that cannot be automated without substantially expanding the decision tree and increasing the risk of Type I error inflation.

Test decisions based solely on p-values of statistical tests

No single test maintains optimal Type I error rates and statistical power across all distributions (Olejnik and Algina 1987), and p-values obtained from these tests may be unreliable if their assumptions are violated.

Assessing assumptions solely through p-values can lead to both Type I errors (false positives) and Type II errors (false negatives). In large samples, even minor, random deviations from the null hypothesis can result in statistically significant p-values, leading to Type I errors. Conversely, in small samples, substantial violations of the assumptions may not reach statistical significance, resulting in Type II errors (Kozak and Piepho 2018).

Moreover, assumption tests provide no information on the nature of deviations from the expected distribution (Shatz 2024). Thus the assessment of normality or homoscedasticity should never rely solely on p-values but should be complemented by visual inspection of the diagnostic plots generated by visstat().

Testing normality with the Shapiro–Wilk normality test

Normality tests behave poorly at both ends of the sample-size range: with small samples they fail to detect non-normality, and with large samples they flag negligible departures from normality as significant (Ghasemi and Zahediasl 2012; Fagerland 2012; Franc 2025).

This package uses $n > 50$ per group as a rule-of-thumb threshold to omit normality testing, assuming that parametric tests of central tendencies become robust to non-normality at larger sample sizes based on the central limit theorem. The threshold is necessarily arbitrary; simulation studies show that the convergence rate depends on the skewness and kurtosis of the underlying distribution, with moderately skewed distributions requiring roughly 40–50 observations for adequate convergence of the sampling distribution of the mean (Fagerland 2012). Simulation studies suggest that the Shapiro–Wilk test has the highest power among normality tests in small to moderate ($n = 10$ to 100) sample sizes (Razali and Wah 2011).

The package therefore uses the Shapiro–Wilk test to select between parametric (ANOVA/Welch’s t-test) and non-parametric (Kruskal–Wallis/Wilcoxon) methods for groups smaller than 50.

Visualisation based solely on base R

The package depends only on base R graphics, with no ggplot2 (Wickham 2016) dependency, keeping the transitive dependency footprint minimal. For more polished, annotated plots of the chosen statistical test, we refer to packages such as ggstatsplot (Patil 2021) or ggpubr (Kassambara 2026).

Combining tests inflates the overall Type I error rate

Combining multiple tests with differing assumptions using simple majority voting inflates the overall Type I error rate.

Therefore, automated test selection based solely on p-values cannot replace the visual inspection of sample distributions provided by visstat(). Based on the provided diagnostic plots, it may be necessary to override the automated choice of test in individual cases.

Limited number of implemented tests

While one of R’s greatest strengths is the sheer volume of statistical methods available, our automated test selection pipeline deliberately restricts its scope to the most frequently used tests, particularly those relevant in a medical context (Chicco et al. 2025). Incorporating a wider array of methods would require additional preliminary assumption checks, which in turn exacerbates the risk of overall Type I error inflation. Furthermore, expanding the pipeline would result in a highly complex decision tree, rendering the underlying statistical logic increasingly opaque to the user.

Bootstrapping as modern alternative to hypothesis testing

Bootstrapping methods (Wilcox 2021) make minimal distributional assumptions and can provide confidence intervals for nearly any statistic. However, bootstrapping is computationally intensive, often requiring thousands of resamples, and may perform poorly with very small sample sizes.

The computational intensity of bootstrap runs counter to the purpose of the visStatistics package, which is designed to offer a rapid overview of the data, laying the groundwork for deeper analysis in subsequent steps.

Appendix A: The general linear model

The general linear model (Searle 1971) provides a unified mathematical framework underlying Student’s t-test, Fisher’s ANOVA, and simple linear regression.

Let $n$ denote the number of observations and $k-1$ the number of predictors. The model for observation $i$, $i = 1, \ldots, n$, is:

$$Y_i = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_{k-1} x_{i,k-1} + \varepsilon_i, \quad \varepsilon_i \sim N(0, \sigma^2),$$

where $Y_i$ is the response for observation $i$, $x_{ij}$ is the value of predictor $j$ for observation $i$, $\beta_0, \beta_1, \ldots, \beta_{k-1}$ are the $k$ parameters, and $\varepsilon_i$ are independently distributed error terms with constant variance $\sigma^2$.

Student’s t-test as a linear model

Student’s t-test tests the null hypothesis that the means of two (unpaired) groups are equal. We create one binary indicator variable $x_{i1}$, where $x_{i1} = 0$ for observations in group 1 and $x_{i1} = 1$ for observations in group 2. Taking expectations for each group:

  • Group 1 ($x_{i1} = 0$): $E[Y_i \mid x_{i1} = 0] = \beta_0 = \mu_1$
  • Group 2 ($x_{i1} = 1$): $E[Y_i \mid x_{i1} = 1] = \beta_0 + \beta_1 = \mu_2$

Therefore, $\beta_1 = \mu_2 - \mu_1$ represents the difference between group means. Testing $H_0: \beta_1 = 0$ is mathematically equivalent to testing $H_0: \mu_1 = \mu_2$ in Student’s t-test.
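This equivalence can be demonstrated numerically; a sketch on simulated data (all names are illustrative):

```r
# Student's t-test recovered from a linear model with one indicator variable
set.seed(1)
y1 <- rnorm(20, mean = 5)
y2 <- rnorm(20, mean = 6)
y  <- c(y1, y2)
group <- factor(rep(c("g1", "g2"), each = 20))  # g1 is the reference level

fit <- lm(y ~ group)                      # slope of groupg2 is mu_2 - mu_1
tt  <- t.test(y1, y2, var.equal = TRUE)   # Student's (pooled) t-test

all.equal(unname(coef(fit)["groupg2"]), mean(y2) - mean(y1))             # TRUE
all.equal(summary(fit)$coefficients["groupg2", "Pr(>|t|)"], tt$p.value)  # TRUE
```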

Fisher’s ANOVA as a linear model

Fisher’s one-way ANOVA tests the null hypothesis that the means of $k$ groups are equal.

It can be formulated within the linear model framework using $k-1$ indicator variables $x_{i1}, x_{i2}, \ldots, x_{i,k-1}$ for each observation $i$ to represent group membership. The indicator variable coding is defined as follows:

  • Group 1 (reference): $x_{i1} = x_{i2} = \cdots = x_{i,k-1} = 0$ for all observations $i$ in group 1
  • Group 2: $x_{i1} = 1$ and $x_{i2} = \cdots = x_{i,k-1} = 0$ for all observations $i$ in group 2
  • Group 3: $x_{i2} = 1$ and $x_{i1} = x_{i3} = \cdots = x_{i,k-1} = 0$ for all observations $i$ in group 3
  • Group $k$: $x_{i,k-1} = 1$ and $x_{i1} = \cdots = x_{i,k-2} = 0$ for all observations $i$ in group $k$

Taking expectations for each group:

  • Group 1: $E[Y_i \mid x_{i1} = x_{i2} = \cdots = x_{i,k-1} = 0] = \beta_0 = \mu_1$
  • Group 2: $E[Y_i \mid x_{i1} = 1] = \beta_0 + \beta_1 = \mu_2$
  • Group 3: $E[Y_i \mid x_{i2} = 1] = \beta_0 + \beta_2 = \mu_3$
  • Group $k$: $E[Y_i \mid x_{i,k-1} = 1] = \beta_0 + \beta_{k-1} = \mu_k$

Testing $H_0: \beta_1 = \beta_2 = \cdots = \beta_{k-1} = 0$ is mathematically equivalent to testing $H_0: \mu_1 = \mu_2 = \cdots = \mu_k$ in Fisher’s ANOVA.

Relations between tests

The two-sample case sits inside both the t-test and the ANOVA frameworks, so several pairs of tests are numerically related.

Welch’s t-test → Student’s t-test. When the assumption of equal variances holds ($s_1^2 = s_2^2 = s^2$) and sample sizes are equal ($n_1 = n_2 = n$), Welch’s t-test reduces to Student’s t-test with $2n - 2$ degrees of freedom. This exact equivalence does not extend to the multi-group case:

Welch’s one-way ANOVA → Fisher’s one-way ANOVA. Even under equal variances ($s_1^2 = \cdots = s_k^2$) and equal sample sizes ($n_1 = \cdots = n_k$), the Welch test statistic is not algebraically identical to the classical ANOVA $F$-statistic. Nevertheless, under these conditions the Welch statistic converges to the classical $F$-statistic, and any numerical differences become negligible in practice (Welch 1951).

t-tests as special cases of ANOVA. t-tests are special cases of ANOVA for the comparison of two groups. The squared $t$-statistic equals the corresponding $F$-statistic:

$$t^2 = F.$$

For the comparison of two groups, Student’s t-test (t.test(var.equal = TRUE)) and aov() yield identical p-values, as do Welch’s t-test (t.test(var.equal = FALSE)) and oneway.test(). In this case visstat() reports the t-statistic, as it provides sign information (indicating which group has the larger mean).
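A quick numerical check of $t^2 = F$ on simulated data:

```r
# For two groups the squared pooled t-statistic equals the ANOVA F-statistic
set.seed(2)
y <- c(rnorm(15, mean = 10), rnorm(15, mean = 12))
g <- factor(rep(c("a", "b"), each = 15))

t_stat <- t.test(y ~ g, var.equal = TRUE)$statistic
f_stat <- anova(lm(y ~ g))["g", "F value"]

all.equal(unname(t_stat)^2, f_stat)  # TRUE
```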

Simple linear regression as a linear model

Simple linear regression corresponds to one continuous predictor $x_{i1}$ for observation $i$. Testing $H_0: \beta_1 = 0$ examines whether there is a linear relationship between the predictor and the response.

Appendix B: Tests for homoscedasticity

The Levene–Brown–Forsythe test levene.test()

The Levene–Brown–Forsythe test improves upon Levene’s original test (Levene 1960) by using the median instead of the mean to centre the data.

This makes it more robust to skewed data or data with outliers, providing more reliable results in many practical situations (Allingham and Rayner 2012).

levene.test() mimics the default behaviour of leveneTest() in the car package (Fox and Weisberg 2019).

The Levene–Brown–Forsythe test evaluates the null hypothesis that all groups have equal variances by testing whether the absolute deviations from group medians are equal across groups.

For each observation $y_{ij}$ in group $i$, it computes the absolute deviation from the group median:

$$z_{ij} = |y_{ij} - \tilde{y}_i|,$$

where $\tilde{y}_i$ is the median of group $i$.

The test statistic is the F-statistic from a one-way ANOVA on the $z_{ij}$ values:

$$F = \frac{\frac{\sum_{i=1}^{k} n_i (\bar{z}_i - \bar{z})^2}{k-1}}{\frac{\sum_{i=1}^{k} \sum_{j=1}^{n_i} (z_{ij} - \bar{z}_i)^2}{N-k}} = \frac{(N-k) \sum_{i=1}^{k} n_i (\bar{z}_i - \bar{z})^2}{(k-1) \sum_{i=1}^{k} \sum_{j=1}^{n_i} (z_{ij} - \bar{z}_i)^2},$$

where $k$ is the number of groups, $N$ is the total sample size, $n_i$ is the sample size of group $i$, $\bar{z}_i$ is the mean of absolute deviations from the median in group $i$, and $\bar{z}$ is the overall mean of all absolute deviations.

Under the null hypothesis of equal variances, the test statistic follows an F-distribution: $F \sim F(k-1, N-k)$.
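The construction, an ordinary one-way ANOVA on the absolute deviations from the group medians, can be sketched in a few lines on simulated heteroscedastic groups:

```r
# Brown-Forsythe statistic: one-way ANOVA on z_ij = |y_ij - median_i|
set.seed(3)
y <- c(rnorm(20, sd = 1), rnorm(20, sd = 2), rnorm(20, sd = 3))
g <- factor(rep(1:3, each = 20))

z  <- abs(y - ave(y, g, FUN = median))  # absolute deviations from group medians
bf <- anova(lm(z ~ g))                  # F-test on the z_ij

bf["g", "F value"]   # the Brown-Forsythe test statistic
bf["g", "Pr(>F)"]    # its p-value
```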

Bartlett’s test bartlett.test()

Additionally, homoscedasticity is assessed via Bartlett’s test (Bartlett 1937) (bartlett.test()), which has more power than the Brown–Forsythe version of Levene’s test (Brown and Forsythe 1974) when the normality assumption is met (Allingham and Rayner 2012).

Bartlett’s test evaluates whether sample variances are equal across $k$ normally distributed groups.

The test statistic is

$$K^2 = \frac{(N - k) \ln s_p^2 - \sum_{i=1}^k (n_i - 1) \ln s_i^2}{1 + \frac{1}{3(k - 1)} \left( \sum_{i=1}^k \frac{1}{n_i - 1} - \frac{1}{N - k} \right)},$$

where $s_i^2$ is the sample variance of group $i$, and $s_p^2$ is the pooled variance:

$$s_p^2 = \frac{1}{N - k} \sum_{i=1}^k (n_i - 1) s_i^2.$$

Under the null hypothesis that all group variances are equal and the data are normally distributed, the test statistic approximately follows a $\chi^2$-distribution with $k - 1$ degrees of freedom (Bartlett 1937).
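The statistic can be rebuilt from the group variances and compared with bartlett.test(); the data are simulated:

```r
# Bartlett's K^2 rebuilt from the group variances
set.seed(4)
y <- c(rnorm(15, sd = 1.0), rnorm(25, sd = 1.5), rnorm(20, sd = 2.0))
g <- factor(rep(1:3, times = c(15, 25, 20)))

ni <- tapply(y, g, length)            # group sizes
s2 <- tapply(y, g, var)               # group variances
k  <- nlevels(g); N <- sum(ni)
sp2 <- sum((ni - 1) * s2) / (N - k)   # pooled variance

K2 <- ((N - k) * log(sp2) - sum((ni - 1) * log(s2))) /
  (1 + (sum(1 / (ni - 1)) - 1 / (N - k)) / (3 * (k - 1)))

all.equal(unname(K2), unname(bartlett.test(y, g)$statistic))  # TRUE
```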

Breusch-Pagan test bp.test()

For linear regression, group-based tests are not applicable. Instead, the visStatistics package function bp.test() implements the Breusch-Pagan test.

The Breusch-Pagan test assesses whether the variance of the residuals from a regression model depends on the values of the independent variables (Breusch and Pagan 1979).
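The idea can be sketched with the studentised (Koenker) variant of the statistic, regressing the squared residuals on the predictor; whether bp.test() implements exactly this variant is an assumption here, not taken from the package:

```r
# Studentised Breusch-Pagan sketch (assumed variant, for illustration only):
# LM = n * R^2 of an auxiliary regression of squared residuals on the
# predictor, compared against a chi-squared distribution with 1 df
fit <- lm(Volume ~ Girth, data = trees)
e2  <- residuals(fit)^2

aux     <- lm(e2 ~ trees$Girth)      # auxiliary regression
lm_stat <- nrow(trees) * summary(aux)$r.squared
p_value <- pchisq(lm_stat, df = 1, lower.tail = FALSE)
```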

The diagnostic plots together with the p-values of the tests for normality and homoscedasticity enable the user to assess whether the linear model assumptions are met and manually override the automated test selection.

Bibliography

Abdi, Hervé. 2007. “The Bonferroni and Šidák Corrections for Multiple Comparisons.” Encyclopedia of Measurement and Statistics (Thousand Oaks, CA).
Agresti, Alan. 2010. Analysis of Ordinal Categorical Data. 1st ed. Wiley Series in Probability and Statistics. Wiley. https://doi.org/10.1002/9780470594001.
Allingham, David, and J. C. W. Rayner. 2012. “Testing Equality of Variances for Multiple Univariate Normal Populations.” Journal of Statistical Theory and Practice 6 (3): 524–35. https://doi.org/10.1080/15598608.2012.695703.
Bartlett, M. S. 1937. “Properties of Sufficiency and Statistical Tests.” Proceedings of the Royal Society of London. Series A, Mathematical and Physical Sciences 160 (901): 268–82. https://doi.org/10.1098/rspa.1937.0109.
Bijlenga, Philippe, Renato Gondar, Sabine Schilling, et al. 2017. “PHASES Score for the Management of Intracranial Aneurysm: A Cross-Sectional Population-Based Retrospective Study.” Stroke 48 (8): 2105–12. https://doi.org/10.1161/STROKEAHA.117.017391.
Breusch, T. S., and A. R. Pagan. 1979. “A Simple Test for Heteroscedasticity and Random Coefficient Variation.” Econometrica 47 (5): 1287–94. https://doi.org/10.2307/1911963.
Brown, Morton B., and Alan B. Forsythe. 1974. “Robust Tests for the Equality of Variances.” Journal of the American Statistical Association 69 (346): 364–67. https://doi.org/10.1080/01621459.1974.10482955.
Chambers, J. M. 2018. Graphical Methods for Data Analysis. Chapman and Hall/CRC. https://doi.org/10.1201/9781351072304.
Chicco, Davide, Andrea Sichenze, and Giuseppe Jurman. 2025. “A Simple Guide to the Use of Student’s t-Test, Mann-Whitney U Test, Chi-squared Test, and Kruskal-Wallis Test in Biostatistics.” BioData Min 18 (1): 56. https://doi.org/10.1186/s13040-025-00465-6.
Cochran, William G. 1954. “The Combination of Estimates from Different Experiments.” Biometrics 10 (1): 101. https://doi.org/10.2307/3001666.
Cook, R. Dennis, and Sanford Weisberg. 1982. Residuals and Influence in Regression. New York: Chapman and Hall.
Delacre, Marie, Daniël Lakens, and Christophe Leys. 2017. “Why Psychologists Should by Default Use Welch’s t-Test Instead of Student’s t-Test.” International Review of Social Psychology 30 (1): 92–101. https://doi.org/10.5334/irsp.82.
Fagerland, Morten W. 2012. “T-Tests, Non-Parametric Tests, and Large Studies—a Paradox of Statistical Practice?” BMC Medical Research Methodology 12 (1): 78. https://doi.org/10.1186/1471-2288-12-78.
Fagerland, Morten W., and Leiv Sandvik. 2009. “Performance of Five Two-Sample Location Tests for Skewed Distributions with Unequal Variances.” Contemporary Clinical Trials 30 (5): 490–96. https://doi.org/10.1016/j.cct.2009.06.007.
Fisher, Ronald A., and F. Yates. 1990. Statistical Methods, Experimental Design, and Scientific Inference: A Re-issue of Statistical Methods for Research Workers, the Design of Experiments and Statistical Methods and Scientific Inference. Edited by J. H. Bennett. Oxford: Oxford University Press. https://doi.org/10.1093/oso/9780198522294.001.0001.
Fisher, Ronald Aylmer. 1970. Statistical Methods for Research Workers. 14th ed., revised and enlarged. Oliver and Boyd.
Fox, John, and Sanford Weisberg. 2019. An R Companion to Applied Regression. 3rd ed. Sage.
Franc, Jeffrey Michael. 2025. “The Misuse of Normality Tests as Gatekeepers for Research in Prehospital and Disaster Medicine.” Prehospital and Disaster Medicine 40 (5): 241–42. https://doi.org/10.1017/S1049023X25101465.
Games, Paul A., and John F. Howell. 1976. “Pairwise Multiple Comparison Procedures with Unequal N’s and/or Variances: A Monte Carlo Study.” Journal of Educational Statistics (US) 1 (2): 113–25. https://doi.org/10.2307/1164979.
Ghasemi, Asghar, and Saleh Zahediasl. 2012. “Normality Tests for Statistical Analysis: A Guide for Non-Statisticians.” Int J Endocrinol Metab 10 (2): 486–89. https://doi.org/10.5812/ijem.3505.
Graves, Spencer, Hans-Peter Piepho, and Luciano Selzer, with help from Sundar Dorai-Raj. 2026. multcompView: Visualizations of Paired Comparisons. Manual. https://doi.org/10.32614/CRAN.package.multcompView.
Gross, Juergen, and Uwe Ligges. 2015. Nortest: Tests for Normality. Manual. https://doi.org/10.32614/CRAN.package.nortest.
Hochberg, Yosef, and Ajit C. Tamhane. 1987. Multiple Comparison Procedures. 1st ed. Wiley Series in Probability and Statistics. Wiley. https://doi.org/10.1002/9780470316672.
Hollander, Myles, Eric Chicken, and Douglas A. Wolfe. 2014. Nonparametric Statistical Methods. Third edition. Wiley Series in Probability and Statistics. John Wiley & Sons, Inc.
Holm, Sture. 1979. “A Simple Sequentially Rejective Multiple Test Procedure.” Scandinavian Journal of Statistics 6 (2): 65–70. https://www.jstor.org/stable/4615733.
Kassambara, Alboukadel. 2026. Ggpubr: ’Ggplot2’ Based Publication Ready Plots. Manual. https://doi.org/10.32614/CRAN.package.ggpubr.
Kendall, M. G. 1945. “The Treatment of Ties in Ranking Problems.” Biometrika 33 (3): 239–51. https://doi.org/10.2307/2332303.
Kozak, M., and H.-P. Piepho. 2018. “What’s Normal Anyway? Residual Plots Are More Telling Than Significance Tests When Checking ANOVA Assumptions.” J Agronomy Crop Science 204 (1): 86–98. https://doi.org/10.1111/jac.12220.
Kruskal, William H., and W. Allen Wallis. 1952. “Use of Ranks in One-Criterion Variance Analysis.” Journal of the American Statistical Association 47 (260): 583–621. https://doi.org/10.2307/2280779.
Kwak, Sang Gyu, and Jong Hae Kim. 2017. “Central Limit Theorem: The Cornerstone of Modern Statistics.” Korean J Anesthesiol 70 (2): 144–56. https://doi.org/10.4097/kjae.2017.70.2.144.
Levene, Howard. 1960. “Robust Tests for Equality of Variances.” In Contributions to Probability and Statistics: Essays in Honor of Harold Hotelling, edited by Ingram Olkin. Stanford University Press.
Lumley, Thomas, Paula Diehr, Scott Emerson, and Lu Chen. 2002. “The Importance of the Normality Assumption in Large Public Health Data Sets.” Annu. Rev. Public Health 23 (1): 151–69. https://doi.org/10.1146/annurev.publhealth.23.100901.140546.
Mann, Henry B., and Donald R. Whitney. 1947. “On a Test of Whether One of Two Random Variables Is Stochastically Larger Than the Other.” The Annals of Mathematical Statistics 18 (1): 50–60. https://doi.org/10.1214/aoms/1177730491.
Meyer, David, Achim Zeileis, and Kurt Hornik. 2006. “The Strucplot Framework: Visualizing Multi-Way Contingency Tables with Vcd.” Journal of Statistical Software 17 (3): 1–48. https://doi.org/10.18637/jss.v017.i03.
Meyer, David, Achim Zeileis, Kurt Hornik, and Michael Friendly. 2024. vcd: Visualizing Categorical Data. Manual. https://doi.org/10.32614/CRAN.package.vcd.
Moser, B. K., and G. R. Stevens. 1992. “Homogeneity of Variance in the Two-Sample Means Test.” The American Statistician, February, 19–21. https://doi.org/10.1080/00031305.1992.10475839.
Olejnik, Stephen F., and James Algina. 1987. “Type I Error Rates and Power Estimates of Selected Parametric and Nonparametric Tests of Scale.” Journal of Educational Statistics 12 (1): 45. https://doi.org/10.2307/1164627.
Patil, Indrajeet. 2021. “Visualizations with Statistical Details: The ’Ggstatsplot’ Approach.” Journal of Open Source Software 6 (61): 3167. https://doi.org/10.21105/joss.03167.
Pearson, Karl. 1900. “On the Criterion That a Given System of Deviations from the Probable in the Case of a Correlated System of Variables Is Such That It Can Be Reasonably Supposed to Have Arisen from Random Sampling.” The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science 50 (302): 157–75. https://doi.org/10.1080/14786440009463897.
Rasch, Dieter, Klaus D. Kubinger, and Karl Moder. 2011. “The Two-Sample t Test: Pre-Testing Its Assumptions Does Not Pay Off.” Stat Papers 52 (1): 219–31. https://doi.org/10.1007/s00362-009-0224-x.
Razali, Nornadiah Mohd, and Yap Bee Wah. 2011. “Power Comparisons of Shapiro-Wilk, Kolmogorov-Smirnov, Lilliefors and Anderson-Darling Tests.” Journal of Statistical Modeling and Analytics 2 (1): 21–33.
Sato, Yasunori, Masahiko Gosho, Kengo Nagashima, Sho Takahashi, James H. Ware, and Nan M. Laird. 2017. “Statistical Methods in the Journal — an Update.” New England Journal of Medicine 376 (11): 1086–87. https://doi.org/10.1056/NEJMc1616211.
Satterthwaite, F. E. 1946. “An Approximate Distribution of Estimates of Variance Components.” Biometrics Bulletin 2 (6): 110–14. https://doi.org/10.2307/3002019.
Searle, Shayle R. 1971. Linear Models. John Wiley & Sons.
Shapiro, S. S., and M. B. Wilk. 1965. “An Analysis of Variance Test for Normality (Complete Samples).” Biometrika 52 (3-4): 591–611. https://doi.org/10.1093/biomet/52.3-4.591.
Shatz, Itamar. 2024. “Assumption-Checking Rather Than (Just) Testing: The Importance of Visualization and Effect Size in Statistical Diagnostics.” Behav Res 56 (2): 826–45. https://doi.org/10.3758/s13428-023-02072-x.
Strasak, Alexander M., Qamruz Zaman, Gerhard Marinell, Karl P. Pfeiffer, and Hanno Ulmer. 2007. “The Use of Statistics in Medical Research: A Comparison of "The New England Journal of Medicine" and "Nature Medicine".” The American Statistician 61 (1): 47–55. https://www.jstor.org/stable/27643837.
Subirana, Isaac, Héctor Sanz, and Joan Vila. 2014. “Building Bivariate Tables: The compareGroups Package for R.” Journal of Statistical Software 57 (12): 1–16.
Urbanek, Simon, and Jeffrey Horner. 2025. Cairo: R Graphics Device Using Cairo Graphics Library for Creating High-Quality Bitmap (PNG, JPEG, TIFF), Vector (PDF, SVG, PostScript) and Display (X11 and Win32) Output. Manual. https://doi.org/10.32614/CRAN.package.Cairo.
Welch, B. L. 1947. “The Generalization of ‘Student’s’ Problem When Several Different Population Variances Are Involved.” Biometrika 34 (1–2): 28–35. https://doi.org/10.1093/biomet/34.1-2.28.
Welch, B. L. 1951. “On the Comparison of Several Mean Values: An Alternative Approach.” Biometrika 38 (3/4): 330–36. https://doi.org/10.2307/2332579.
Wickham, Hadley. 2016. Ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York.
Wilcox, Rand R. 2021. Introduction to Robust Estimation and Hypothesis Testing.
Xu, Weichao, Yunhe Hou, Y. S. Hung, and Yuexian Zou. 2013. “A Comparative Analysis of Spearman’s Rho and Kendall’s Tau in Normal and Contaminated Normal Models.” Signal Processing 93 (1): 261–76. https://doi.org/10.1016/j.sigpro.2012.08.005.
Yap, B. W., and C. H. Sim. 2011. “Comparisons of Various Types of Normality Tests.” Journal of Statistical Computation and Simulation 81 (12): 2141–55. https://doi.org/10.1080/00949655.2010.520163.
Yates, F. 1934. “Contingency Tables Involving Small Numbers and the χ² Test.” Journal of the Royal Statistical Society Series B: Statistical Methodology 1 (2): 217–35. https://doi.org/10.2307/2983604.