## Abstract

*t*-Tests are widely used by researchers to compare the average values of a numeric outcome between two groups. If there are doubts about the suitability of the data for the requirements of a *t*-test, most notably the distribution being non-normal, the Wilcoxon-Mann-Whitney test may be used instead. However, although often applied, both tests may be invalid when discrete and/or extremely skewed data are analyzed. In medicine, extremely skewed data having an excess of zeroes are often observed, representing a numeric outcome that does not occur for a large percentage of cases (so is often zero) but which also sometimes takes relatively large values. For data such as these, application of the *t*-test or Wilcoxon-Mann-Whitney test could lead researchers to draw incorrect conclusions. A valid alternative is regression modeling to quantify the characteristics of the data. The increased availability of software has simplified the application of these more complex statistical analyses and hence makes it easier for researchers to use them. In this article, we illustrate the methodology applied to a comparison of cyst counts taken from control and steroid-treated fetal mouse kidneys.

- Poisson
- negative binomial
- zero inflation

When faced with the task of comparing a numeric outcome between two groups, most clinicians perform either a *t*-test or, if the data set is small and/or the assumptions for parametric testing are not met, a Wilcoxon-Mann-Whitney test (WMW; also called the Mann-Whitney-Wilcoxon test, Wilcoxon rank-sum test, Wilcoxon test, or Mann-Whitney *U*-test). Statistical methods may be divided according to the assumptions they demand from the data. Parametric methods suppose that the data follow a specific probability distribution whose parameters can be estimated, whereas the basic principle of nonparametric tests, such as the WMW test, is that the observed pooled values are ranked and no particular distributional assumptions are made. Statistical analyses of highly skewed distributions can be especially problematic, particularly when the data are discrete and exhibit value inflation.

The *t*-test is a parametric test of the difference in mean between two groups that assumes that the data are normally distributed and the groups have equal variances. Welch's approximation (13) can be used when the assumption of equal variances is not satisfied. While it is commonly believed that the WMW test compares medians in the same way that a *t*-test compares means, this does not always hold. In some cases, a WMW test indicates a significant difference between groups when the medians are equal (2, 3, 6), in contrast to what many researchers would expect. There are two versions of the WMW test: a general form and the location-shift model, both of which can be problematic when skewed discrete data are analyzed.

The general form of the WMW test concerns the probability (*P*) that an observation drawn at random from the first group (*X*) will exceed an observation drawn at random from the second group (*Y*), i.e., it tests the null hypothesis *P* (*X* > *Y*) = 0.5 against the two-sided alternative *P* (*X* > *Y*) ≠ 0.5 or a one-sided alternative, *P* (*X* > *Y*) > 0.5 or *P* (*X* > *Y*) < 0.5. It assumes that all the observations are independent and are either continuous or ordinal with no possibility of ties (6). This is often an unrealistic assumption for ordinal and discrete data. Although the variance of the WMW test statistic can be adjusted in the presence of ties, this adjustment always results in an increase in the test statistic and smaller *P* values, so this version of the WMW test becomes less conservative (12) as the proportion of tied observations in the pooled sample increases.
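As a minimal sketch of the quantity examined by the general form, the snippet below estimates *P* (*X* > *Y*) by comparing every cross-group pair of observations. The counts are hypothetical toy values, not the study data, and the convention of counting each tie as half a win is an assumption introduced here because the no-ties condition fails for zero-heavy counts.

```python
from statistics import median


def prob_superiority(x, y):
    """Estimate P(X > Y) + 0.5 * P(X = Y) over all cross-group pairs.

    Counting ties as half a win is an assumed convention for tied data.
    """
    wins = sum(1 for xi in x for yi in y if xi > yi)
    ties = sum(1 for xi in x for yi in y if xi == yi)
    return (wins + 0.5 * ties) / (len(x) * len(y))


# Hypothetical toy counts (NOT the study data): many zeroes, a few large values.
treated = [0, 0, 0, 0, 0, 0, 1, 2, 4, 9]
control = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]

ps = prob_superiority(treated, control)
print("median treated:", median(treated))
print("median control:", median(control))
print("estimated P(X > Y) with half-ties:", ps)
```

For these toy counts both medians are 0, yet the estimate is 0.665 rather than 0.5, which illustrates how the general hypothesis can be rejected even when the medians are equal.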

The location-shift form of the WMW test is more restrictive and assumes that the distributions of values in the two groups differ only in location. Using this form, the test assesses the significance of any difference in medians (6). Fagerland and Sandvik (5) showed that the WMW test in its location-shift form is not robust in the presence of skewed data. Hence, when the distributions are skewed and dispersions are unequal, this form of the WMW test is inappropriate, and, consequently, incorrect interpretations may be made. If all of its necessary assumptions hold, then it is unlikely that the *t*-test of means would be invalid (3), and, therefore, the location-shift form has very limited applicability (6).

In this article, we analyzed counts of cysts in two groups of mouse kidneys: one treated with a steroid and a control group. Kidney cyst counts were often zero and had a highly skewed distribution. When comparisons were made between the two groups, a WMW test detected a significant difference, although the group medians were both equal to zero. We used this data set to illustrate alternative and valid methods for quantifying and formally testing the significance of differences for data of this form. The standard practice of applying WMW tests to overcome the usual assumptions of the *t*-test without consideration of the data's distribution is not a panacea for every data set where a *t*-test is inappropriate. Here, the assumptions of the location-shift form of the WMW test are violated, so the cyst data set is unsuitable for this form of the test. While the general version of the WMW test is not incorrect in this instance, it provides less information about the difference in location than other methods. We showed how considering several regression models offers a viable and more informative alternative means of analysis for data with an excess of zeroes and extreme skewness. Regardless of the issue of validity of the WMW test, a regression approach has some advantages over simple tests of comparisons: *1*) model parameters can be estimated to allow for greater interpretation of the data and *2*) goodness-of-fit statistics allow for comparisons between regression models.

This is a training article to illustrate common misconceptions about the *t*-test and WMW test and to propose the use of discrete regression modeling as an alternative. The data set we used represents data resulting from a typical experiment found in research. It provides a simple example of a situation where the assumptions of the commonly applied *t*-test and WMW test are violated. Although these tests are clearly inappropriate for the real-life data presented here, we include them for illustrative purposes.

## METHODS

#### Data.

The example data set originated from a study (3a) investigating the effect of a corticosteroid on cyst formation in mouse fetuses undertaken within the Department of Nephro-Urology at the Institute of Child Health of University College London. Embryonic mouse kidneys were cultured, and a random sample was treated with steroids (*n* = 111), whereas the remainder acted as controls (*n* = 103). Six days later, cyst counts were compared between the groups.

Summary descriptive statistics and bar charts were used to describe the data. Normality was formally assessed using the Shapiro-Wilk test (12), and the control and steroid-treated groups were compared using a *t*-test and WMW test. Mean and median differences are presented with 95% confidence intervals (CIs). A χ^{2}-test (4) assessed the difference in the percentage of zero counts between the two groups.

#### Regression modeling.

A series of different regression models was used to describe the relationship between cyst counts and steroid status. First, a normal linear regression model was used to compare the means between the two groups, which for a single binary predictor (steroid/control) is equivalent to the *t*-test. The assumptions of linear regression are the same as for the *t*-test, i.e., that the data are continuously numeric and normally distributed with equal variances in each group. The purpose of presenting this regression model is to allow comparisons of model fit diagnostics with other, less commonly used, models. Regression diagnostics are not available when the *t*-test is used.

Second, a Poisson regression model was fitted to the data. This model is appropriate when the outcome is a count or a rate and quantifies the mean number of events (or counts). The mean and variance are assumed to be equal in each of the two groups (steroid/control) (1). If the variance of the distribution is significantly greater than the mean, there is overdispersion, and the Poisson regression model will correctly estimate the covariate parameters but will underestimate their SEs (8). An extension of the Poisson regression model is the negative binomial regression model (1), which incorporates an extra parameter to allow for overdispersion in the data. The improvement in model fit obtained using the negative binomial rather than the Poisson model can be formally tested.
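The equal mean and variance assumption can be checked informally before any model is fitted. The sketch below uses hypothetical toy counts (not the study data) to compute the variance-to-mean ratio and to compare the observed fraction of zeroes with the fraction a Poisson distribution with the same mean would predict; the function and variable names are illustrative only.

```python
import math
from statistics import mean, variance


def poisson_pmf(k, mu):
    """Poisson probability of observing exactly k events when the mean is mu."""
    return math.exp(-mu) * mu ** k / math.factorial(k)


# Hypothetical toy counts (NOT the study data).
counts = [0] * 14 + [1, 1, 2, 3, 5, 8]

m = mean(counts)
v = variance(counts)          # sample variance (n - 1 denominator)
dispersion_index = v / m      # close to 1 under a Poisson distribution

observed_zero_fraction = counts.count(0) / len(counts)
expected_zero_fraction = poisson_pmf(0, m)

print("mean:", m, "variance:", round(v, 2))
print("variance / mean:", round(dispersion_index, 2))
print("observed zero fraction:", observed_zero_fraction)
print("Poisson-expected zero fraction:", round(expected_zero_fraction, 2))
```

Here the mean is 1 and the variance is about 4.42, and 70% of the values are zero where a Poisson distribution with mean 1 would predict about 37%, the pattern that motivates the negative binomial and zero-inflated extensions.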

#### Allowing for zero inflation.

Zero inflation occurs where there is an excess number of zeroes in the population, often arising from a separate process in the population. For example, some individuals are infertile and will never have children, and others are fertile but may be childless; the total number of individuals with no children depends on both processes combined.

Both the Poisson and negative binomial regression models can be extended to incorporate an additional parameter that allows for zero inflation in the data. The resulting models are known as the zero-inflated Poisson (ZIP) and zero-inflated negative binomial (ZINB) (1), and the extra parameter quantifies the excess of zeroes. These models result in two parameter estimates for each covariate entered into the model. For example, in our data set, one parameter will estimate the increased probability of a kidney being unable to produce cysts. The other parameter will estimate the difference in cyst counts between the steroid-treated and untreated groups of those kidneys that can/do produce cysts (which can take values of 0, 1, 2, etc.). Note that the total number of zeroes is the sum of kidneys unable to produce cysts and those kidneys that can produce cysts but have not done so during this experiment.
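The two sources of zeroes can be written as a mixture. The following sketch shows the ZIP probability function under assumed notation: `pi` is the probability that a kidney cannot produce cysts and `mu` is the Poisson mean among kidneys that can. Both names and the parameter values are hypothetical and are not estimates from the study.

```python
import math


def zip_pmf(k, mu, pi):
    """Zero-inflated Poisson probability of a count k.

    pi : probability of a structural zero (unit cannot produce events)
    mu : Poisson mean for units that can produce events
    """
    poisson = math.exp(-mu) * mu ** k / math.factorial(k)
    if k == 0:
        # Zeroes arise from both processes combined.
        return pi + (1.0 - pi) * poisson
    return (1.0 - pi) * poisson


mu, pi = 2.0, 0.5  # hypothetical values for illustration
p_zero = zip_pmf(0, mu, pi)
total = sum(zip_pmf(k, mu, pi) for k in range(60))
zip_mean = sum(k * zip_pmf(k, mu, pi) for k in range(60))

print("P(count = 0):", round(p_zero, 4))
print("probabilities sum to:", round(total, 6))
print("overall mean:", round(zip_mean, 6), "vs (1 - pi) * mu =", (1 - pi) * mu)
```

The probability of a zero combines the structural zeroes with the zeroes that the Poisson component produces by chance, mirroring the note above that the total number of zeroes depends on both processes.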

#### Model parameters.

Parameter estimates in linear regression give the difference in means between the groups. In contrast, the Poisson and negative binomial models yield rate ratios (RRs), which estimate the change in the relative (rather than absolute) mean number of events between the groups. RRs can be expressed in different ways. For example, RR = 1.25 indicates that the mean in one group is, on average, 1.25 times higher or, alternatively, that there is a 25% increase in one group compared with the other. Conversely, RR = 0.83 indicates a 17% decrease in one group compared with the other.
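The arithmetic behind these interpretations is sketched below. It assumes the usual log link for Poisson and negative binomial regression, under which a fitted group coefficient is exponentiated to obtain the RR; the helper names are illustrative and not part of any package.

```python
import math


def rate_ratio(log_scale_coefficient):
    """Convert a log-link regression coefficient to a rate ratio."""
    return math.exp(log_scale_coefficient)


def percent_change(rr):
    """Express a rate ratio as a percentage increase (+) or decrease (-)."""
    return (rr - 1.0) * 100.0


for rr in (1.25, 0.83):
    print("RR =", rr, "->", round(percent_change(rr), 1), "% change")

# A coefficient of log(1.25) on the log scale corresponds to RR = 1.25.
print("RR from coefficient:", round(rate_ratio(math.log(1.25)), 2))
```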

Zero-inflated regression models additionally yield the relative odds of the proportions of zeros in each group. The odds and probability are calculated by dividing the number of events by the number without the event and by the total number of observations, respectively. For example, an odds value of 2 (for every two individuals that experience the event, one individual does not) equates to a probability of 0.67 [equal to 2/(1+2) = 2/3]; odds of 0.25 (for every individual for whom the event occurs, there are four individuals for whom it does not) equates to a probability of 0.2 (equal to 1/5). Odds ratios are calculated as the odds in one group divided by the odds in another group.
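These conversions can be sketched in a few lines. The function names and the two group proportions used for the odds ratio are hypothetical and chosen only to illustrate the arithmetic.

```python
def odds_to_probability(odds):
    """Probability = events / total = odds / (1 + odds)."""
    return odds / (1.0 + odds)


def probability_to_odds(p):
    """Odds = events / non-events = p / (1 - p)."""
    return p / (1.0 - p)


def odds_ratio(p_group_a, p_group_b):
    """Odds in group A divided by odds in group B."""
    return probability_to_odds(p_group_a) / probability_to_odds(p_group_b)


# Odds of 0.25 (one event for every four non-events) is a probability of 1/5.
print("odds 0.25 -> probability", odds_to_probability(0.25))

# Hypothetical proportions: 20% of group A and 50% of group B have the event.
print("odds ratio A vs B:", round(odds_ratio(0.2, 0.5), 2))
```

An odds ratio below 1, as in this hypothetical comparison, means the event is less likely in the first group than in the second.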

Rate ratios and odds ratios are always positive, and a value of 1 means that there is no difference between the groups. All estimates (mean differences and rate and odds ratios) are presented with 95% CIs.

#### Comparison of regression models.

There are a variety of ways of deciding which regression model best describes the data. A more complex model, with additional parameters, generally improves the goodness of fit (i.e., how closely the model follows the data) but requires the estimation of extra parameters. Ideally, our chosen optimal model should describe the data's main features without losing information or being overly complex. It is important that model selection procedures avoid including redundant parameters (overfitting); we do not want to have more parameters than necessary.

The regression models described above (linear, Poisson, negative binomial, ZIP, and ZINB) are not all direct extensions of each other. Hence, we need to use comparison methods that do not rely on the models being nested, i.e., that all the parameters of the smaller model of the two being compared are included in the larger model. Therefore, we compared models using the Bayesian information criterion (BIC), which is a well-established measure of goodness of fit (10) that also applies to nonnested models. The model's likelihood is a measure of how well it fits the data set. The BIC is based on the model's log likelihood, with larger values of the log likelihood indicating better goodness of fit. To account for the fact that more complex models always lead to the likelihood remaining the same or increasing (adding parameters cannot make the overall fit worse), BIC makes an adjustment penalizing for the number of estimated parameters and is more conservative against overfitting, i.e., it leads to models with a smaller number of parameters than other measures of goodness of fit. This is particularly important when the potential number of covariates to be adjusted for is large. Models with lower BIC values have a better penalized goodness of fit; however, the absolute value of the BIC is meaningless by itself, and only differences in BIC matter.

All analyses were performed using the R environment for statistical computing (version 2.10.1) (11) on a Windows platform. The glm function in the base package of R was used to fit normal, Poisson, and negative binomial models, and the pscl package (15) was used to fit zero-inflated models, both of which use maximum likelihood estimation methods.

## RESULTS

#### Exploratory data analysis.

Bar charts of the cyst frequencies (Fig. 1) showed that most kidneys had no cysts (74.3% overall), although there were a few kidneys in each group with large cyst counts; hence, the distributions were highly skewed to the right and zero inflated. The overall mean number of cysts was 0.87 (SD: 2.28), indicating overdispersion since the variance (2.28^{2} ≈ 5.20) is about six times larger than the mean. The control group had a larger number of kidneys with low counts. The steroid-treated group had 1 kidney with 19 cysts, which was much higher than the maximum number of cysts found in the group of control kidneys (maximum: 3). Not surprisingly, the Shapiro-Wilk test for normality gave *P* < 0.01, indicating that the counts did not follow a normal distribution. Similarly, Bartlett's *F*-test for homogeneity of variances yielded *P* < 0.01, which implied a significant difference in the variances between the two groups. Note that this test assumes normality. The Ansari-Bradley test for equal scale parameters (9) also gave *P* < 0.01, indicating a significant difference in dispersion between the groups.

Table 1 shows summary statistics for the steroid-treated and control groups. The means were 1.55 and 0.15 for the steroid-treated and control groups, with SDs of 2.98 and 0.51, respectively. The medians for the two groups were both 0, with the steroid-treated group having an interquartile range of 2 and the control group an interquartile range of 0. The mean and SD in the steroid-treated group were much higher than those in the control group, although the medians were equal.

#### Tests of comparisons for the two groups.

Clearly, the assumptions for the application of a two-sample *t*-test were not met, but we present the results purely for comparative purposes. A *t*-test yielded a highly significant difference between the means of 1.40 [95% CI: (0.82, 1.99), *P* < 0.01]. The WMW test under the general null hypothesis *P* (*X* > *Y*) = 0.5 produced *P* < 0.01, similarly indicating a significant difference. The median difference was 0, with both the lower and upper 95% confidence limits also equal to 0; this contrasts with the results from the WMW test, where a significant difference was detected. A comparison of the number of kidneys with zero counts showed that there were significantly more (χ^{2}-statistic = 28.23, degrees of freedom: 1, *P* < 0.01) kidneys with zero counts in the control group than in the steroid-treated group [percent difference: 32.7%, 95% CI: (21.1%, 44.3%)].
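The mean difference and its interval can be approximately recovered from the group summary statistics quoted in the text (means 1.55 and 0.15, SDs 2.98 and 0.51, *n* = 111 and 103). The sketch below is an illustration under two stated assumptions: a pooled (equal-variance) standard error, and a normal critical value in place of the *t* quantile. Because the inputs are rounded, it reproduces the reported interval only to within rounding.

```python
import math
from statistics import NormalDist


def pooled_mean_difference_ci(m1, s1, n1, m2, s2, n2, level=0.95):
    """Difference in means with an approximate CI from summary statistics.

    Assumes equal variances (pooled SE) and uses a normal critical value,
    which is close to the t quantile when the degrees of freedom are large.
    """
    pooled_var = ((n1 - 1) * s1 ** 2 + (n2 - 1) * s2 ** 2) / (n1 + n2 - 2)
    se = math.sqrt(pooled_var * (1.0 / n1 + 1.0 / n2))
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    diff = m1 - m2
    return diff, diff - z * se, diff + z * se


# Rounded summary statistics as quoted in the text (steroid-treated, control).
diff, lower, upper = pooled_mean_difference_ci(1.55, 2.98, 111, 0.15, 0.51, 103)
print("difference in means:", round(diff, 2))
print("approximate 95% CI:", round(lower, 2), "to", round(upper, 2))
```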

#### Regression models.

Table 2 shows the parameter estimates, CIs, *P* values, and BIC for each of the regression models described above. As expected, linear regression gave the same information as the *t*-test but additionally yielded a BIC value that could be compared with those obtained from other regression models. The Poisson model showed that the average number of cysts in kidneys in the steroid-treated group was almost 11 times [RR: 10.64, 95% CI: (6.28, 18.03), *P* < 0.01] higher than in kidneys not treated with steroids.

The negative binomial model gave the same rate ratio as the Poisson model but with larger SEs, resulting in a wider CI than the Poisson model. As expected, the overdispersion parameter in the negative binomial model was significant (*P* < 0.01), and the BIC values clearly indicated that the negative binomial model resulted in an improvement in fit to the data (450.68 compared with 667.07 for the Poisson model).

For the ZIP model, the zero inflation parameter was significant (*model 4*: *P* = 0.04), and the odds of a kidney having zero cysts were lower in the steroid-treated group than in the control group [*model 5*, odds ratio: 0.20, 95% CI (0.08, 0.51), *P* < 0.01]. Given that there are cysts, the rate was over three times [RR: 3.23, 95% CI: (1.51, 6.94), *P* < 0.01] higher on average for kidneys in the steroid-treated group compared with the control group. The additional parameter to model the zero-inflated component in the Poisson model reduced the BIC value greatly (507.94), and the BIC fell further when zero inflation was allowed to differ between the steroid-treated and control groups (505.98), with both models providing a much better fit than the simple Poisson model (BIC: 667.07).

The basic negative binomial model (*model 3*) gave a better fit than any of the Poisson models (BIC: 450.68), with significant overdispersion, as expected. The constant zero inflation term (*model 6*) was nonsignificant, and the model that allowed zero inflation to vary according to group showed the difference to be almost significant [odds ratio: 0.18, 95% CI (0.03, 1.02), *P* = 0.05]. However, increasing the complexity of the negative binomial model resulted in a higher BIC value (455.20 and 457.80 for ZINB with and without zero inflation varying by group) as it was penalized by the incorporation of the additional parameters, indicating models that do not provide a more parsimonious fit to the data.

#### Comparison of regression models.

The observed and predicted numbers of cysts for *models 1–3*, *5*, and *7* are shown in Table 3. In the distributions for the control group, the predicted distributions for negative binomial, ZIP, and ZINB were all similar to the observed data. However, for the steroid-treated group, there were differences between the distributions. The normal distribution provided the worst fit to the data, and this was reflected in the BIC value in Table 2. The Poisson distribution also yielded a poor fit, as indicated by both the fitted values and BIC value. The additional zero inflation parameter provided a better fit to the data in the ZIP model but did not fit as well as the negative binomial or ZINB models, in terms of both the BIC value and fitted values shown in Table 3 (steroid-treated group). The BIC values shown in Table 2 indicate that the negative binomial model was the most parsimonious, and therefore preferred, model.

## DISCUSSION

The most commonly used methods for comparing location in two groups are the *t*-test and the WMW test. Permutation tests (7) have the advantage of providing a distribution-free alternative for testing differences of group means; however, they are difficult to generalize to a regression setup such as discrete regression modeling. The normality assumption of the *t*-test is inappropriate for highly skewed discrete data, where overdispersion and/or zero inflation can result in a misleading difference between means being detected. Value inflation (namely, zero inflation) can lead to a high proportion of tied values in the location-shift version of the WMW test, resulting in incorrect conclusions being drawn from *P* values. In these cases, tests may lead to conflicting results and may not agree with observations made from exploratory data analyses where the medians do not indicate a difference between groups. While an advantage of *t*-tests and WMW tests is their simplicity and ease of application, they are greatly disadvantaged when applied to count data due to these erroneous assumptions (14). It can be shown using simulation that the WMW test is especially vulnerable to zero inflation, particularly when the probability of belonging to the zero class is >0.75, and less so to overdispersion. More details on the simulations performed are available upon request from the corresponding author.

Here, we present a number of regression models for discrete data as an alternative to the *t*-tests and WMW tests. Poisson regression models can be used to model the mean number of events, and the negative binomial model includes an extra parameter to allow for overdispersion. A further extension of these models is to model zero inflation in the data, resulting in the ZIP and ZINB models, each with corresponding additional parameters.

While regression analyses using distributions such as the Poisson, negative binomial, ZIP, and ZINB models might be more complex, they model discrete counts, allowing for correct assumptions, and also provide more valid and useful estimates of differences between groups. Software for fitting the models is now widespread, for example, in R and STATA. BIC values are provided alongside parameter estimates and SEs in R, whereas in other packages such as STATA, the log likelihood (ℓ) is given instead, from which the BIC value can easily be calculated as follows: −2ℓ+*k* × log(*n*), where *k* is the number of parameters in the model and *n* is the sample size.
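The BIC formula above is straightforward to apply by hand. In the sketch below, the sample size is the study's 111 + 103 = 214 kidneys, but the two log likelihoods and parameter counts are hypothetical values chosen only to show how the penalty works; they are not taken from the fitted models.

```python
import math


def bic(log_likelihood, n_params, n_obs):
    """Bayesian information criterion: -2 * loglik + k * log(n)."""
    return -2.0 * log_likelihood + n_params * math.log(n_obs)


n_obs = 111 + 103  # kidneys in the steroid-treated and control groups

# Each additional parameter costs log(n) on the BIC scale.
penalty_per_parameter = math.log(n_obs)

# Hypothetical log likelihoods for a simpler and a more complex model.
bic_simple = bic(-220.0, 3, n_obs)
bic_complex = bic(-219.5, 4, n_obs)

print("penalty per parameter:", round(penalty_per_parameter, 3))
print("BIC simple :", round(bic_simple, 2))
print("BIC complex:", round(bic_complex, 2))
```

In this hypothetical comparison the extra parameter improves −2ℓ by only 1, which is less than the penalty of about 5.37, so the simpler model has the lower BIC and would be preferred.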

A further advantage is that the regression modeling approach is very flexible. Models can be extended to quantify the associations with additional covariates, which may be continuous (as opposed to the single binary covariate, steroid/control, used in our example data set). In this example, the number of groups could be extended to include more than one type of steroid or a continuous variable, e.g., the growth of the kidneys over the number of days cultured. Interactions between covariates can also be investigated in the models, e.g., we could look at the interaction between the growth of kidneys and steroid-treated groups. A further use of discrete regression is in modeling rates. For counts of events measured in a certain time period, where the time period may vary across observations, dividing the count of events by the length of the time period gives a rate. In discrete regression modeling, an offset can be included to adjust for the different time periods; unlike a covariate, the offset has no estimated parameter or *P* value.
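The role of the offset can be sketched as follows, assuming a log-link count model in which the log of the observation time enters the linear predictor with its coefficient fixed at 1. The coefficient values and exposure times are hypothetical.

```python
import math


def expected_count(exposure, intercept, slope, x):
    """Expected count under a log-link model with an offset.

    log(mean) = log(exposure) + intercept + slope * x
    The offset log(exposure) has a fixed coefficient of 1, so nothing is
    estimated for it.
    """
    return math.exp(math.log(exposure) + intercept + slope * x)


intercept, slope = -1.0, 0.7  # hypothetical coefficients

# Doubling the observation time doubles the expected count ...
short = expected_count(6.0, intercept, slope, x=1)
long = expected_count(12.0, intercept, slope, x=1)
print("count ratio for doubled exposure:", round(long / short, 6))

# ... while the rate ratio between groups does not depend on exposure.
rr = expected_count(6.0, intercept, slope, 1) / expected_count(6.0, intercept, slope, 0)
print("rate ratio:", round(rr, 4), "= exp(slope) =", round(math.exp(slope), 4))
```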

In the example presented in this article, both the *t*-test and WMW test indicated that a significant difference exists in the number of cysts between the control and steroid-treated groups. While exploratory data analysis revealed that there was a difference between the means of the two groups, there was no difference in the medians. Of the seven regression models fitted, the BIC values indicated that the negative binomial model provided the best fit to the data. This model showed that the mean number of cysts was significantly higher in the steroid-treated group than in the control group and that there was significant overdispersion present in the data.

The cyst counts on a given day provide a snapshot of each kidney's ability to produce cysts. The underlying capacities of the cells may or may not be identical. The Poisson model assumes that they are the same, whereas the negative binomial model allows for variation. Therefore, the data suggest that the more complex model was appropriate. However, the preference for the negative binomial model over the ZINB model suggested that there was no subset of kidneys that cannot produce cysts.

This article demonstrates how commonly used tests for the comparison of two groups, namely, the *t*-test and WMW test, can be problematic when dealing with discrete counts. Erroneous assumptions may need to be made for the application of the *t*-test and linear regression to skewed discrete data, hence invalidating results. Less obviously, nonparametric tests such as the WMW test, where value inflation leads to an increased number of ties, may also lead to incorrect conclusions. We present the use of regression modeling as an alternative method for comparing groups with skewed observations, using the Poisson distribution, which models the mean number of events and can be extended to allow for overdispersion and value inflation in the negative binomial, ZIP, and ZINB models. The approach outlined in this article provides a flexible tool for analyzing discrete data that avoids making erroneous assumptions and facilitates comparisons between models.

## GRANTS

The Institute of Child Health of University College London receives a portion of funding from the Department of Health's National Institute of Health Research Biomedical Research Centres funding scheme. The Centre for Paediatric Epidemiology and Biostatistics also benefits from funding support from the Medical Research Council (MRC) in its capacity as the MRC Centre of Epidemiology for Child Health (G0400546). F. McElduff acknowledges the support of an MRC Capacity Building Studentship. S.-K. Chan is funded by Kidney Research UK.

## DISCLOSURES

No conflicts of interest, financial or otherwise, are declared by the author(s).

## ACKNOWLEDGMENTS

The authors are grateful to Adrian Woolf and David Long for the helpful comments on a previous version of this article.

- Copyright © 2010 the American Physiological Society