A New View of Statistics

© 2000 Will G Hopkins



Generalizing to a Population:
MODELS: IMPORTANT DETAILS continued


Counts and Proportions as Dependent Variables
If your dependent variable represents a count (e.g., the number of injuries in different sports) or a proportion (e.g., the percent of Type I muscle fibers in a muscle biopsy from different athletes), analysis can be a challenge. Once again, the problem with the usual analyses is the possibility of violation of one or more of the assumptions we have to make when calculating confidence limits or the p value.

What I said on the last few pages about t tests of ordinal variables and t tests of Likert-scale variables applies also to counts: t tests are usually OK, and they will fall over only when you have a small sample size and more than 70% of your subjects score zero counts (because then the sampling distribution of the difference between the means won't be close enough to normal).

When you are fitting lines or curves, you also have to worry about non-uniformity of residuals. With counts, this worry is very real, because the variation in a given count from sample to sample depends on how big the count is. For example, the typical variation (standard deviation) in a count is usually simply the square root of the count, so a count of about 400 injuries varies typically by ±20, whereas a count of about 40 injuries varies typically by ±6. I hope it's obvious that the residuals for injury counts of 400 will therefore be much larger than those for counts of about 40. Rank transformation would fix these non-uniform residuals, but better approaches are available: binomial regression, Poisson regression, square-root transformation and arcsine-root transformation.
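To make the square-root rule concrete, here is a minimal Python sketch. It assumes, as stated above, that the typical variation in a count is the square root of the count; the function name is only illustrative.

```python
import math

def count_variation(count):
    """Return (typical SD, SD as a percent of the count), assuming SD = sqrt(count)."""
    sd = math.sqrt(count)
    return sd, 100 * sd / count

for count in (40, 400):
    sd, percent = count_variation(count)
    print(f"count {count}: varies typically by ±{sd:.1f} ({percent:.1f}% of the count)")
```

The larger count has the larger variation in absolute terms but the smaller variation in percentage terms, which is why the residuals are non-uniform on either scale.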

Binomial and Poisson Regression
When counts have a smallish upper bound (e.g., the number of injured players in a squad of 24 is at most 24), the counts from sample to sample vary according to what is known as a binomial distribution. When the upper bound is very large compared with the observed values of the count (e.g., the number of spinal injuries in American football each year), the counts have a Poisson distribution. With a good stats program, you can dial up an analysis that uses either of these distributions. The result is a binomial regression or a Poisson regression. In the Statistical Analysis System, you can do these analyses with Proc Genmod. Genmod stands for generalized linear modeling, which is an advanced form of general linear modeling that allows for the properties of non-normally distributed variables such as counts and proportions based on counts.
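If you want to see how the two distributions relate, the following Python sketch compares them directly. The expected count of 6 injuries and the population of 100 000 are hypothetical numbers chosen for illustration; only the squad of 24 comes from the example above.

```python
import math

def binomial_pmf(k, n, p):
    """Probability of exactly k events in n trials, each with probability p."""
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

def poisson_pmf(k, mean):
    """Probability of exactly k events when the expected count is `mean`."""
    return math.exp(-mean) * mean**k / math.factorial(k)

def largest_gap(n, mean, max_k=24):
    """Largest gap between binomial and Poisson probabilities for k = 0..max_k."""
    p = mean / n
    return max(abs(binomial_pmf(k, n, p) - poisson_pmf(k, mean))
               for k in range(max_k + 1))

# Hypothetical numbers: an expected count of 6 injuries in both cases.
print("upper bound of 24:     ", largest_gap(n=24, mean=6))
print("upper bound of 100000: ", largest_gap(n=100_000, mean=6))
```

With the small upper bound the two distributions differ noticeably, so the binomial is the one to use. With the very large upper bound they are practically identical, which is why Poisson is used there.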

Don't feel intimidated by binomial and Poisson. Are you happy with the notion that the values of most variables have the bell-shaped normal distribution? OK, counts or proportions of something don't have the normal shape when the counts are small, so we need different mathematics to describe their shapes, and different names for them. As counts get larger, the shapes of the binomial and Poisson distributions tend towards the normal shape. You still have the problem of non-uniform residuals, though, because the variability from observation to observation for larger counts is greater in absolute terms, but smaller in percentage terms, than for smaller counts. Binomial and Poisson regressions and other forms of generalized linear modeling take care of the non-uniformity. For more on generalized linear modeling, in particular the specification and use of distributions and link functions, read this message I sent to the Sportscience email list in July 2004.
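For readers without access to Proc Genmod, here is a bare-bones Python sketch of what a Poisson regression does under the hood. It is not the SAS procedure: it assumes a single predictor, a Poisson distribution and a log link, and it fits the model by Newton-Raphson steps on the likelihood. The training hours and injury counts are invented for illustration.

```python
import math

def fit_poisson_regression(x, y, iterations=25):
    """Fit log(mean count) = b0 + b1*x by Newton-Raphson on the Poisson likelihood."""
    b0, b1 = math.log(sum(y) / len(y)), 0.0  # start from the overall mean count
    for _ in range(iterations):
        mu = [math.exp(b0 + b1 * xi) for xi in x]
        # Gradient of the log-likelihood with respect to b0 and b1.
        g0 = sum(yi - mi for yi, mi in zip(y, mu))
        g1 = sum(xi * (yi - mi) for xi, yi, mi in zip(x, y, mu))
        # Information matrix: each observation is weighted by its predicted
        # count, which is how the non-uniform variability is accounted for.
        h00 = sum(mu)
        h01 = sum(xi * mi for xi, mi in zip(x, mu))
        h11 = sum(xi * xi * mi for xi, mi in zip(x, mu))
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2x2 system (information matrix) * step = gradient.
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1

# Hypothetical data: weekly training hours and injuries per season.
training_hours = [2, 4, 6, 8, 10, 12]
injuries = [1, 3, 2, 6, 9, 14]

b0, b1 = fit_poisson_regression(training_hours, injuries)
print(f"intercept {b0:.3f}, slope {b1:.3f} (log scale)")
print(f"each extra hour multiplies the expected count by about {math.exp(b1):.2f}")
```

Because of the log link, the slope is a factor effect on the expected count rather than a difference in counts. A real stats package will also give you confidence limits, which this sketch does not.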

Square-root and Arcsine-root Transformation
One way to deal with non-uniform residuals is to transform the variable. We've seen that log transformation works for some variables, and rank transformation works for most variables as a last resort. Is there a transformation for counts that will allow us to use normal analyses instead of binomial or Poisson regression? Yes: provided you aren't close to some upper bound in the counts, just use the square root of the counts in the usual analyses. When you've derived the outcome statistic and its confidence limits, assess their magnitudes with Cohen's or my scale of effect sizes, as I explained for rank transformation. You can't back-transform an effect (such as a difference between means) into a count by squaring it, but you can get a feel for the magnitude as a count relative to the mean: add the value of the effect appropriately to the mean of the square-rooted counts, then square the result. Square the mean of the square-rooted counts on its own for comparison. To get a feel for the precision of the magnitude, add each of the confidence limits of the effect to the square-rooted mean and square the result.
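The back-transformation recipe can be sketched in Python as follows. The counts, the effect and its confidence limits are all hypothetical, and the sketch assumes the effect is simply added to the mean of the square-rooted counts of a reference group.

```python
import math

def back_transform_sqrt_effect(raw_counts, effect, lower, upper):
    """Express an effect on square-rooted counts as counts relative to the mean.

    `effect`, `lower` and `upper` are the effect and its confidence limits,
    all in square-root units, as they come out of the analysis.
    """
    root_mean = sum(math.sqrt(c) for c in raw_counts) / len(raw_counts)
    return {
        "reference": root_mean ** 2,               # squared mean, for comparison
        "with_effect": (root_mean + effect) ** 2,  # mean plus effect, squared
        "with_lower": (root_mean + lower) ** 2,    # same for each confidence limit
        "with_upper": (root_mean + upper) ** 2,
    }

# Hypothetical counts and a hypothetical effect of 0.5 (limits 0.1 to 0.9).
result = back_transform_sqrt_effect([4, 9, 16, 25], effect=0.5, lower=0.1, upper=0.9)
for name, value in result.items():
    print(f"{name}: {value:.2f}")
```

Comparing `with_effect` against `reference` gives the feel for the magnitude as a count; the two confidence-limit values give the feel for its precision.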

Read the cautionary note about how the value of a back-transformed mean is not the same as the mean of the raw variable. For a simple example, imagine you have a team with only one injury this season and another team with nine injuries. The mean of the raw number of injuries is (1+9)/2 = 5. But the mean of the root-transformed injury counts is (1+3)/2 = 2, and when you square 2 to back-transform it you get 4!
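The same two-team example, as a few lines of Python you can adapt to your own counts:

```python
import math

injuries = [1, 9]  # the two teams in the example above

raw_mean = sum(injuries) / len(injuries)
root_mean = sum(math.sqrt(n) for n in injuries) / len(injuries)
back_transformed_mean = root_mean ** 2

print("mean of raw counts:   ", raw_mean)               # 5.0
print("mean of square roots: ", root_mean)              # 2.0
print("back-transformed mean:", back_transformed_mean)  # 4.0
```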

Proportions require an exotic transformation called arcsine-root. To use this transformation, express the proportion as a number between 0 and 1 (e.g., 210 Type I muscle fibers in a biopsy of 542 total fibers represents a proportion of 210/542 = 0.387). Now take the square root and find the inverse sine (arcsine) of the resulting number; in other words, find the angle whose sine is the square root of the proportion. (The angle can be in degrees or radians, where 360 degrees is 2 pi radians.) Use that weird variable in your analysis, but weight each observation by the number in the denominator of the proportion, to ensure that the residuals in the analysis are uniform. You'll have to read the documentation for your stats program to see how to apply a weighting factor. To gauge the magnitude of effects with an arcsine-root transformed variable, apply the Cohen or Hopkins scale, as explained for rank transformation. The appropriate standard deviation is the root mean square error from the analysis of the transformed variable, because this error should take into account the weighting factors. As is the case for counts, back-transformation of the observed effect works only if you add the effect appropriately to the mean before taking the sine of the result and squaring it. Multiply the result by 100 if you want it as a percent. Do the same with the confidence limits.
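Here is a minimal Python sketch of the arcsine-root transformation and its back-transformation, working in radians. The first biopsy is the example above; the other two biopsies and the effect of 0.05 are hypothetical. The weighted mean is only a stand-in for whatever weighted analysis your stats program performs.

```python
import math

def arcsine_root(proportion):
    """Angle (in radians) whose sine is the square root of the proportion."""
    return math.asin(math.sqrt(proportion))

def back_transform(angle):
    """Turn an arcsine-root value (radians, between 0 and pi/2) back into a proportion."""
    return math.sin(angle) ** 2

def weighted_mean(values, weights):
    """Mean of values, each weighted by the denominator of its proportion."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

# (Type I fibers, total fibers); only the first biopsy comes from the text.
biopsies = [(210, 542), (150, 400), (95, 310)]

angles = [arcsine_root(type1 / total) for type1, total in biopsies]
weights = [total for _, total in biopsies]
mean_angle = weighted_mean(angles, weights)

effect = 0.05  # hypothetical effect from the analysis, in radians
print(f"reference:   {100 * back_transform(mean_angle):.1f}%")
print(f"with effect: {100 * back_transform(mean_angle + effect):.1f}%")
```

The back-transformation is meaningful only while the angle stays between 0 and pi/2 radians (0 and 90 degrees), so check that the mean plus the effect or a confidence limit has not strayed outside that range.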

The square root and arcsine-root transformations work well even for low counts or zero proportions. As with ordinal variables, you'll get into trouble only with small sample sizes when more than 70% of your subjects have a score of zero or a proportion of zero. Then you have to use binomial or Poisson regression.

Phew! The square-root and arcsine-root approaches are complex. I recommend that you come to terms with a stats package that offers binomial and Poisson regression or generalized linear modeling.


Last updated 19 Aug 2004