Power of a test
In
Notation
This article uses the following notation:
- β = probability of a Type II error, known as a "false negative"
- 1 − β = probability of a "true positive", i.e., correctly rejecting the null hypothesis. "1 − β" is also known as the power of the test.
- α = probability of a Type I error, known as a "false positive"
- 1 − α = probability of a "true negative", i.e., correctly not rejecting the null hypothesis
| | Null hypothesis is true | Null hypothesis is false |
|---|---|---|
| Test rejects | α | 1 − β |
| Test doesn't reject | 1 − α | β |
Description
For a type II error probability of β, the corresponding statistical power is 1 − β. For example, if experiment E has a statistical power of 0.7, and experiment F has a statistical power of 0.95, then there is a higher probability that experiment E had a type II error than experiment F. This reduces experiment E's sensitivity to detect significant effects. Lower power does not by itself make experiment E more reliable, however: the probability of a type I error is fixed by the significance criterion α rather than by the power, so experiment E has a lower type I error probability only if its lower power results from a stricter criterion. Power can be equivalently thought of as the probability of accepting the alternative hypothesis ($H_1$) when it is true – that is, the ability of a test to detect a specific effect, if that specific effect actually exists. Thus,

$$\text{power} = \Pr(\text{reject } H_0 \mid H_1 \text{ is true}).$$
If $H_1$ is not an equality but rather simply the negation of $H_0$ (so for example with $H_0: \mu = 0$ for some unobserved population parameter $\mu$, we have simply $H_1: \mu \neq 0$) then power cannot be calculated unless probabilities are known for all possible values of the parameter that violate the null hypothesis. Thus one generally refers to a test's power against a specific alternative hypothesis.
As the power increases, there is a decreasing probability of a type II error, also called the false negative rate (β), since the power is equal to 1 − β.
In the context of binary classification, the power of a test is called its statistical sensitivity, its true positive rate, or its probability of detection.
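The four quantities in the notation can be estimated by simulation. The following is a minimal sketch, assuming a one-sided z-test of a zero mean with known unit standard deviation; the function names, the sample size of 25, and the alternative mean of 0.5 are illustrative assumptions, not part of the article.

```python
import random
from statistics import NormalDist, fmean

def rejects(sample, alpha=0.05, sigma=1.0):
    """One-sided z-test of H0: mu = 0 against H1: mu > 0 with known sigma."""
    n = len(sample)
    z = fmean(sample) / (sigma / n ** 0.5)
    return z > NormalDist().inv_cdf(1 - alpha)

def rejection_rate(mu, n=25, trials=10_000, seed=0):
    """Fraction of simulated experiments (true mean mu) in which H0 is rejected."""
    rng = random.Random(seed)
    hits = sum(
        rejects([rng.gauss(mu, 1.0) for _ in range(n)])
        for _ in range(trials)
    )
    return hits / trials

type_i_rate = rejection_rate(mu=0.0)  # H0 true: estimates alpha
power = rejection_rate(mu=0.5)        # H1 true (hypothetical mu = 0.5): estimates 1 - beta
type_ii_rate = 1 - power              # estimates beta
```

Under these assumptions the simulated rejection rate with a zero mean should sit near α, while the rate under the alternative approximates the power against that specific alternative.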
Power analysis
A related concept is "power analysis". Power analysis can be used to calculate the minimum sample size required so that one can be reasonably likely to detect an effect of a given size.
Rule of thumb
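Lehr's[2][3] (rough) rule of thumb says that the sample size $n$ (each group) for a two-sided two-sample t-test with power 80% (β = 0.2) and significance level α = 0.05 should be approximately

$$n \approx 16\,\frac{s^2}{d^2},$$

where $s^2$ is an estimate of the population variance and $d = \mu_1 - \mu_2$ is the difference in means to be detected.
In a more general sense, one obtains:[4]

$$n \geq 2\,(z_{1-\alpha/2} + z_{1-\beta})^2\,\frac{s^2}{d^2},$$

with $z_p$ being the $p$-quantile of the standard normal distribution.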
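A minimal sketch of the rule of thumb, assuming its usual statement n ≈ 16 s²/d² per group (s the standard deviation, d the difference in means to be detected), alongside the normal-approximation formula n ≈ 2 (z_{1−α/2} + z_{1−β})² s²/d². The function names and the example values are hypothetical.

```python
from math import ceil
from statistics import NormalDist

def lehr_n_per_group(s, d):
    """Rough rule of thumb: two-sided alpha = 0.05, power = 0.80."""
    return 16 * s ** 2 / d ** 2

def general_n_per_group(s, d, alpha=0.05, power=0.80):
    """Normal-approximation sample size per group for a two-sided two-sample test."""
    z = NormalDist().inv_cdf
    return 2 * (z(1 - alpha / 2) + z(power)) ** 2 * s ** 2 / d ** 2

# Hypothetical inputs: standard deviation 1, difference to detect 0.5.
rough = lehr_n_per_group(s=1.0, d=0.5)
refined = general_n_per_group(s=1.0, d=0.5)
print(ceil(rough), ceil(refined))
```

The constant 16 is a rounding of 2 (z_{0.975} + z_{0.80})², which is why the two results are close for the conventional α and power; for other choices only the general formula applies.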
Background
Factors influencing power
Statistical power may depend on a number of factors. Some factors may be particular to a specific testing situation, but at a minimum, power nearly always depends on the following three factors:
- the statistical significance criterion used in the test
- the magnitude of the effect of interest in the population
- the sample size used to detect the effect
A significance criterion is a statement of how unlikely a positive result must be, if the null hypothesis of no effect is true, for the null hypothesis to be rejected. The most commonly used criteria are probabilities of 0.05 (5%, 1 in 20), 0.01 (1%, 1 in 100), and 0.001 (0.1%, 1 in 1000). If the criterion is 0.05, the probability of the data implying an effect at least as large as the observed effect when the null hypothesis is true must be less than 0.05, for the null hypothesis of no effect to be rejected. One easy way to increase the power of a test is to carry out a less conservative test by using a larger significance criterion, for example 0.10 instead of 0.05. This increases the chance of rejecting the null hypothesis (obtaining a statistically significant result) when the null hypothesis is false; that is, it reduces the risk of a type II error (false negative regarding whether an effect exists). But it also increases the risk of obtaining a statistically significant result (rejecting the null hypothesis) when the null hypothesis is true; that is, it increases the risk of a type I error (false positive).
The magnitude of the effect of interest in the population can be quantified in terms of an effect size, where there is greater power to detect larger effects. An effect size can be a direct value of the quantity of interest, or it can be a standardized measure that also accounts for the variability in the population. For example, in an analysis comparing outcomes in a treated and control population, the difference of outcome means $\bar{Y} - \bar{X}$ would be a direct estimate of the effect size, whereas $(\bar{Y} - \bar{X})/\sigma$ would be an estimated standardized effect size, where $\sigma$ is the common standard deviation of the outcomes in the treated and control groups. If constructed appropriately, a standardized effect size, along with the sample size, will completely determine the power. An unstandardized (direct) effect size is rarely sufficient to determine the power, as it does not contain information about the variability in the measurements.
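The difference between a direct and a standardized effect size can be sketched as follows. This assumes the common standard deviation is estimated by the pooled sample standard deviation; the function names and the data are made up for illustration.

```python
from statistics import fmean, variance

def direct_effect(treated, control):
    """Difference of outcome means, in the units of the measurement."""
    return fmean(treated) - fmean(control)

def standardized_effect(treated, control):
    """Difference of means divided by the pooled standard deviation."""
    n1, n2 = len(treated), len(control)
    pooled_var = (
        (n1 - 1) * variance(treated) + (n2 - 1) * variance(control)
    ) / (n1 + n2 - 2)
    return direct_effect(treated, control) / pooled_var ** 0.5

# Hypothetical outcomes for a treated and a control group.
treated = [5.1, 6.0, 5.5, 6.4, 5.8]
control = [4.9, 5.2, 5.0, 5.6, 4.8]
print(direct_effect(treated, control), standardized_effect(treated, control))
```

Rescaling the measurements (for example, changing units) changes the direct effect but leaves the standardized effect unchanged, which is why only the latter can determine power together with the sample size.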
The sample size determines the amount of sampling error inherent in a test result. Other things being equal, effects are harder to detect in smaller samples. Increasing sample size is often the easiest way to boost the statistical power of a test. How increased sample size translates to higher power is a measure of the efficiency of the test – for example, the sample size required for a given power.[5]
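How the three factors act together can be sketched with a normal approximation. The example assumes a two-sided two-sample comparison with equal group sizes and a standardized effect size d; `approx_power` is a hypothetical helper, not a function from any of the packages listed in this article.

```python
from statistics import NormalDist

_STD = NormalDist()

def approx_power(d, n_per_group, alpha=0.05):
    """Normal-approximation power of a two-sided two-sample test.

    d is the standardized effect size; both rejection tails are counted.
    """
    z_crit = _STD.inv_cdf(1 - alpha / 2)
    shift = d * (n_per_group / 2) ** 0.5
    return _STD.cdf(shift - z_crit) + _STD.cdf(-shift - z_crit)

baseline = approx_power(d=0.5, n_per_group=64)
looser_criterion = approx_power(d=0.5, n_per_group=64, alpha=0.10)
larger_effect = approx_power(d=0.8, n_per_group=64)
larger_sample = approx_power(d=0.5, n_per_group=128)
print(baseline, looser_criterion, larger_effect, larger_sample)
```

Each of the three changes raises the power relative to the baseline; with d = 0 the function returns α, the rejection probability under the null hypothesis.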
The precision with which the data are measured also influences statistical power. Consequently, power can often be improved by reducing the measurement error in the data. A related concept is to improve the "reliability" of the measure being assessed (as in psychometric reliability).
The design of an experiment or observational study often influences the power. For example, in a two-sample testing situation with a given total sample size n, it is optimal to have equal numbers of observations from the two populations being compared (as long as the variances in the two populations are the same). In regression analysis and analysis of variance, there are extensive theories and practical strategies for improving the power based on optimally setting the values of the independent variables in the model.
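The claim about equal allocation can be checked numerically. This sketch assumes equal variances in the two populations, so that the standard error of the difference in sample means is σ·sqrt(1/n1 + 1/n2); the function name and the total of 100 observations are illustrative.

```python
def se_of_difference(n1, n2, sigma=1.0):
    """Standard error of the difference of two sample means, equal variances."""
    return sigma * (1 / n1 + 1 / n2) ** 0.5

total = 100  # hypothetical total sample size
# Try every split of the total and keep the one with the smallest standard error.
best_n1 = min(range(1, total), key=lambda n1: se_of_difference(n1, total - n1))
print(best_n1, se_of_difference(best_n1, total - best_n1), se_of_difference(20, 80))
```

A smaller standard error means a larger test statistic for the same true difference, so the balanced split gives the highest power for a fixed total.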
Interpretation
Although there are no formal standards for power (sometimes referred to as π), most researchers assess the power of their tests using π = 0.80 as a standard for adequacy. This convention implies a four-to-one trade-off between β-risk and α-risk. (β is the probability of a type II error, and α is the probability of a type I error; 0.2 and 0.05 are conventional values for β and α.) However, there will be times when this 4-to-1 weighting is inappropriate. In medicine, for example, tests are often designed in such a way that no false negatives (type II errors) will be produced. But this inevitably raises the risk of obtaining a false positive (a type I error). The rationale is that it is better to tell a healthy patient "we may have found something; let's test further," than to tell a diseased patient "all is well."[6]
Power analysis is appropriate when the concern is with the correct rejection of a false null hypothesis. In many contexts, the issue is less about determining if there is or is not a difference but rather with getting a more refined estimate of the population effect size.
Many statistical analyses involve the estimation of several unknown quantities. In simple cases, all but one of these quantities are nuisance parameters. In this setting, the only relevant power pertains to the single quantity that will undergo formal statistical inference. In some settings, particularly if the goals are more "exploratory", there may be a number of quantities of interest in the analysis. For example, in a multiple regression analysis we may include several covariates of potential interest. In situations such as this where several hypotheses are under consideration, it is common that the powers associated with the different hypotheses differ. For instance, in multiple regression analysis, the power for detecting an effect of a given size is related to the variance of the covariate. Since different covariates will have different variances, their powers will differ as well.
Any statistical analysis involving
It is also important to consider the statistical power of a hypothesis test when interpreting its results. A test's power is the probability of correctly rejecting the null hypothesis when it is false; a test's power is influenced by the choice of significance level for the test, the size of the effect being measured, and the amount of data available. A hypothesis test may fail to reject the null, for example, if a true difference exists between two populations being compared by a t-test but the effect is small and the sample size is too small to distinguish the effect from random chance.[7] Many clinical trials, for instance, have low statistical power to detect differences in adverse effects of treatments, since such effects may be rare and the number of affected patients small.[8]
A priori vs. post hoc analysis
Power analysis can either be done before (a priori or prospective power analysis) or after (post hoc or retrospective power analysis) data are collected. A priori power analysis is conducted prior to the research study, and is typically used in estimating sufficient sample sizes to achieve adequate power.
Application
Funding agencies, ethics boards and research review panels frequently request that a researcher perform a power analysis, for example to determine the minimum number of animal test subjects needed for an experiment to be informative.
Example
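The following is an example that shows how to compute power for a randomized experiment: Suppose the goal of an experiment is to study the effect of a treatment on some quantity, and compare research subjects by measuring the quantity before and after the treatment, analyzing the data using a paired t-test. Let $D_i$ denote the difference between the post-treatment and pre-treatment measures on subject $i$, with mean $\mu_D$.
The effect of the treatment can be analyzed using a one-sided t-test. The null hypothesis of no effect will be that the mean difference will be zero, i.e. $H_0: \mu_D = 0$. In this case, the alternative hypothesis states a positive effect, corresponding to $H_1: \mu_D > 0$. The test statistic is:

$$T_n = \frac{\bar{D}_n - 0}{\hat{\sigma}_D / \sqrt{n}} = \frac{\bar{D}_n}{\hat{\sigma}_D / \sqrt{n}},$$

where $\bar{D}_n = \frac{1}{n} \sum_{i=1}^{n} D_i$,
n is the sample size and $\hat{\sigma}_D / \sqrt{n}$ is the standard error. The test statistic under the null hypothesis follows a Student's t-distribution with $n - 1$ degrees of freedom if the differences are independent and normally distributed. For large n, this distribution can be approximated by a standard normal distribution, so the null hypothesis is rejected at significance level α when $T_n > z_{1-\alpha}$ (about 1.645 for α = 0.05).
Now suppose that the alternative hypothesis is true and $\mu_D = \theta$. Then, the power is

$$B(\theta) = \Pr\left(T_n > z_{1-\alpha} \mid \mu_D = \theta\right) = 1 - \Pr\left(\frac{\bar{D}_n - \theta}{\hat{\sigma}_D / \sqrt{n}} \le z_{1-\alpha} - \frac{\theta}{\hat{\sigma}_D / \sqrt{n}} \;\middle|\; \mu_D = \theta\right).$$

For large n, $\frac{\bar{D}_n - \theta}{\hat{\sigma}_D / \sqrt{n}}$ approximately follows a standard normal distribution when the alternative hypothesis is true, so the approximate power can be calculated as

$$B(\theta) \approx 1 - \Phi\left(z_{1-\alpha} - \frac{\theta}{\hat{\sigma}_D / \sqrt{n}}\right).$$

According to this formula, the power increases with the values of the parameter $\theta$. For a specific value of $\theta$ a higher power may be obtained by increasing the sample size n.
It is not possible to guarantee a sufficiently large power for all values of $\theta$, as $\theta$ may be very close to 0. The minimum (infimum) value of the power is equal to the significance level α of the test. If a power of at least $1 - \beta$ is required at a specific value of $\theta$, the condition is

$$B(\theta) \approx 1 - \Phi\left(z_{1-\alpha} - \frac{\theta \sqrt{n}}{\hat{\sigma}_D}\right) \ge 1 - \beta,$$

from which it follows that

$$\Phi\left(z_{1-\alpha} - \frac{\theta \sqrt{n}}{\hat{\sigma}_D}\right) \le \beta.$$

Hence, using the quantile function

$$\frac{\theta \sqrt{n}}{\hat{\sigma}_D} \ge z_{1-\alpha} - z_{\beta} = z_{1-\alpha} + z_{1-\beta},$$

where $z_p$ is a standard normal quantile; refer to the Probit article for an explanation of the relationship between $\Phi$ and z-values.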
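The calculation in this example can be sketched numerically. The code assumes the large-sample normal approximation for the one-sided paired test, with critical value z_{1−α}, so that the power is about 1 − Φ(z_{1−α} − θ√n/σ_D). The function names and the inputs (θ = 1, σ_D = 2, target power 0.90) are hypothetical.

```python
from math import ceil
from statistics import NormalDist

_STD = NormalDist()

def paired_power(theta, sigma_d, n, alpha=0.05):
    """Approximate power of the one-sided paired test against mean difference theta."""
    z_crit = _STD.inv_cdf(1 - alpha)
    return 1 - _STD.cdf(z_crit - theta * n ** 0.5 / sigma_d)

def required_n(theta, sigma_d, alpha=0.05, power=0.90):
    """Smallest n with theta * sqrt(n) / sigma_d >= z_{1-alpha} + z_{power}."""
    z = _STD.inv_cdf
    return ceil(((z(1 - alpha) + z(power)) * sigma_d / theta) ** 2)

n_min = required_n(theta=1.0, sigma_d=2.0)
print(n_min, paired_power(1.0, 2.0, n_min), paired_power(1.0, 2.0, n_min - 1))
```

In practice the standard deviation of the differences is unknown before the study, so the value passed as `sigma_d` would have to come from a pilot study or prior literature.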
Extension
Bayesian power
In the
Predictive probability of success
Both
Software for power and sample size calculations
Numerous free and/or open source programs are available for performing power and sample size calculations. These include
- G*Power (https://www.gpower.hhu.de/)
- WebPower Free online statistical power analysis (https://webpower.psychstat.org)
- Free and open source online calculators (https://powerandsamplesize.com)
- PowerUp! provides convenient Excel-based functions to determine minimum detectable effect size and minimum required sample size for various experimental and quasi-experimental designs.
- PowerUpR is the R package version of PowerUp! and additionally includes functions to determine sample size for various multilevel randomized experiments with or without budgetary constraints.
- R package pwr
- R package WebPower
- Python package statsmodels (https://www.statsmodels.org/)
See also
- Cohen's h – Measure of distance between two proportions
- Effect size – Statistical measure of the magnitude of a phenomenon
- Efficiency – Quality measure of a statistical method
- Neyman–Pearson lemma – Theorem in statistical testing
- Sample size – Statistical way of determining the sample size of a population
- Uniformly most powerful test – Hypothesis test
References
- ^ "Statistical power and underpowered statistics — Statistics Done Wrong". www.statisticsdonewrong.com. Retrieved 30 September 2019.
- ISSN 0277-6715
- ISBN 978-0-470-37796-3.
- ^ Wang, Xiaofeng; Ji, Xinge (2020). "Sample Size Estimation in Clinical Research: From Randomized Controlled Trials to Observational Studies". doi:10.1016/j.chest.2020.03.010.
- ISBN 0-521-81099-X.
- ^ Ellis, Paul D. (2010). The Essential Guide to Effect Sizes: An Introduction to Statistical Power, Meta-Analysis and the Interpretation of Research Results. United Kingdom: Cambridge University Press.
- ISBN 978-0521142465.
- PMID 19013761.
- ^ .
- ^ Thomas, L. (1997). "Retrospective power analysis" (PDF). Conservation Biology. 11 (1): 276–280.
Sources
- ISBN 0-8058-0283-5.
- Aberson, C.L. (2010). Applied Power Analysis for the Behavioral Sciences. ISBN 1-84872-835-2.