For trials that reject the null hypothesis, why do people still talk about statistical power?


Hello! There are many ELI5s about statistical power, but I didn’t see one asking quite what I’m wondering about – the basic question is:

If power is related to the number of people required to detect a difference between groups, is it important that a trial was underpowered if it found a statistically significant difference between groups?

It seems like maybe not(?) since at that point you can no longer have made a Type 2 error, only Type 1? But being underpowered is often discussed, so I’m curious why.

I’m not sure what studies you’re reading, but I agree that it would be odd for someone to say “Our trial rejected the null, but it was underpowered, so we can’t come to a clear conclusion.” In fact, it seems more likely they would say “Our trial rejected the null despite being underpowered, so we likely found a very strong relationship.”

That said, there’s a different reason why power can be important when the null is rejected. If your sample is so large that you have enough power to detect even tiny differences between treatment and control, then it’s possible to report a statistically significant difference that is of little practical consequence. For example, we usually don’t have the power to detect that an experimental treatment increases survival rates by 0.000001%. Supposing we somehow did, it’s unlikely that the treatment would be worthwhile, even though its effect was “significant.” This is rarely an issue in clinical trials but can come up when people are studying large datasets like the Census.
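To make the "huge sample, tiny effect" point concrete, here is a rough sketch using a two-sided z-test for two proportions. The survival rates (50.0% vs 50.1%) and the sample sizes are made up for illustration, and `two_proportion_z_test` is just a helper name I picked, not something from a particular library.

```python
import math


def two_proportion_z_test(successes_a, n_a, successes_b, n_b):
    """Two-sided z-test for a difference between two proportions.

    Returns the z statistic and the p-value (normal approximation).
    """
    p_a = successes_a / n_a
    p_b = successes_b / n_b
    # Pooled proportion under the null hypothesis of no difference
    pooled = (successes_a + successes_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided tail probability of a standard normal
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical numbers: the same 0.1 percentage-point difference
# in survival, observed at two very different sample sizes.
for n in (1_000, 10_000_000):
    control_survivors = round(0.500 * n)
    treated_survivors = round(0.501 * n)
    z, p = two_proportion_z_test(control_survivors, n, treated_survivors, n)
    print(f"n per group = {n:>10,}  z = {z:5.2f}  p = {p:.2g}")
```

With 1,000 people per group the difference is nowhere near significant; with 10 million per group the exact same difference is overwhelmingly "significant", even though nothing about its practical importance has changed.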

TLDR: when you are underpowered, there are several ways the risk of a false positive can creep in. So even if you reject your null, the p-value likely overstates how confident you should be: the chance that the result is a false positive is higher than the p-value alone suggests.

– As u/Twin_Spoons said, when you are underpowered you can only reject the null when the estimated effect size is “big”. In many situations, there are limits to plausible effect sizes, so if you reject the null on the back of a HUGE effect size, you should have more doubt than the p-value would indicate. The result could be due to a few outliers, cross-treatment contamination, flawed experimental design, etc.

– If your sample size is low, the distribution of standard “test statistics” (t-values, z-scores) is less likely to be “well behaved”, which is one reason people turn to resampling methods such as “bootstrapping”. The smaller your sample, the more weight rests on assumptions about the data, and again this can lead to an increased false positive risk compared to what a standard p-value suggests.
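The "only big effects can reach significance" point can be checked with a quick simulation. This is a sketch under simplifying assumptions that are mine, not the posters': outcomes are normal with a known standard deviation of 1 (so a plain z-test applies), the true effect is 0.3 standard deviations, and `simulate_trials` is a made-up helper name.

```python
import math
import random
import statistics


def simulate_trials(true_effect, n_per_group, n_trials=5_000, seed=0):
    """Simulate many two-arm trials and look only at the significant ones.

    Returns (share of trials that reject the null,
             mean absolute effect estimate among those trials).
    Assumes outcome SD is known and equal to 1, so a z-test is used.
    """
    rng = random.Random(seed)
    se = math.sqrt(2 / n_per_group)  # standard error of the difference in means
    z_crit = 1.96                    # two-sided 5% significance threshold
    significant_estimates = []
    for _ in range(n_trials):
        control = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
        treated = [rng.gauss(true_effect, 1.0) for _ in range(n_per_group)]
        estimate = statistics.fmean(treated) - statistics.fmean(control)
        if abs(estimate) / se > z_crit:
            significant_estimates.append(estimate)
    power = len(significant_estimates) / n_trials
    mean_abs = statistics.fmean([abs(e) for e in significant_estimates])
    return power, mean_abs


# Hypothetical setup: same true effect, small vs. larger trial.
for n in (20, 200):
    power, mean_abs = simulate_trials(true_effect=0.3, n_per_group=n)
    print(f"n per group = {n:>3}  share significant = {power:.2f}  "
          f"mean |estimate| when significant = {mean_abs:.2f}")
```

With 20 people per group, a result can only be significant if the estimated difference exceeds about 1.96 × sqrt(2/20) ≈ 0.62, which is more than double the true effect of 0.3. So every "successful" small trial in this setup overstates the effect, while the larger trials that reach significance land much closer to the truth.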