Section 3.67 min read

Post-Hoc Tests and Multiple Comparisons

Core summary

When you make many comparisons, the chance of a false positive balloons. Post-hoc tests and corrections (like Bonferroni) keep the overall error rate under control.

Detailed explanation

Every statistical test run at the conventional alpha of 0.05 carries a 5% chance of a false positive (a Type I error) when there is no real effect. Run many tests and those risks accumulate: with 20 independent comparisons you would expect about one 'significant' result by chance alone even if nothing real is happening, and the chance of at least one is about 64% (1 - 0.95^20). This is the multiple comparisons problem, and it appears whenever you test many subgroups, many outcomes, or all pairs of groups after an ANOVA.

Two situations need care. First, post-hoc tests after a significant ANOVA: having learned that at least one group differs, you compare pairs to find which, using procedures like Tukey's HSD that are designed for all pairwise comparisons and build in the correction. Second, general multiple testing: when you run many tests, you tighten the threshold. The simplest correction is Bonferroni: divide your alpha by the number of tests (for example 0.05 / 10 = 0.005), so each test must clear a stricter bar. Bonferroni is conservative and can miss real effects; alternatives like Holm or the Benjamini-Hochberg false discovery rate (FDR) procedure are less strict and often preferred when there are many comparisons.

The deeper protection is planning: pre-specify a small number of primary comparisons rather than testing everything and reporting only what 'worked' (a form of p-hacking). Findings that emerge from many unplanned comparisons should be labeled as hypothesis-generating, not confirmatory.

Pitfalls: reporting an unadjusted 'significant' subgroup out of dozens tested; using Bonferroni so aggressively that it destroys power (consider FDR); and forgetting that correction applies to the family of tests, not to a single pre-planned primary comparison.

In clinical trials this is handled by design: a single pre-specified primary outcome is tested without penalty, while the many secondary and subgroup analyses are explicitly labeled exploratory or carry a pre-planned correction.
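To make the accumulation of false-positive risk concrete, here is a minimal sketch that assumes the tests are independent and that no real effects exist. The function names are illustrative, not from any library.

```python
def expected_false_positives(n_tests: int, alpha: float = 0.05) -> float:
    """Expected number of 'significant' results when every null hypothesis is true."""
    return n_tests * alpha


def familywise_error_rate(n_tests: int, alpha: float = 0.05) -> float:
    """Chance of at least one false positive across independent tests."""
    return 1.0 - (1.0 - alpha) ** n_tests


if __name__ == "__main__":
    for n in (1, 5, 10, 20):
        print(
            f"{n:>2} tests: expect {expected_false_positives(n):.2f} false positives, "
            f"P(at least one) = {familywise_error_rate(n):.3f}"
        )
```

For 20 tests this gives an expected count of 1 and a probability of roughly 0.64 of at least one false positive. Real tests are often correlated, so treat these figures as a rough guide, not an exact prediction.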
When you do correct, match the method to the goal: Bonferroni or Holm to strictly control any false positive, or the Benjamini-Hochberg false discovery rate to tolerate a small, known fraction of false discoveries in exchange for more power, common in high-dimensional work like genomics.
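The three corrections named above can be compared side by side. The following is a plain-Python sketch of the standard adjusted p-value calculations; the function names and the p-values are hypothetical, chosen only for illustration. In practice a statistics package would normally do this for you.

```python
from typing import List, Sequence


def bonferroni(pvals: Sequence[float]) -> List[float]:
    """Multiply each p-value by the number of tests, capped at 1."""
    m = len(pvals)
    return [min(1.0, p * m) for p in pvals]


def holm(pvals: Sequence[float]) -> List[float]:
    """Holm step-down: the smallest p is multiplied by m, the next by m - 1, and so on."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        value = min(1.0, (m - rank) * pvals[i])
        running_max = max(running_max, value)  # keep adjusted values non-decreasing
        adjusted[i] = running_max
    return adjusted


def benjamini_hochberg(pvals: Sequence[float]) -> List[float]:
    """Benjamini-Hochberg step-up: scale the k-th smallest p by m / k."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m - 1, -1, -1):  # walk from the largest p down to the smallest
        i = order[rank]
        value = min(1.0, pvals[i] * m / (rank + 1))
        running_min = min(running_min, value)  # keep adjusted values non-decreasing in p
        adjusted[i] = running_min
    return adjusted


def count_significant(adjusted: Sequence[float], alpha: float = 0.05) -> int:
    """Number of tests whose adjusted p-value is at or below alpha."""
    return sum(1 for p in adjusted if p <= alpha)


if __name__ == "__main__":
    # Hypothetical p-values from five tests, for illustration only.
    example = [0.001, 0.008, 0.012, 0.035, 0.20]
    for name, method in (("Bonferroni", bonferroni), ("Holm", holm),
                         ("Benjamini-Hochberg", benjamini_hochberg)):
        adj = method(example)
        print(f"{name}: {[round(p, 4) for p in adj]} -> {count_significant(adj)} significant")
```

With these made-up p-values, Bonferroni keeps 2 results at the 0.05 level, Holm keeps 3, and Benjamini-Hochberg keeps 4. That ordering reflects the trade-off described above: the FDR procedure gains power by controlling a different, more lenient error rate.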

Clinical example

After a significant ANOVA across three doses, a researcher uses Tukey's HSD and finds only the high-versus-low comparison differs; the medium dose is not distinguishable from either.

Research example

A trial tested 15 secondary outcomes and one was 'significant' at p = 0.04. After Bonferroni correction (threshold 0.05 / 15 ≈ 0.0033) it is no longer significant, so the authors report it as exploratory.
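The arithmetic in this example can be checked in a few lines. This is only a sketch using the numbers quoted above; the variable names are illustrative.

```python
alpha = 0.05       # conventional significance level
n_tests = 15       # secondary outcomes tested in the trial
p_observed = 0.04  # the one nominally 'significant' result

# Bonferroni: each test must clear alpha divided by the number of tests.
threshold = alpha / n_tests

# Equivalent view: inflate the p-value instead of shrinking the threshold.
adjusted_p = min(1.0, p_observed * n_tests)

still_significant = p_observed <= threshold

print(f"Bonferroni threshold: {threshold:.4f}")
print(f"Bonferroni-adjusted p: {adjusted_p:.2f}")
print(f"Significant after correction: {still_significant}")
```

Comparing the raw p-value with the stricter threshold and comparing the adjusted p-value with 0.05 are two ways of stating the same decision.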

Knowledge check

Q1. If you run 20 independent tests at p < 0.05 with no real effects, about how many 'significant' results do you expect by chance?

Q2. Using a Bonferroni correction for 10 tests, the significance threshold becomes:

Q3. Which post-hoc test is commonly used for all pairwise comparisons after a significant ANOVA?