Sample Size for Comparing Two Means
Core summary
To compare a measurement between two groups, the sample size comes from four inputs you already know: significance (alpha), power, the smallest difference worth detecting, and the outcome's variability (SD). Software does the arithmetic; you choose honest inputs.
Detailed explanation
One of the most common planning questions is: how many patients do I need to compare a measurement between two groups, say mean blood pressure on a new drug versus standard care? This is the classic two-means (t-test) design, and its sample size comes from four ingredients you already know: the significance level alpha (usually 0.05, two-sided), the power you want (usually 80 to 90%), the smallest difference worth detecting (your minimal clinically important difference, MCID), and the variability of the outcome (its standard deviation, SD).
The practical workflow is short, and you never derive a formula by hand. First, decide the MCID, the smallest difference that would actually change practice (not the biggest you hope for). Second, get a realistic SD from a pilot study or a similar published paper. Third, fix alpha and power. Fourth, combine the difference and SD into a standardized effect size, Cohen's d = difference divided by SD; this single number is what most calculators want. Fifth, enter everything into software and read off the required N per group. Finally, inflate that number for expected dropout, because some participants will be lost to follow-up.
A feel for the numbers helps. A smaller MCID means you are hunting a subtler effect, so N rises steeply: the required N is proportional to 1 divided by d squared, so halving the detectable difference roughly quadruples N. More variable outcomes (a larger SD) push N up in the same way, and higher power and a stricter alpha both increase N. This is why a vague, over-optimistic 'we will detect a huge effect' plan collapses into an underpowered study: realistic inputs usually demand more patients than people expect.
Three pitfalls cause most trouble. Guessing the SD instead of sourcing it gives a meaningless N. Forgetting dropout leaves you underpowered when the data arrive. And computing power after the study (post-hoc) to explain a negative result is circular and discouraged; power is a planning tool, used before data collection.
Report your assumptions transparently: state the difference, the SD, alpha, power, the software used, and the dropout adjustment, so reviewers can reproduce the calculation. The arithmetic belongs to the software; your job is to choose honest, defensible inputs and to interpret the result against feasibility and ethics.
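The sensitivities described in this section can be sketched in a few lines of Python. This is an illustration only, not a replacement for validated software: it assumes the textbook normal approximation for two equal-sized groups, and the helper name n_per_group and the alternative inputs (a 2.5 mmHg difference, an 18 mmHg SD) are made up for the demonstration. Calculators based on the t distribution usually return a slightly larger N.

```python
from statistics import NormalDist


def n_per_group(diff, sd, alpha=0.05, power=0.80):
    """Approximate, unrounded N per group for a two-sided comparison of two means.

    Normal-approximation sketch: n = 2 * (z_{1-alpha/2} + z_{power})^2 / d^2,
    where d = diff / sd is Cohen's d. Hypothetical helper, for illustration.
    """
    z = NormalDist()                      # standard normal distribution
    d = diff / sd                         # standardized effect size
    z_alpha = z.inv_cdf(1 - alpha / 2)    # critical value for two-sided alpha
    z_power = z.inv_cdf(power)            # quantile matching the desired power
    return 2 * (z_alpha + z_power) ** 2 / d ** 2


# Baseline inputs (illustrative): 5 mmHg difference, SD 12 mmHg
base = n_per_group(diff=5, sd=12)

# Change one input at a time and compare against the baseline
half_diff = n_per_group(diff=2.5, sd=12)            # smaller difference to detect
bigger_sd = n_per_group(diff=5, sd=18)              # more variable outcome
more_power = n_per_group(diff=5, sd=12, power=0.90) # higher power
stricter_alpha = n_per_group(diff=5, sd=12, alpha=0.01)  # stricter alpha

print(f"baseline N per group:      {base:.1f}")
print(f"half the difference:       x{half_diff / base:.2f}")
print(f"SD 1.5 times larger:       x{bigger_sd / base:.2f}")
print(f"90% power instead of 80%:  x{more_power / base:.2f}")
print(f"alpha 0.01 instead of 0.05: x{stricter_alpha / base:.2f}")
```

Because N depends on the square of d, the first two multipliers are exactly 4 and 2.25 under this approximation, which is the 'rises steeply' behaviour described above.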
Clinical example
A cardiology team plans a trial of a new antihypertensive versus standard care, with mean systolic blood pressure at 12 weeks as the outcome. From a similar published trial the SD is about 12 mmHg, and they judge a 5 mmHg difference as the smallest worth detecting. With 80% power and a two-sided alpha of 0.05, a calculator returns about 90 patients per group; allowing for 15% dropout, they plan to recruit about 106 per group (90 divided by 0.85, rounded up).
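As a rough cross-check of this example, the same inputs can be run through the normal-approximation formula. The sketch below is an assumption-laden illustration, not the calculator the team used: the variable names are invented here, and a t-based calculator would typically add a patient or two per group.

```python
import math
from statistics import NormalDist

# Inputs taken from the clinical example
diff_mmhg = 5.0    # smallest difference worth detecting
sd_mmhg = 12.0     # SD from a similar published trial
alpha = 0.05       # two-sided significance level
power = 0.80       # desired power
dropout = 0.15     # expected proportion lost to follow-up

# Standardized effect size (Cohen's d)
d = diff_mmhg / sd_mmhg

# Normal-approximation N per group (assumption: equal group sizes)
z = NormalDist()
n_raw = 2 * (z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)) ** 2 / d ** 2

# The example rounds the calculator output to "about 90" before inflating
n_planned = 90

# Divide by the expected retention so that about n_planned remain at the end
n_recruit = math.ceil(n_planned / (1 - dropout))

print(f"Cohen's d: {d:.2f}")
print(f"Approximate N per group before rounding: {n_raw:.1f}")
print(f"Recruit per group after dropout adjustment: {n_recruit}")
```

Dividing by the retention rate (0.85) rather than multiplying by 1.15 is deliberate: multiplying would leave the trial slightly short once 15% of the recruited participants are lost.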
Research example
Reviewers reject a manuscript whose sample size was justified only as 'based on feasibility'. The authors had found no difference, but with 20 patients per group and an SD of 12 mmHg the study had far too little power to detect their 5 mmHg target. A non-significant result from such a study cannot distinguish 'no effect' from 'effect missed', a textbook setting for a Type II error that a proper a priori calculation would have prevented.
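How little power this design had can be estimated with a short sketch. It uses the prespecified 5 mmHg target rather than the observed effect, so it describes the design and is not a post-hoc power calculation. The function approx_power is a hypothetical helper based on the normal approximation; a t-based calculation gives a slightly lower figure at small N.

```python
import math
from statistics import NormalDist


def approx_power(n_per_group, diff, sd, alpha=0.05):
    """Approximate power of a two-sided, two-group comparison of means.

    Normal-approximation sketch with equal group sizes (hypothetical helper).
    """
    z = NormalDist()
    d = diff / sd                              # Cohen's d for the target difference
    shift = d * math.sqrt(n_per_group / 2)     # expected value of the test statistic
    z_alpha = z.inv_cdf(1 - alpha / 2)         # two-sided critical value
    # Probability that the statistic falls in either rejection region
    return z.cdf(shift - z_alpha) + z.cdf(-shift - z_alpha)


# Design from the rejected manuscript: 20 per group, SD 12 mmHg, 5 mmHg target
power_20 = approx_power(n_per_group=20, diff=5, sd=12)

# Same target with 90 per group, for comparison
power_90 = approx_power(n_per_group=90, diff=5, sd=12)

print(f"Approximate power with 20 per group: {power_20:.0%}")
print(f"Approximate power with 90 per group: {power_90:.0%}")
```

Under these assumptions the 20-per-group design would detect a true 5 mmHg difference only about one time in four, against roughly four times in five with 90 per group.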
Knowledge check
Q1. In a two-means sample-size calculation, which input usually requires a pilot study or prior literature?
Q2. A team wants to detect a smaller difference than originally planned, keeping everything else the same. What happens to the required sample size?
Q3. Why should sample size be calculated before the study rather than after?