Section 2.2 (10 min read)

Cohort Studies: Prospective and Retrospective

Core summary

A cohort study follows a group of people over time to see who develops an outcome. Unlike case-control studies, cohort studies start with exposure status and look forward to outcomes, which allows calculation of incidence and relative risk.

Detailed explanation

Cohort studies are the strongest observational design for establishing temporal sequence, that is, for showing that exposure came before outcome. You classify participants by exposure status at baseline, while they are still free of the outcome, then follow them over time to see who develops it. The word 'cohort' comes from the Latin 'cohors,' a unit of Roman soldiers who marched together, and this is what the design does: it marches a group of people forward through time.

A prospective cohort study recruits participants now and follows them into the future. The researcher defines the exposed and unexposed groups at the start, collects baseline measurements, and then monitors the cohort at regular intervals for the development of outcomes. This is often considered the gold standard of observational research for three reasons: the researcher controls data collection quality, exposures are measured before outcomes occur (which removes ambiguity about temporal order), and multiple outcomes can be studied in the same cohort. The major disadvantages are cost (potentially millions of dollars), time (diseases such as cancer may require decades of follow-up), and attrition (participants drop out, move away, or die from other causes).

A retrospective cohort study uses historical records: you identify a past cohort (for example, all employees who worked at a chemical plant between 2000 and 2010), classify them by exposure status using those records, and trace their outcomes forward to the present using medical records, death certificates, or registry data. The key insight is that this is NOT a case-control study: you are still following the same cohort logic (exposure → outcome), just using data from the past.

Both prospective and retrospective cohort studies can calculate incidence rates, relative risk (RR), absolute risk difference, and number needed to harm (NNH).
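
The incidence rate mentioned above can be sketched in a few lines of Python. This is a minimal illustration, not a full analysis: the Participant record, the incidence_rate helper, and the seven-person cohort are all hypothetical and exist only to show how person-time handles participants who are followed for different lengths of time (for example, because of attrition).

```python
from dataclasses import dataclass


@dataclass
class Participant:
    exposed: bool
    years_at_risk: float  # follow-up time until event, dropout, or study end
    had_event: bool


def incidence_rate(participants, exposed, per=1000):
    """Events per `per` person-years in the exposed or unexposed group."""
    group = [p for p in participants if p.exposed == exposed]
    events = sum(p.had_event for p in group)
    person_years = sum(p.years_at_risk for p in group)
    return per * events / person_years


# Hypothetical toy cohort (invented numbers, for illustration only).
cohort = [
    Participant(exposed=True, years_at_risk=10.0, had_event=False),
    Participant(exposed=True, years_at_risk=4.0, had_event=True),
    Participant(exposed=True, years_at_risk=6.0, had_event=False),   # lost to follow-up
    Participant(exposed=False, years_at_risk=10.0, had_event=False),
    Participant(exposed=False, years_at_risk=10.0, had_event=False),
    Participant(exposed=False, years_at_risk=8.0, had_event=True),
    Participant(exposed=False, years_at_risk=2.0, had_event=False),  # lost to follow-up
]

rate_exposed = incidence_rate(cohort, exposed=True)
rate_unexposed = incidence_rate(cohort, exposed=False)
rate_ratio = rate_exposed / rate_unexposed

print(f"Exposed:   {rate_exposed:.1f} per 1,000 person-years")
print(f"Unexposed: {rate_unexposed:.1f} per 1,000 person-years")
print(f"Incidence rate ratio: {rate_ratio:.2f}")
```

Dividing by person-years rather than by the number of people means a participant who drops out after 2 years contributes only 2 years to the denominator, which is why incidence rates are a natural fit for cohorts with uneven follow-up.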

The key practical differences between the two designs come down to data source, cost, and time. A prospective cohort collects data you design yourself (higher quality, and you choose what to measure), but it costs more and takes longer. A retrospective cohort uses existing records, which makes it faster and cheaper, but you are limited to whatever variables were recorded and must accept data quality problems such as missing values or inconsistent coding.

A common confusion is between a retrospective cohort study and a case-control study. They are fundamentally different designs. A retrospective cohort starts with exposure and follows forward (even if using past records). A case-control study starts with the outcome and looks backward.

Major threats to cohort studies include attrition bias (people who leave the study differ systematically from those who stay; for example, sicker patients may drop out more often), confounding by indication (in pharmacoepidemiology, sicker patients receive more aggressive treatment, so the treated group can appear to have worse outcomes even when the treatment is not harmful), the healthy worker effect (employed cohorts tend to be healthier than the general population), and information bias (exposure or outcome is measured inaccurately or differently between groups). Strategies to handle confounders include multivariable regression, propensity score matching, and stratification.

In later levels of this course, you will learn how to conduct each of these study designs step by step from zero: from writing the protocol to collecting data to analyzing results and writing the manuscript.
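
To make stratification concrete, here is a minimal Python sketch of a Mantel-Haenszel pooled risk ratio. The Mantel-Haenszel estimator is one standard way to combine strata; the lesson itself does not prescribe a method. The function names and the counts are hypothetical, and the numbers were deliberately constructed so that smoking fully explains the crude association.

```python
def risk_ratio(exposed_events, exposed_total, unexposed_events, unexposed_total):
    """Risk in the exposed group divided by risk in the unexposed group."""
    return (exposed_events / exposed_total) / (unexposed_events / unexposed_total)


def mantel_haenszel_rr(strata):
    """Pooled risk ratio across strata of a confounder (Mantel-Haenszel weights).

    Each stratum is (exposed_events, exposed_total, unexposed_events, unexposed_total).
    """
    numerator = 0.0
    denominator = 0.0
    for a, n1, c, n0 in strata:
        total = n1 + n0
        numerator += a * n0 / total
        denominator += c * n1 / total
    return numerator / denominator


# Hypothetical counts, stratified by smoking status (invented for illustration).
strata = {
    "smokers": (30, 300, 10, 100),
    "non-smokers": (4, 200, 8, 400),
}

# Crude analysis: ignore smoking and pool everyone into a single 2x2 table.
crude_rr = risk_ratio(
    sum(s[0] for s in strata.values()),
    sum(s[1] for s in strata.values()),
    sum(s[2] for s in strata.values()),
    sum(s[3] for s in strata.values()),
)
adjusted_rr = mantel_haenszel_rr(strata.values())

for name, counts in strata.items():
    print(f"RR among {name}: {risk_ratio(*counts):.2f}")
print(f"Crude RR: {crude_rr:.2f}")
print(f"Mantel-Haenszel adjusted RR: {adjusted_rr:.2f}")
```

In this constructed example the exposed group contains far more smokers, so the crude RR is well above 1 even though the RR within each smoking stratum is exactly 1. Real data rarely separate this cleanly, but the pattern (crude estimate far from the stratum-specific estimates) is the signature of confounding.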

Clinical example

You want to study whether night-shift work increases cardiovascular disease risk among nurses. You recruit 1,000 nurses from three hospitals, all free of cardiovascular disease at enrollment: 500 who work permanent night shifts (the exposed group) and 500 who work day shifts only (the unexposed group). At baseline, you record age, BMI, smoking status, blood pressure, cholesterol, family history, and existing medications. You then follow both groups for 10 years, with annual health assessments and cardiac event monitoring.

After 10 years, you find that 40 of 500 (8%) night-shift nurses experienced a major cardiovascular event (MI, stroke, or cardiovascular death) compared to 15 of 500 (3%) day-shift nurses. The relative risk is 8%/3% = 2.67, meaning night-shift nurses had 2.67 times the 10-year risk of a cardiovascular event. The absolute risk difference is 8% - 3% = 5%, meaning that for every 100 night-shift nurses, 5 extra cardiovascular events occurred compared to day-shift nurses. The number needed to harm is 1/0.05 = 20: one additional event for every 20 nurses working night shifts over the 10 years.

Because you measured exposure (shift type) before the outcome occurred, you have established temporal sequence, a key requirement for causal inference. Temporal sequence alone does not rule out confounding, which is why the baseline covariates were recorded: they allow you to adjust for differences between the two groups in the analysis.
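
The arithmetic in this example can be reproduced with a short Python sketch. The cohort_measures function is a hypothetical helper, not part of any library. The 95% confidence interval is an addition that the example does not report; it uses the common log-scale (Katz) approximation and is included only to show how much uncertainty surrounds a relative risk from a cohort of this size.

```python
import math


def cohort_measures(exposed_events, exposed_total, unexposed_events, unexposed_total):
    """Risk, relative risk, risk difference, and NNH from a 2x2 cohort table."""
    risk_exposed = exposed_events / exposed_total
    risk_unexposed = unexposed_events / unexposed_total
    rr = risk_exposed / risk_unexposed
    rd = risk_exposed - risk_unexposed

    # Approximate 95% CI for the RR on the log scale (Katz method).
    se_log_rr = math.sqrt(
        1 / exposed_events - 1 / exposed_total
        + 1 / unexposed_events - 1 / unexposed_total
    )
    ci_low = math.exp(math.log(rr) - 1.96 * se_log_rr)
    ci_high = math.exp(math.log(rr) + 1.96 * se_log_rr)

    # NNH is only meaningful when the exposed group has the higher risk.
    nnh = 1 / rd if rd > 0 else math.inf

    return {
        "risk_exposed": risk_exposed,
        "risk_unexposed": risk_unexposed,
        "relative_risk": rr,
        "rr_ci_95": (ci_low, ci_high),
        "risk_difference": rd,
        "nnh": nnh,
    }


# Figures from the night-shift example: 40/500 exposed, 15/500 unexposed.
result = cohort_measures(40, 500, 15, 500)

print(f"Risk (night shift): {result['risk_exposed']:.1%}")
print(f"Risk (day shift):   {result['risk_unexposed']:.1%}")
print(f"Relative risk:      {result['relative_risk']:.2f}")
print(f"95% CI for RR:      {result['rr_ci_95'][0]:.2f} to {result['rr_ci_95'][1]:.2f}")
print(f"Risk difference:    {result['risk_difference']:.1%}")
print(f"NNH:                {result['nnh']:.0f}")
```

The point estimates match the worked example (RR 2.67, risk difference 5%). The interval is wide because only 55 events occurred in total, which is a reminder that the precision of a cohort estimate depends on the number of events, not just the number of participants.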

Research example

The Nurses' Health Study (NHS), launched in 1976 by Harvard's Frank Speizer, enrolled 121,700 female registered nurses aged 30-55 across the United States. Participants complete detailed questionnaires every two years about diet, lifestyle, medications, and health outcomes. NHS has now been running for nearly 50 years and has produced over 3,000 peer-reviewed publications. Key findings include: linking trans fat intake to coronary heart disease, quantifying the relationship between hormone replacement therapy and breast cancer risk, finding that regular aspirin use is associated with lower colorectal cancer risk, and showing that the Mediterranean diet is associated with lower cardiovascular mortality. The study also spawned NHS II (1989, younger nurses) and NHS III (2010, both sexes). The power of a large prospective cohort lies in its ability to study multiple outcomes from the same well-characterized population, so the high up-front cost is spread across many research questions, making it one of the most cost-effective long-term designs in epidemiology.

An important retrospective cohort example: researchers used 15 years of employment records from a semiconductor factory to classify workers by cumulative solvent exposure, then linked those records to cancer registry data. They found elevated leukemia risk among heavily exposed workers, a result that would have required an impractically long prospective study to generate.

Knowledge check

Q1. What is the primary advantage of cohort studies over case-control studies?

Q2. What is the difference between a prospective and a retrospective cohort study?

Q3. What measure of association can a cohort study calculate that a case-control study cannot?