Section 2.3 (7 min read)

Registry-Based and Database Studies

Core summary

Registry-based and database studies use data that was already collected for clinical care, billing, or public health surveillance. They offer enormous sample sizes and real-world generalizability but sacrifice the control and precision of prospective data collection.

Detailed explanation

These studies take data that was already collected for clinical care, billing, or public health surveillance and repurpose it for research. They are increasingly central to modern clinical research because electronic health records (EHRs) and administrative databases make vast amounts of data accessible at relatively low cost.
It is critical to understand that 'registry-based study' and 'database study' describe the DATA SOURCE (where the data comes from), not the study design. The study design could be cross-sectional, case-control, cohort, or even a type of quasi-experimental design. For example, you can conduct a retrospective cohort study using registry data, or a case-control study using an insurance claims database.
The key distinction between a clinical registry and an administrative database lies in their original purpose. A clinical registry (like the STS Cardiac Surgery Database, SEER Cancer Registry, or National Trauma Data Bank) was specifically designed to capture clinical variables for quality improvement or research. Data are entered by trained abstractors using standardized definitions, so clinical detail and accuracy tend to be high. However, registries typically cover a specific disease or procedure, not the full range of a patient's health.
An administrative database (like Medicare claims, Medicaid records, or insurance billing data) was designed for billing and reimbursement, not research. It uses ICD and CPT codes to document diagnoses and procedures. These databases cover enormous populations (Medicare covers over 65 million Americans) but suffer from coding inaccuracy: a diagnosis code entered for reimbursement may not match the true clinical picture.
An EHR data warehouse sits somewhere in between: it contains clinical notes, lab results, and medications (richer than billing data), but the data was entered for clinical care rather than research, so it may be inconsistent or incomplete.
How does this relate to a retrospective cohort study? A retrospective cohort is a study DESIGN: you identify a past cohort, classify its members by exposure, and follow them forward in time to the outcome. You can do a retrospective cohort study using ANY of these data sources: a registry, a claims database, or an EHR. The data source determines what variables are available and how reliable they are. The study design determines how you analyze them. Think of it this way: the data source is the KITCHEN (what ingredients you have), and the study design is the RECIPE (how you combine them).
Strengths of studies using existing data include massive sample sizes (tens of thousands to millions of patients), long follow-up periods that can be assembled retrospectively, data that reflects real-world clinical practice (not artificial trial conditions), and low cost per patient.
Limitations include data that was not collected for research (so confounders like smoking status or BMI are often missing), coding inaccuracy (ICD codes assigned for billing may not match true clinical diagnoses), survival bias in prevalent cohorts, limited granularity (you cannot determine medication adherence from prescription fills alone), and the inability to verify data through direct patient contact.
In later levels of this course, you will learn how to conduct each of these study designs step by step, starting from scratch: writing the protocol, collecting data, analyzing results, and writing the manuscript.
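The three steps of a retrospective cohort described above (identify a past cohort, classify by exposure, follow forward to the outcome) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Record` fields, the patient IDs, and the eight example patients are all hypothetical, and the calculation is a crude cumulative risk that ignores censoring and confounding, both of which a real analysis would need to handle.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class Record:
    """One row pulled from an existing data source (hypothetical fields)."""
    patient_id: str
    index_date: date               # date the patient entered the cohort
    exposed: bool                  # exposure classification at baseline
    outcome_date: Optional[date]   # None if no outcome was recorded


def had_outcome_within(rec: Record, years: float) -> bool:
    """True if the outcome occurred on or after the index date, within the window."""
    if rec.outcome_date is None:
        return False
    days = (rec.outcome_date - rec.index_date).days
    return 0 <= days <= years * 365.25


def cumulative_risk(records, exposed: bool, years: float) -> float:
    """Proportion of one exposure group with the outcome inside the window."""
    group = [r for r in records if r.exposed == exposed]
    if not group:
        raise ValueError("no patients in this exposure group")
    events = sum(had_outcome_within(r, years) for r in group)
    return events / len(group)


def risk_ratio(records, years: float) -> float:
    """Risk in the exposed divided by risk in the unexposed."""
    unexposed_risk = cumulative_risk(records, False, years)
    if unexposed_risk == 0:
        raise ValueError("risk ratio undefined: no events among the unexposed")
    return cumulative_risk(records, True, years) / unexposed_risk


# Made-up cohort: four exposed and four unexposed patients.
records = [
    Record("A1", date(2010, 1, 1), True, date(2012, 6, 1)),
    Record("A2", date(2010, 3, 1), True, None),
    Record("A3", date(2011, 1, 1), True, date(2013, 1, 1)),
    Record("A4", date(2011, 5, 1), True, date(2019, 1, 1)),   # outside 5 years
    Record("B1", date(2010, 2, 1), False, None),
    Record("B2", date(2010, 7, 1), False, date(2014, 7, 1)),
    Record("B3", date(2011, 2, 1), False, None),
    Record("B4", date(2011, 9, 1), False, None),
]

print(f"5-year risk ratio: {risk_ratio(records, 5):.2f}")  # prints 2.00
```

Notice that nothing in this sketch depends on whether the rows came from a registry, a claims database, or an EHR. That is the kitchen-versus-recipe point: the design stays the same, while the source decides which fields exist and how far you can trust them.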

Clinical example

You want to study whether patients who receive a particular type of heart valve prosthesis (mechanical vs. bioprosthetic) have different 10-year survival rates. Conducting a prospective cohort would require 10+ years and enormous funding. Instead, you access two data sources:
1. The STS National Database (clinical registry): contains detailed operative variables (valve type, concomitant procedures, preoperative ejection fraction, STS risk scores) with high accuracy because data abstractors use standardized definitions.
2. CMS Medicare claims data (administrative database): provides long-term follow-up information (hospital readmissions, subsequent procedures, and mortality) from billing records.
By linking these two sources, you create a powerful retrospective cohort study that combines the clinical detail of the registry with the long-term outcomes from administrative data. This is a retrospective cohort study (design) using both a clinical registry and an administrative database (data sources). In kitchen terms, both sources supply ingredients: the STS data tells you exactly what happened in surgery, and the Medicare data tells you what happened afterward. The retrospective cohort design is the recipe that combines them.
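The linkage step in this example can be sketched as a simple join on a patient identifier. Everything here is hypothetical: the field names, the `patient_id` key, and the rows are invented for illustration and do not reflect the actual STS or CMS data layouts. The sketch also assumes a clean shared identifier exists; real linkages often have no such key and instead match on combinations of indirect identifiers, which is considerably harder.

```python
# Hypothetical registry rows: rich operative detail, no long-term follow-up.
registry = [
    {"patient_id": "P001", "valve_type": "mechanical", "surgery_year": 2008},
    {"patient_id": "P002", "valve_type": "bioprosthetic", "surgery_year": 2008},
    {"patient_id": "P003", "valve_type": "mechanical", "surgery_year": 2009},
]

# Hypothetical claims rows: long-term outcomes, little clinical detail.
claims = [
    {"patient_id": "P001", "death_year": None, "readmissions": 1},
    {"patient_id": "P002", "death_year": 2015, "readmissions": 3},
    {"patient_id": "P004", "death_year": None, "readmissions": 0},
]


def link_records(registry_rows, claims_rows, key="patient_id"):
    """Join registry rows to claims rows on a shared key.

    Returns (linked, unlinked): merged rows for patients found in both
    sources, and the keys of registry patients with no claims match.
    """
    claims_by_id = {row[key]: row for row in claims_rows}
    linked, unlinked = [], []
    for reg in registry_rows:
        claim = claims_by_id.get(reg[key])
        if claim is None:
            unlinked.append(reg[key])
        else:
            # Merge the operative detail with the follow-up outcomes.
            linked.append({**reg, **claim})
    return linked, unlinked


linked, unlinked = link_records(registry, claims)
print(f"linked {len(linked)} of {len(registry)} registry patients")
print("no claims match for:", unlinked)
```

Reporting the unlinked patients is a deliberate choice. Registry patients who cannot be matched to claims (for example, because they are not Medicare beneficiaries) drop out of the analysis, so the linked cohort may differ systematically from the full registry population.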

Research example

The Surveillance, Epidemiology, and End Results (SEER) Program is the premier cancer registry in the United States, covering approximately 48% of the U.S. population across 22 geographic areas. SEER collects detailed data on cancer type, stage, grade, treatment, and survival, with data going back to 1973. Thousands of landmark cancer studies have used SEER data: establishing 5-year survival benchmarks for every cancer type, identifying disparities in treatment and outcomes across racial and socioeconomic groups, and evaluating the impact of screening programs on cancer mortality trends over decades. A notable example: researchers used SEER data to show that the widespread adoption of PSA screening for prostate cancer led to a dramatic increase in early-stage prostate cancer detection without a clearly attributable reduction in prostate cancer mortality, a finding that contributed to changes in screening recommendations.
Another major resource is the UK's Clinical Practice Research Datalink (CPRD), which links primary care EHR data for over 60 million patients to hospital records, death registries, and disease-specific registries. CPRD studies have generated evidence on drug safety that would have been impossible to obtain from clinical trials alone.

Knowledge check

Q1. What is the MAIN advantage of registry-based studies?

Q2. Which is a major limitation of using administrative billing data for research?

Q3. What distinguishes a clinical registry from an administrative database?