Frequency Tables and Proportions
Core summary
Categorical data are summarized not by averages but by counts, proportions, and percentages, usually displayed in a frequency table. Rates and ratios extend the same idea.
Detailed explanation
Means and medians are for numbers. Categorical data (diagnosis, sex, smoking status, blood type, recovered or not) are summarized differently: by counting how many observations fall into each category and converting those counts into proportions and percentages.
The basic tool is the frequency table. For a single categorical variable it lists each category, its count (frequency), and its percentage of the total. For example, among 200 patients: 120 non-smokers (60%), 50 ex-smokers (25%), 30 current smokers (15%). That table is the complete, honest summary; no average is possible or meaningful. A proportion is a count divided by the total (30/200 = 0.15), and a percentage is that proportion times 100 (15%); they are the same information in two forms.
A few related terms are worth distinguishing because they are often muddled. A proportion has the part inside the whole (smokers among all patients). A ratio compares two separate counts where the numerator is not part of the denominator (30 current smokers to 120 non-smokers, a 1 to 4 ratio). A rate brings in time: the number of events is divided by the time at risk (15 new infections per 1000 patient-days). Keeping these straight matters because they answer different questions and are easy to misreport.
When you have two categorical variables, you build a cross-tabulation, or contingency table, which counts every combination of categories. A 2x2 table, for instance exposure yes/no against disease yes/no, is the foundation of much of clinical epidemiology; the risk ratios and odds ratios you will meet later are all calculated from such tables. Reading one well is a core skill: always check whether percentages are calculated across rows or down columns, because 'of smokers, 40% had the disease' and 'of diseased patients, 40% smoked' are very different claims drawn from the same table.
Good practice is to report both the count and the percentage, never a percentage alone. '30% improved' is far less informative than '6 of 20 (30%) improved', because a percentage from a tiny sample is fragile: 1 of 3 is 33%, but a single patient changing category would move it to 0% or 67%. For categorical data the count is the evidence and the percentage is the convenience, which is why Table 1 in a clinical paper conventionally shows categorical variables as n (%).
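The arithmetic behind the 200-patient smoking example can be sketched in a few lines of Python. This is an illustrative sketch only: the counts come from the example above, while the function names frequency_table and format_n_pct are hypothetical helpers invented here, not part of any library.

```python
from fractions import Fraction

# Counts from the 200-patient smoking example in the text.
counts = {"non-smoker": 120, "ex-smoker": 50, "current smoker": 30}


def frequency_table(counts):
    """Return one (category, count, proportion, percentage) row per category."""
    total = sum(counts.values())
    return [(cat, n, n / total, 100 * n / total) for cat, n in counts.items()]


def format_n_pct(n, total):
    """Format a count as 'n (%)', the convention used in a clinical Table 1."""
    return f"{n} ({100 * n / total:.0f}%)"


total = sum(counts.values())
table = frequency_table(counts)
for cat, n, proportion, percentage in table:
    print(f"{cat:<16}{format_n_pct(n, total)}")

# Proportion: the numerator is part of the denominator.
proportion_current = counts["current smoker"] / total

# Ratio: the numerator is NOT part of the denominator.
ratio_current_to_non = Fraction(counts["current smoker"], counts["non-smoker"])
print(f"proportion of current smokers: {proportion_current:.2f}")
print(f"current smokers to non-smokers: {ratio_current_to_non}")
```

Fraction is used for the ratio only so that 30 to 120 reduces to 1/4 automatically; a plain division would serve equally well.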
Clinical example
A clinic audit of 80 diabetic patients reports foot-examination status as a frequency table: 52 examined (65%), 28 not examined (35%). This single table instantly shows a care gap that a narrative such as 'most patients were examined' would blur.
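In practice an audit starts from one record per patient rather than from ready-made counts. The sketch below shows one way the tallying step might look, assuming the records are simple text labels; the list is reconstructed from the counts above, and tabulate is a hypothetical helper name.

```python
from collections import Counter

# One label per patient, reconstructed to match the audit counts in the text.
records = ["examined"] * 52 + ["not examined"] * 28


def tabulate(records):
    """Tally labels and return {category: (count, percentage)}, largest first."""
    tally = Counter(records)
    total = sum(tally.values())
    return {cat: (n, round(100 * n / total, 1)) for cat, n in tally.most_common()}


audit_table = tabulate(records)
for cat, (n, pct) in audit_table.items():
    print(f"{cat}: {n} of {len(records)} ({pct}%)")
```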
Research example
A study presents a 2x2 contingency table of vaccination (yes/no) against influenza (yes/no). All the study's headline measures (the attack rate in each group and the relative risk) are calculated directly from the cell counts of that table.
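To make the calculation concrete, here is a minimal Python sketch. The cell counts are hypothetical, invented purely for illustration and not taken from any study, and the function names are assumptions of this sketch. It also shows why the direction of a percentage matters: the row percentage and the column percentage for the same cell answer different questions.

```python
# HYPOTHETICAL cell counts, invented for illustration only.
table = {
    "vaccinated":   {"flu": 20, "no flu": 480},
    "unvaccinated": {"flu": 50, "no flu": 450},
}


def attack_rate(row):
    """Row proportion: cases divided by everyone in that exposure group."""
    return row["flu"] / (row["flu"] + row["no flu"])


def relative_risk(table, exposed="vaccinated", unexposed="unvaccinated"):
    """Attack rate in the exposed group divided by that in the unexposed group."""
    return attack_rate(table[exposed]) / attack_rate(table[unexposed])


def column_percentage(table, row_name, column):
    """Percentage of one column's total that falls in the given row."""
    column_total = sum(row[column] for row in table.values())
    return 100 * table[row_name][column] / column_total


for group, row in table.items():
    print(f"attack rate, {group}: {attack_rate(row):.1%}")
print(f"relative risk: {relative_risk(table):.2f}")

# Row percentage: of vaccinated people, what share got influenza?
print(f"of vaccinated, had flu: {100 * attack_rate(table['vaccinated']):.1f}%")
# Column percentage: of influenza cases, what share were vaccinated?
print(f"of flu cases, vaccinated: {column_percentage(table, 'vaccinated', 'flu'):.1f}%")
```

With these made-up counts the row percentage is 4.0% and the column percentage is 28.6%, two very different figures from the same cell of 20 patients.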
Knowledge check
Q1. Which summary is appropriate for the variable 'smoking status (never/ex/current)'?
Q2. A report says '30% of patients improved'. Why is 'of 20 patients, 6 (30%) improved' better?
Q3. In a 2x2 contingency table, why must you check whether percentages are by row or by column?