National Cancer Institute
U.S. National Institutes of Health | www.cancer.gov

BenchMarks
------
VOLUME 2, ISSUE 7
------
Epidemiology in a Nutshell


Reported by Nancy Nelson
July 8, 2002


Background
Epidemiology studies attempt to uncover the patterns and causes of disease in groups of people.

Epidemiologists gather data from a wide variety of sources in order to develop a comprehensive picture of health problems around the world. These are the kinds of questions they address:

  • What parts of the community have the highest rates of disease? Why?

  • What are the environmental, lifestyle, and genetic factors that increase a person's chance of developing a disease?

  • What is the natural history of a disease? Does it develop rapidly? What are the chances of surviving the disease? What factors influence survival?

  • Are there any prevention or screening measures that can improve the disease outcome?

  • What data are needed to help shape public health policy aimed at regulating harmful substances in the environment? Are specific occupations at risk for toxic exposures? Are the current levels of chemicals in the air, water, or soil hazardous to human health?

Populations, not the individual, are the focus of epidemiology research. Studies involving cancer risk, for example, cannot predict whether an individual with a certain exposure or genetic alteration will develop cancer, but can estimate the likelihood that a certain proportion of people (e.g., one out of 100) will develop cancer.

Kinds of Epidemiological Studies
The purpose of many epidemiological studies in cancer research is to see whether a specific agent (such as arsenic or benzene) or exposure (such as sunlight or smoking) is likely to cause disease. Two general approaches can be used to test a hypothesis: experimental studies and observational studies.

In experimental studies, the investigator varies a factor to see its effect on the disease. The investigator may be testing whether a substance causes cancer in animals, whether a screening procedure saves lives, or whether a particular drug prevents cancer or prolongs human life. Human experimental trials are called clinical trials and are used to evaluate agents for the prevention or treatment of diseases, including cancer, but for ethical reasons cannot be used to test suspected carcinogens (cancer-causing substances).

In observational studies, on the other hand, there is no intervention on the part of the investigator. In epidemiology, observational studies are more common than experimental ones, particularly if an investigator wants to determine whether an agent or exposure causes cancer in humans. Two primary kinds of observational studies are cohort and case-control.

Cohort Studies: The investigator begins with populations that have different exposures to a particular factor. A cohort is a group of people who are followed over time to see whether a disease develops.

In prospective cohort studies, a population is assembled and then followed over time to see if those with the highest exposure to a particular factor develop the disease at a higher rate than those with lower exposure. Two examples of this kind of study are the Framingham Study of cardiovascular disease, which began in 1948, and the Nurses' Health Study, which began in 1976 and looks at the effect of lifestyle factors on disease.

In historic cohort studies (sometimes called retrospective cohort studies), investigators use data from past records to identify those who have been exposed to a particular factor and those who have not. With this information, the researchers then attempt to determine which members of the population have developed the disease up to the present time. Retrospective cohort designs have often been used in occupational studies. By looking at the past employment histories and job descriptions of workers from historical records, scientists can make estimates of worker exposure to potential carcinogens over time. The exposure data are combined with more recent information on whether workers developed a disease. A recent study on the health effects of radar exposure on Korean War veterans who were followed for 40 years is an example of this type of study. Members of large pre-paid insurance plans such as Kaiser-Permanente and the Health Insurance Plan of Greater New York are a source of ready-made cohorts, because these plans have uniform methods for recording, storing, and retrieving data about health.

Case-control Studies: In case-control studies, the investigator begins with people diagnosed as having a disease (cases) and compares them to people without the disease (controls). Using data from a variety of sources (personal interviews, medical and hospital records), cases and controls are compared with regard to a particular exposure in their past. The purpose is to determine if the two groups differ in the proportion of people who were exposed to a specific factor. Epidemiologists may compare a variety of factors such as smoking, diet, medicines, genetic components, sun exposure, or hormone levels. One high-profile case-control study in 1980 investigated the association between artificial sweeteners and bladder cancer in humans (Hoover et al., 1980). The investigators found that the proportion of people who used artificial sweeteners was nearly the same in cases (43.1%) and controls (42.5%). Their results did not confirm the positive association seen in animal studies.

Strengths and Limitations of Observational Studies
Many diseases, including certain types of cancer, are rare. For example, every year in the United States childhood cancers develop in about 10-20 out of every 100,000 children, and brain tumors occur at a rate of about 4-5 for every 100,000 people under age 65. Cohort studies are inappropriate for rare diseases such as these because a very large cohort is required, which would make the study prohibitively expensive; a smaller size cohort, on the other hand, would not have enough cases to allow valid comparisons among possible risk factors. For these reasons, case-control studies, which can be smaller and less expensive to carry out, have been the design of choice for rare cancers. For example, the link between adenocarcinoma of the vagina in young women and exposure to DES (diethylstilbestrol), a drug once given to some mothers during pregnancy, was made on the basis of a 1971 study involving 8 cases and 32 controls (Herbst et al., 1971). Most case-control studies, however, are composed of several hundred, or a few thousand, subjects.
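To see why a cohort design struggles with rare cancers, the expected number of cases can be approximated as rate x cohort size x years of follow-up. The sketch below uses the brain tumor rate quoted above, taking 4.5 per 100,000 per year as a midpoint. The cohort sizes and the 10-year follow-up are hypothetical values chosen only for illustration, and the calculation assumes a constant rate and ignores deaths, dropouts, and aging of the cohort.

```python
def expected_cases(rate_per_100k, cohort_size, years):
    """Rough expected number of new cases, assuming a constant annual rate."""
    return rate_per_100k / 100_000 * cohort_size * years


# Midpoint of the 4-5 per 100,000 per year quoted in the text (an assumption).
BRAIN_TUMOR_RATE = 4.5

# Hypothetical cohort sizes, each followed for a hypothetical 10 years.
for size in (10_000, 100_000, 1_000_000):
    cases = expected_cases(BRAIN_TUMOR_RATE, size, years=10)
    print(f"{size:>9,} people followed 10 years: about {cases:.1f} expected cases")
```

Under these assumptions, even a cohort of 10,000 people would be expected to yield only a handful of cases in a decade, too few to compare risk factors reliably.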

Case-control studies often rely on information from past events. The source of these data may include interview or questionnaire information from the patient, relative, or physician, or from medical records. One disadvantage of case-control studies is that this information may not be available since people simply may not be able to recall events that occurred a long time ago. In addition, the recall of certain information may be biased. Bias is common in patients (cases) or relatives who may have a different recollection of past events than the controls because they are searching for explanations for a particular disease.

Selection of controls is an important issue in case-control studies. They can be selected from patients in hospitals, physicians' practices, clinics, or from the general population. Random-digit-dialing, in which investigators generate random lists of telephone numbers within a certain geographical area (for example, within the same city where cases were diagnosed) is used extensively to select population-based controls. It is crucial that the investigator try to choose controls without introducing bias. In one study, for example, investigators found an association between coffee drinking and coronary heart disease using hospital patients as controls (Jick et al., 1973). This finding wasn't corroborated in further reports. One explanation was that in the original study some of the hospital controls may have been advised against coffee drinking for medical reasons. So, if that were true, the cases were not drinking more coffee than expected, but the controls were drinking less.

Compared to case-control studies, some of the drawbacks to prospective cohort studies are that they are long and costly, and subject to losing a high percentage of patients over the length of the study. For studies of cancer, it is usually necessary to wait years, often decades, after the cohort is assembled until enough cancer events have occurred for reliable statistical analysis. However, cohort studies also have decided advantages. Unlike case-control studies, where a risk factor, such as a blood biomarker, for example, could be the result of the disease, rather than the cause of it, in cohort studies it is usually known that the exposure occurs before the disease develops. Also, prospective cohort studies can show associations of risk factors with diseases other than the one under investigation. For example, prospective cohort studies of smokers and non-smokers were designed to determine the association of smoking with lung cancer, but showed that smoking is also associated with emphysema, coronary heart disease, peptic ulcer, and cancers of the larynx, oral cavity, esophagus, and urinary bladder. Another advantage of cohort studies is that the true relative risks and incidence rates of disease can be determined. This is in contrast to case-control studies where the incidence rates are often not known. (See discussion of incidence and relative risk below.)

A summary of the advantages and disadvantages of case-control and prospective cohort studies is listed below:

CASE-CONTROL
Advantages
  • Can be less expensive
  • Smaller number of people
  • Time to carry out study is shorter
  • Suitable for rare diseases
Disadvantages
  • Incomplete information about past events
  • Biased recall of exposures may occur
  • Problems of selecting controls and matching variables
  • May yield only relative risk (odds ratio)

PROSPECTIVE COHORT
Advantages
  • More efficient for studying rare exposures
  • Less bias in risk factor data
  • May find associations with other diseases
  • Yields incidence rates as well as relative risk
Disadvantages
  • Large numbers of subjects required
  • Long follow-up period
  • Problem of attrition
  • Changes over time in criteria and methods
  • Very costly

How to Quantify Risks

Rates
Rates express the probability that a disease will occur in a defined population over a specified period of time. Rates provide a way of comparing different-sized populations with each other. They are expressed as a ratio (a numerator divided by a denominator). If the numerator is the number of people that develop the disease during a specific time period, and the denominator is the number of people at risk for developing the disease in a specific time period, the rate is called the incidence rate. If the numerator is the number of people whose death was caused by the disease, and the denominator is the number of people at risk of dying from the disease in a specific time period, the rate is called the mortality rate.

Incidence = (Number of individuals that develop the disease during a specific time) / (Number of individuals at risk of developing the disease during a specific time)

For example:
  During 1999                     Breast Cancer    Prostate Cancer
  Number of cases                 1,391            1,748
  Total population under study    1,000,000        1,000,000
  Incidence rate                  .001391          .001748
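The incidence calculation in the example above can be sketched in a few lines. The function name `incidence_rate` and the per-100,000 conversion are illustrative choices, not part of the original article; the counts are the ones in the 1999 example.

```python
def incidence_rate(new_cases, population_at_risk):
    """New cases during the period divided by the people at risk in that period."""
    if population_at_risk <= 0:
        raise ValueError("population_at_risk must be positive")
    return new_cases / population_at_risk


# Counts taken from the 1999 example in the text.
counts_1999 = {"Breast cancer": 1_391, "Prostate cancer": 1_748}
POPULATION = 1_000_000

for disease, cases in counts_1999.items():
    rate = incidence_rate(cases, POPULATION)
    # Rates are often quoted per 100,000 people to make them easier to read.
    print(f"{disease}: {rate:.6f} ({rate * 100_000:.1f} per 100,000)")
```

Expressing both results per 100,000 people is what allows populations of different sizes to be compared directly.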

Rates can provide useful clues to causes of disease. The observation that the rates of cervical cancer were higher among prostitutes than nuns first suggested that sexual activity was an important factor in the cause of this cancer. High colon cancer death rates in eastern Nebraska linked to persons of Czechoslovakian background eventually led to a correlation with nutritional factors.

"Rate" and "risk" are often used interchangeably, but these two terms are not the same. Risk generally represents a probability, such as "The lifetime risk of getting breast cancer is 11%." Rate generally represents a risk over a certain time interval, such as "The annual incidence rate for breast cancer in women between the ages of 45 and 50 is about 0.002." This means that, in a population of 100,000 women ages 45-49, about 200 women would be diagnosed with breast cancer over a 1-year period.

Relative Risk
Relative risk (sometimes called the rate ratio or risk ratio) is the measure of risk in an exposed population compared to the risk in the non-exposed group. The relative risk can tell you if there is an association between a factor/exposure and a disease, the strength of the association, and whether the factor/exposure increases or decreases the risk of disease.

The relative risk is expressed as the incidence rate for persons exposed to a factor divided by the incidence rate for those not exposed.


Relative Risk = (Incidence of the exposed group) / (Incidence of the non-exposed group)

Using the chart below:
Relative Risk = [a/(a+b)] / [c/(c+d)]
= (Cases exposed / cases and controls in exposed group) / (Cases not exposed / cases and controls in non-exposed group)

                 Cases               Controls
                 (with disease)      (without disease)
  Exposure       a                   b
  No exposure    c                   d

Example:
Here is data from a hypothetical cohort study (such as the Framingham Study) to determine whether smoking is a risk factor for coronary heart disease.

                 Cases    Controls
  Smokers        84       2916
  Non-smokers    87       4913

Relative Risk = (Incidence of the exposed) / (Incidence of the non-exposed)
= [84/(84+2916)] / [87/(87+4913)]
= (84/3000) / (87/5000)
= .0280 / .0174
= 1.61

Expressing relative risk:
Another way of describing a relative risk of 1.61 is to say that the smokers in the hypothetical example above have a 61% greater chance of developing coronary heart disease than the non-smokers (1.61 - 1.00 = .61 = 61%).

If the relative risk were 2.64, then smokers would have a 164% greater chance of developing coronary heart disease than the non-smokers (2.64 - 1.00 = 1.64 = 164%). This risk can also be described as smokers having between 2 and 3 times the chance of developing heart disease that non-smokers have.

Interpreting relative risk:
If the relative risk is equal to 1, the risk in the exposed population equals the risk in the non-exposed population, and there is no association between the exposure and the disease.

If the relative risk is greater than 1, the risk in the exposed persons is greater than the risk in non-exposed persons, and a positive association exists. More information is needed to be certain that the exposure caused the increase in risk.

If the relative risk is less than one, the risk in the exposed population is less than the risk in non-exposed population. This is evidence that the exposure may be protective against the disease.
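The relative risk formula and the three interpretation rules above can be put together in a short sketch. The function names and the wording of the messages are illustrative assumptions; the counts are those of the hypothetical smoking example. Computed without intermediate rounding, the ratio for that example is about 1.61; small differences in the second decimal place come from rounding the two incidences before dividing.

```python
def relative_risk(a, b, c, d):
    """Relative risk from a 2x2 table.

    a: exposed, with disease      b: exposed, without disease
    c: unexposed, with disease    d: unexposed, without disease
    """
    incidence_exposed = a / (a + b)
    incidence_unexposed = c / (c + d)
    return incidence_exposed / incidence_unexposed


def interpret(rr, tolerance=1e-9):
    """Translate a relative risk into the plain-language reading used in the text."""
    if abs(rr - 1.0) <= tolerance:
        return "no association"
    change = abs(rr - 1.0) * 100
    if rr > 1.0:
        return f"positive association: {change:.0f}% greater risk in the exposed"
    return f"possibly protective: {change:.0f}% lower risk in the exposed"


# Hypothetical smoking and coronary heart disease data from the text.
rr = relative_risk(a=84, b=2916, c=87, d=4913)
print(f"Relative risk = {rr:.2f} ({interpret(rr)})")
```

A relative risk above 1 only signals an association; as the text notes, more information is needed before concluding that the exposure caused the increase.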

Odds Ratio
In case-control studies, it is often not possible to calculate the relative risk because the total populations at risk (a+b and c+d in example above) are not known. The relative risk can be estimated from case-control studies, however, if certain assumptions are made (i.e., that the controls are representative of the general population, the cases are representative of all cases, and the frequency of the disease in the population is small). With these assumptions, a very good estimate of the relative risk for case-control studies can be made using the odds ratio. It is called the odds ratio because it represents the odds of having the disease with the risk factor present compared to the odds of having the disease with the risk factor absent.

  The odds ratio = (a/b) ÷ (c/d) = (a × d)/(b × c), where a, b, c, and d are defined below:

                 Disease
                 Cases    Controls
  Exposure       a        b
  No exposure    c        d

Example:
Data from a study to determine whether tonsillectomy is associated with subsequent development of Hodgkin's disease (Vianna et al., 1971) are used to calculate an odds ratio below. There were 109 controls and 109 cases.
                                 Hodgkin's Disease
                                 Cases (yes)    Controls (no)
  Prior tonsillectomy (yes)      67             43
  No prior tonsillectomy (no)    34             64

The odds ratio is: (a × d)/(b × c)
= (67 × 64)/(43 × 34)
= 2.9.
The researchers reported that the odds ratio for people with a tonsillectomy was 2.9. This means that the estimated relative risk of developing Hodgkin's disease is nearly three times as great for those with a prior tonsillectomy as for those with intact tonsils.
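The cross-product calculation above is easy to check in code. The sketch below recomputes the tonsillectomy odds ratio from the four cell counts, and then applies the same function to the hypothetical smoking table from the relative risk section to illustrate how closely the odds ratio tracks the relative risk when the disease is infrequent. The function names are illustrative assumptions.

```python
def odds_ratio(a, b, c, d):
    """Odds ratio (a/b) / (c/d), written as the cross-product (a*d) / (b*c).

    a: exposed cases      b: exposed controls
    c: unexposed cases    d: unexposed controls
    """
    return (a * d) / (b * c)


def relative_risk(a, b, c, d):
    """Relative risk; only meaningful when a+b and c+d are the populations at risk."""
    return (a / (a + b)) / (c / (c + d))


# Tonsillectomy and Hodgkin's disease counts from the table in the text.
print(f"Tonsillectomy odds ratio: {odds_ratio(67, 43, 34, 64):.1f}")

# Hypothetical smoking cohort from the text: the disease is infrequent,
# so the odds ratio lands close to the relative risk.
print(f"Smoking odds ratio:    {odds_ratio(84, 2916, 87, 4913):.2f}")
print(f"Smoking relative risk: {relative_risk(84, 2916, 87, 4913):.2f}")
```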

Absolute Risk
The absolute risk is a rate that describes the incidence of the disease in a population. For a rare cancer, the incidence or absolute risk may be 2 cases/100,000 people whereas for a more common cancer, the incidence may be 150 cases/100,000 people. The absolute risk is important in interpreting the impact of the agent or exposure on the general population.

Suppose an exposure has a relative risk of 1.5. If the exposure were associated with a rare cancer, the absolute risk for that cancer could increase from 2 cases of cancer/100,000 non-exposed to 3 cases of cancer/100,000 exposed people (2 cases x 1.5 = 3 cases). However, if the exposure with relative risk of 1.5 were associated with a more common cancer, the absolute risk could increase from 150 cases of cancer/100,000 unexposed people to 225 cases of cancer/100,000 exposed people (150 cases x 1.5 = 225 cases). For a given relative risk, the greater the magnitude of the absolute risk of the disease, the greater the number of people that could be affected by the exposure.
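The arithmetic in the paragraph above can be written as a small sketch that applies the same relative risk to a rare and a common cancer. The function names are illustrative; the baseline rates and the relative risk of 1.5 are the ones used in the text.

```python
def exposed_risk_per_100k(baseline_per_100k, relative_risk):
    """Absolute risk among the exposed, per 100,000 people."""
    return baseline_per_100k * relative_risk


def excess_cases_per_100k(baseline_per_100k, relative_risk):
    """Additional cases per 100,000 exposed people attributable to the higher risk."""
    return baseline_per_100k * (relative_risk - 1)


RELATIVE_RISK = 1.5
for label, baseline in (("rare cancer", 2), ("common cancer", 150)):
    exposed = exposed_risk_per_100k(baseline, RELATIVE_RISK)
    excess = excess_cases_per_100k(baseline, RELATIVE_RISK)
    print(f"{label}: {baseline} -> {exposed:g} per 100,000 ({excess:g} extra cases)")
```

The relative risk is identical in both rows, but the number of additional cases differs by a factor of 75, which is why absolute risk matters for judging public health impact.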

Criteria to consider when deciding if an exposure causes the disease
After epidemiologists estimate an association between a factor/exposure and a disease (i.e., relative risk or odds ratio), they try to determine the role of chance or random variation. Often they test for statistical significance, which is a measure of how likely it is that the observed association could simply have been due to chance. The conventional method for evaluating the likelihood that the result is due to chance is known as the P (probability) value. Statistically significant, by convention, implies that P is .05 or less, meaning that there is a 5% or less chance that the observed result could have happened by chance.* If the P value is greater than .05 (5%), the association is thought to have too high a probability of having occurred due to chance to claim that the finding is worthy of special note. The 5% cutoff is a completely arbitrary historical artifact. Many scientists are unhappy with this evaluation criterion, but no satisfactory alternative has been widely adopted.

Another method of evaluating the role of chance is by reporting a confidence interval. This method is preferred by epidemiologists. In this case, epidemiologists report the relative risk or odds ratio estimate along with bounds of specified probability, typically 95% (the complement of the 5% used for significance). Statistical theory shows that a 95% confidence interval should include the true value of the estimate 95% of the time. Often, if the 95% confidence interval includes 1.0, the relationship between the exposure and disease is considered not significant. In a recent case-control study reported in the June 2002 issue of the New England Journal of Medicine, scientists reported an odds ratio of 1.0 and a 95% confidence interval of 0.8-1.2 to describe the breast cancer risk associated with oral contraceptives for 35-44 year old women. Because the confidence interval included 1.0, they concluded that oral contraceptives did not significantly increase the risk of breast cancer for these women.
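The article does not say how such intervals are calculated. One common textbook approximation for an odds ratio, assumed here purely for illustration, works on the logarithmic scale: the standard error of ln(odds ratio) is taken as the square root of 1/a + 1/b + 1/c + 1/d, and the 95% bounds are the odds ratio multiplied and divided by exp(1.96 x standard error). The sketch below applies it to the tonsillectomy table from the odds ratio section; it is not necessarily the method used in any of the studies cited.

```python
import math


def odds_ratio_ci(a, b, c, d, z=1.96):
    """Odds ratio with an approximate confidence interval (log method).

    z=1.96 corresponds to a 95% interval under a normal approximation.
    All four cell counts must be greater than zero.
    """
    estimate = (a * d) / (b * c)
    standard_error = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(estimate) - z * standard_error)
    upper = math.exp(math.log(estimate) + z * standard_error)
    return estimate, lower, upper


# Tonsillectomy and Hodgkin's disease counts from the text.
estimate, lower, upper = odds_ratio_ci(67, 43, 34, 64)
includes_one = lower <= 1.0 <= upper
print(f"Odds ratio {estimate:.1f}, 95% CI {lower:.1f}-{upper:.1f}")
print("Interval includes 1.0" if includes_one else "Interval excludes 1.0")
```

With this approximation, an interval that excludes 1.0 corresponds to the usual reading of a statistically significant association at the 5% level.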

Other factors to consider:
If a statistically significant association between the agent/exposure and disease is found, it is still not clear whether the association is causal. (A common example of a strong association, but no causality, is the rooster crowing and the rising of the sun.) Additional factors are considered to determine whether the association is real. These include:

  • Magnitude of risk: Relative risks or odds ratios less than 2.00 are viewed with caution. Small relative risks are sometimes difficult to interpret.

  • Dose-response: Scientists feel more confident of a causal association if increased exposure levels result in increased risk.

  • Consistency across studies: When the same degree of risk is seen in similar kinds of studies, there is stronger support for a causal relationship.

  • Time considerations: The appearance of the disease should occur at a biologically appropriate period of time after the exposure.

  • Biological plausibility: There should be a biologically plausible mechanism to explain the occurrence of the disease after the exposure.

  • Confounding factors: The presence of a confounding factor makes it falsely appear that there is a causal relationship between a risk factor and a disease. For example, smoking was a confounder in a study looking at whether coffee drinking was a risk factor for pancreatic cancer (MacMahon, 1981). It appeared that coffee drinking caused pancreatic cancer when, in fact, they are only associated because they are both linked to smoking. (Smoking is a confounder because it is a known risk factor for pancreatic cancer, and is associated with coffee drinking.) Age, educational level, and socioeconomic status are common confounding variables because each is often related to the exposure and disease under study. One way to control for confounding is by selecting controls so that they are similar to the cases in specific characteristics. This is called "matching" and it enables the investigator to cancel out the effects of these variables. Age, sex and race are the most common matching variables. Another method of controlling for potential confounding is to adjust the relative risk for the influence of the potential confounder.

  • Biases: Biases are flaws in the study design that prejudice the results. These may be methods used to collect the data, to select the participants, or to interview the participants. One classic example of bias occurred with a poll that was taken in 1936 to predict the outcome of the Presidential election. Even though the Democrat Franklin Roosevelt won by a landslide, the pollsters incorrectly predicted a Republican victory. The poll was biased by the fact that the names were taken from telephone listings. Phone subscribers in those days were typically wealthier than non-subscribers, and more likely to vote Republican.
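The idea of adjusting for a confounder, mentioned in the list above, can be illustrated by stratifying the data on the confounder and pooling the stratum-specific results. The sketch below uses the Mantel-Haenszel pooled odds ratio, one widely used adjustment method that the article itself does not name. All counts are invented for illustration and are not taken from MacMahon (1981) or any real study: they are constructed so that coffee drinking has no association with the disease within either smoking stratum, while the unstratified table suggests one.

```python
def odds_ratio(a, b, c, d):
    """Cross-product odds ratio for one 2x2 table."""
    return (a * d) / (b * c)


def mantel_haenszel_or(tables):
    """Mantel-Haenszel pooled odds ratio across strata.

    Each table is (a, b, c, d): exposed cases, exposed controls,
    unexposed cases, unexposed controls.
    """
    numerator = sum(a * d / (a + b + c + d) for a, b, c, d in tables)
    denominator = sum(b * c / (a + b + c + d) for a, b, c, d in tables)
    return numerator / denominator


# Hypothetical counts; exposure = coffee drinking, stratified by smoking.
strata = {
    "smokers": (80, 40, 20, 10),
    "non-smokers": (10, 20, 40, 80),
}

# Collapsing the strata gives the crude (unadjusted) table.
crude_table = tuple(sum(cells) for cells in zip(*strata.values()))

print(f"Crude odds ratio: {odds_ratio(*crude_table):.2f}")
for name, table in strata.items():
    print(f"  {name}: {odds_ratio(*table):.2f}")
print(f"Adjusted (Mantel-Haenszel) odds ratio: {mantel_haenszel_or(strata.values()):.2f}")
```

In these made-up data the crude odds ratio is well above 1 only because smokers are both more likely to drink coffee and more likely to be cases; once smoking is taken into account, the apparent effect of coffee disappears.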

The Importance of Establishing the Cause of a Disease
The weight of all the factors listed above helps epidemiologists decide if a measured risk is actually real. Establishing a cause and effect relationship between agent/exposure and disease is very important for public health, even though it is often difficult to do. For example, a statistical association between elevated cholesterol levels and a particular disease may exist but, unless cholesterol causes the disease, lowering cholesterol levels may have no effect on decreasing the incidence of the disease.

*To be more precise, if you assume there was no association (relative risk = 1) between a factor and disease, then the P value expresses the probability that you would find an association as strong as the one you actually did find, or stronger. So, for example, if you found a relative risk of 3.0, and P = .0004, then, if the true relative risk were 1, the probability that you would find a relative risk of 3.0 or higher is .04%. Since .04% is very small, the relative risk is probably not equal to one.

References:
Gordis, Leon. Epidemiology. W.B. Saunders Company, 1996.
Mausner, Judith S., and Kramer, Shira. Epidemiology: An Introductory Text. W.B. Saunders Company, 1985.
