Reported by Nancy Nelson
July 8, 2002
Background
Epidemiology studies attempt to uncover the patterns and causes
of disease in groups of people.
Epidemiologists gather data from a wide variety of sources in order
to develop a comprehensive picture of health problems around the
world. These are the kinds of questions they address:
- What parts of the community have the highest rates of disease? Why?
- What are the environmental, lifestyle, and genetic factors that increase a person's chance of developing a disease?
- What is the natural history of a disease? Does it develop rapidly? What are the chances of surviving the disease? What factors influence survival?
- Are there any prevention or screening measures that can improve the disease outcome?
- What data are needed to help shape public health policy aimed at regulating harmful substances in the environment? Are specific occupations at risk for toxic exposures? Are the current levels of chemicals in the air, water, or soil hazardous to human health?
Populations, not individuals, are the focus of epidemiological
research. Studies involving cancer risk, for example, cannot predict
whether an individual with a certain exposure or genetic alteration
will develop cancer, but can estimate the likelihood that a certain
proportion of people (e.g., one out of 100) will develop cancer.
Kinds of Epidemiological Studies
The purpose of many epidemiological studies in cancer research is
to determine whether a specific agent (e.g., arsenic or benzene) or exposure
(e.g., sunlight or smoking) is likely to cause disease. Two general approaches
can be used to test a hypothesis--experimental and
observational studies.
In experimental studies, the investigator varies a factor
to see its effect on the disease. The investigator may be testing
whether a substance causes cancer in animals, whether a screening
procedure saves lives, or whether a particular drug prevents cancer
or prolongs human life. Human experimental trials are called clinical
trials and are used to evaluate agents for the prevention or treatment
of diseases, including cancer, but for ethical reasons cannot be
used to test suspected carcinogens (cancer-causing substances).
In observational studies, on the other hand, there is no
intervention on the part of the investigator. In epidemiology, observational
studies are more common than experimental ones, particularly if
an investigator wants to determine whether an agent or exposure
causes cancer in humans. Two primary kinds of observational studies
are cohort and case-control.
Cohort Studies: The investigator begins with populations
that have different exposures to a particular factor. A cohort
is a group of people who are followed over time to see whether
a disease develops.
In prospective cohort studies, a population is assembled
and then followed over time to see if those with the highest exposure
to a particular factor develop the disease at a higher rate than
those with lower exposure. Two examples of this kind of study
are the Framingham Study of cardiovascular disease, which began
in 1948, and the Nurses' Health Study, which began in 1976 and
looks at the effect of lifestyle factors on disease.
In historic cohort studies (sometimes called retrospective
cohort studies), investigators use data from past records
to identify those who have been exposed to a particular factor
and those who have not. With this information, the researchers
then attempt to determine which members of the population have
developed the disease up to the present time. Retrospective cohort
designs have often been used in occupational studies. By looking
at the past employment histories and job descriptions of workers
from historical records, scientists can make estimates of worker
exposure to potential carcinogens over time. These exposure data
are combined with more recent information on whether workers developed
a disease. A recent study on the health effects of radar exposure
on Korean War veterans who were followed for 40 years is an example
of this type of study. One source of ready-made cohorts is the
membership of large pre-paid insurance plans such as Kaiser-Permanente
and the Health Insurance Plan of Greater New York, which have
uniform methods for recording, storing, and retrieving data about
health.
Case-control Studies: In case-control studies,
the investigator begins with people diagnosed as having a disease
(cases) and compares them to people without the disease (controls).
Using data from a variety of sources (personal interviews,
medical and hospital records), cases and controls are compared
with regard to a particular exposure in their past. The purpose
is to determine if the two groups differ in the proportion of
people who are exposed to a specific factor. Epidemiologists may
compare a variety of factors such as smoking, diet, medicines,
genetic components, sun exposure, or hormone levels. One high-profile
case-control study in 1980 investigated the association between
artificial sweeteners and bladder cancer in humans (Hoover et
al., 1980). The investigators found that the proportion of people
who used artificial sweeteners was the same in both cases (43.1%)
and controls (42.5%). Their results did not confirm the positive
association seen in animal studies.
Strengths and Limitations of Observational Studies
Many diseases, including certain types of cancer, are rare. For
example, every year in the United States childhood cancers develop
in about 10-20 out of every 100,000 children, and brain tumors occur
at a rate of about 4-5 for every 100,000 people under age 65. Cohort
studies are inappropriate for rare diseases such as these because
a very large cohort is required, which would make the study prohibitively
expensive; a smaller size cohort, on the other hand, would not have
enough cases to allow valid comparisons among possible risk factors.
For these reasons, case-control studies, which can be smaller and
less expensive to carry out, have been the design of choice for
rare cancers. For example, the link between adenocarcinoma of the
vagina in young women and exposure to DES (diethylstilbestrol),
a drug once given to some mothers during pregnancy, was made on
the basis of a 1971 study involving 8 cases and 32 controls (Herbst
et al., 1971). Most case-control studies, however, are composed
of several hundred, or a few thousand, subjects.
Case-control studies often rely on information from past events.
The source of these data may include interview or questionnaire
information from the patient, relative, or physician, or from medical
records. One disadvantage of case-control studies is that this information
may not be available since people simply may not be able to recall
events that occurred a long time ago. In addition, the recall of
certain information may be biased. Bias is common in patients (cases)
or relatives who may have a different recollection of past events
than the controls because they are searching for explanations for
a particular disease.
Selection of controls is an important issue in case-control studies.
They can be selected from patients in hospitals, physicians' practices,
clinics, or from the general population. Random-digit-dialing, in
which investigators generate random lists of telephone numbers within
a certain geographical area (for example, within the same city where
cases were diagnosed) is used extensively to select population-based
controls. It is crucial that the investigator try to choose controls
without introducing bias. In one study, for example, investigators
found an association between coffee drinking and coronary heart
disease using hospital patients as controls (Jick et al., 1973).
This finding wasn't corroborated in further reports. One explanation
was that in the original study some of the hospital controls may
have been advised against coffee drinking for medical reasons. So,
if that were true, the cases were not drinking more coffee than
expected, but the controls were drinking less.
Compared to case-control studies, some of the drawbacks to prospective
cohort studies are that they are long and costly, and subject to
losing a high percentage of patients over the length of the study.
For studies of cancer, it is usually necessary to wait years, often
decades, after the cohort is assembled until enough cancer events
have occurred for reliable statistical analysis. However, cohort
studies also have decided advantages. Unlike case-control studies,
where a risk factor (a blood biomarker, for example) could be
the result of the disease rather than its cause, in cohort
studies it is usually known that the exposure occurred before the
disease developed. Also, prospective cohort studies can show associations
of risk factors with diseases other than the one under investigation.
For example, prospective cohort studies of smokers and non-smokers
were designed to determine the association of smoking with lung
cancer, but showed that smoking is also associated with emphysema,
coronary heart disease, peptic ulcer, and cancers of the larynx,
oral cavity, esophagus, and urinary bladder. Another advantage of
cohort studies is that the true relative risks and incidence rates
of disease can be determined. This is in contrast to case-control
studies where the incidence rates are often not known. (See discussion
of incidence and relative risk below.)
A summary of the advantages and disadvantages of case-control and
prospective cohort studies is listed in the chart below:
CASE-CONTROL
Advantages:
- Can be less expensive
- Smaller number of people
- Time to carry out study is shorter
- Suitable for rare diseases
Disadvantages:
- Incomplete information about past events
- Biased recall of exposures may occur
- Problems of selecting controls and matching variables
- May yield only relative risk (odds ratio)

PROSPECTIVE COHORT
Advantages:
- More efficient for studying rare exposures
- Less bias in risk factor data
- May find associations with other diseases
- Yields incidence rates as well as relative risk
Disadvantages:
- Large numbers of subjects required
- Long follow-up period
- Problem of attrition
- Changes over time in criteria and methods
- Very costly
How to Quantify Risks
Rates
Rates express the probability that a disease will occur in a defined
population over a specified period of time. Rates provide a way
of comparing different-sized populations with each other. They are
expressed as a ratio (a numerator divided by a denominator). If
the numerator is the number of people that develop the disease during
a specific time period, and the denominator is the number of people
at risk for developing the disease in a specific time period, the
rate is called the incidence rate. If the numerator
is the number of people whose death was caused by the disease, and
the denominator is the number of people at risk of dying from the
disease in a specific time period, the rate is called the mortality
rate.
Incidence rate = (number of individuals who develop the disease
during a specific time period) / (number of individuals at risk of
developing the disease during that time period)

For example, during 1999:

                               Breast Cancer    Prostate Cancer
Number of cases                1,391            1,748
Total population under study   1,000,000        1,000,000
Incidence rate                 0.001391         0.001748
Rates can provide useful clues to causes of disease. The observation
that the rates of cervical cancer were higher among prostitutes
than nuns first suggested that sexual activity was an important
factor in the cause of this cancer. The observation of high colon
cancer death rates among persons of Czechoslovakian background in
eastern Nebraska eventually led to a correlation with nutritional factors.
"Rate" and "risk" are often used interchangeably, but these two
terms are not the same. Risk generally represents a probability,
such as "The lifetime risk of getting breast cancer is 11%." Rate
generally represents a risk over a certain time interval, such as
"The annual incidence rate for breast cancer in women between the
ages of 45 and 50 is about 0.002." This means that, in a population
of 100,000 women ages 45-49, about 200 women would be diagnosed
with breast cancer over a 1-year period.
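These calculations are simple enough to check directly; here is a minimal Python sketch using the figures quoted above (the function name is mine, for illustration):

```python
# Incidence rate = new cases / population at risk, over a stated period
def incidence_rate(cases, population_at_risk):
    return cases / population_at_risk

# Figures from the 1999 example above
breast = incidence_rate(1391, 1_000_000)      # 0.001391
prostate = incidence_rate(1748, 1_000_000)    # 0.001748

# An annual rate of 0.002 applied to 100,000 women ages 45-49
expected_cases = 0.002 * 100_000              # about 200 diagnoses per year

print(breast, prostate, expected_cases)
```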
Relative Risk
Relative risk (sometimes called the rate ratio or risk ratio) is
the measure of risk in an exposed population compared to the risk
in the non-exposed group. The relative risk can tell you if there
is an association between a factor/exposure and a disease, the strength
of the association, and whether the factor/exposure increases or
decreases the risk of disease.
The relative risk is expressed as the incidence rate for persons
exposed to a factor divided by the incidence rate for those not
exposed.
Relative Risk = (incidence rate of the exposed group) / (incidence
rate of the non-exposed group)

Using the 2x2 chart below:

Relative Risk = [a/(a+b)] / [c/(c+d)]
= (cases exposed / cases and controls in exposed group) / (cases
not exposed / cases and controls in non-exposed group)

              Cases             Controls
              (with disease)    (without disease)
Exposure      a                 b
No exposure   c                 d
Example:
Here are data from a hypothetical cohort study (such as the Framingham Study)
to determine whether smoking is a risk factor for coronary heart disease.

              Cases    Controls
Smokers       84       2,916
Non-smokers   87       4,913

Relative Risk = (incidence of the exposed) / (incidence of the non-exposed)
              = [84/(84+2,916)] / [87/(87+4,913)]
              = (84/3,000) / (87/5,000)
              = 0.0280 / 0.0174
              = 1.61

Expressing relative risk:
Another way of describing a relative risk of 1.61 is to say that
the smokers in the hypothetical example above have a 61% greater
chance of developing coronary heart disease than the non-smokers
(1.61 - 1.00 = 0.61 = 61%).
If the relative risk were 2.64, then smokers would have a 164%
greater chance of developing coronary heart disease than the non-smokers
(2.64 - 1.00 = 1.64 = 164%). This risk can also be described as
smokers having a 2-3 times greater chance of developing heart disease
than the non-smokers.
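The worked example can be reproduced in a few lines of Python; this sketch keeps the intermediate incidences at full precision, which gives a relative risk of about 1.61:

```python
def relative_risk(a, b, c, d):
    """RR = [a/(a+b)] / [c/(c+d)] for a cohort-study 2x2 table."""
    return (a / (a + b)) / (c / (c + d))

# Smokers: 84 cases, 2,916 non-cases; non-smokers: 87 cases, 4,913 non-cases
rr = relative_risk(84, 2916, 87, 4913)
print(f"RR = {rr:.2f} -> smokers have a {100 * (rr - 1):.0f}% greater risk")
```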
Interpreting relative risk:
If the relative risk is equal to 1, the risk in the exposed
population equals the risk in the non-exposed population, and there
is no association between the exposure and the disease.
If the relative risk is greater than 1, the risk in the
exposed persons is greater than the risk in non-exposed persons,
and a positive association exists. More information is needed to
be certain that the exposure caused the increase in risk.
If the relative risk is less than one, the risk in the exposed
population is less than the risk in non-exposed population. This
is evidence that the exposure may be protective against the disease.
Odds Ratio
In case-control studies, it is often not possible to calculate the
relative risk because the total populations at risk (a+b and c+d
in example above) are not known. The relative risk can be estimated
from case-control studies, however, if certain assumptions are made
(i.e., that the controls are representative of the general population,
the cases are representative of all cases, and the frequency of
the disease in the population is small). With these assumptions,
a very good estimate of the relative risk for case-control studies
can be made using the odds ratio. It is called the
odds ratio because it represents the odds of having the disease
with the risk factor present compared to the odds of having the
disease with the risk factor absent.
The odds ratio = (a/b) ÷ (c/d) = (a × d)/(b × c),
where a, b, c, and d are defined below:

              Cases    Controls
Exposure      a        b
No exposure   c        d
Example: Data from a study to determine whether tonsillectomy
is associated with subsequent development of Hodgkin's disease (Vianna et al., 1971) are
used to calculate an odds ratio below. There were 109 controls and 109 cases.

                          Hodgkin's Disease
                          Cases    Controls
Prior tonsillectomy       67       43
No prior tonsillectomy    34       64

The odds ratio is:

(a × d)/(b × c) = (67 × 64)/(43 × 34) = 2.9

The researchers reported that the odds ratio for people with a tonsillectomy was 2.9. This means that the relative risk of developing
Hodgkin's disease is nearly three times greater for those with a prior tonsillectomy than for those with intact tonsils.
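A quick check of this arithmetic, with the counts taken from the table above:

```python
def odds_ratio(a, b, c, d):
    """OR = (a*d) / (b*c) for a case-control 2x2 table."""
    return (a * d) / (b * c)

# Tonsillectomy / Hodgkin's disease counts (Vianna et al., 1971)
or_ = odds_ratio(67, 43, 34, 64)   # (67*64)/(43*34) = 4288/1462
print(f"odds ratio = {or_:.1f}")
```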
Absolute Risk
The absolute risk is a rate that describes the incidence of the
disease in a population. For a rare cancer, the incidence or absolute
risk may be 2 cases/100,000 people whereas for a more common cancer,
the incidence may be 150 cases/100,000 people. The absolute risk
is important in interpreting the impact of the agent or exposure
on the general population.
Suppose an exposure has a relative risk of 1.5. If the exposure
were associated with a rare cancer, the absolute risk for that cancer
could increase from 2 cases of cancer/100,000 non-exposed to 3 cases
of cancer/100,000 exposed people (2 cases x 1.5 = 3 cases). However,
if the exposure with relative risk of 1.5 were associated with a
more common cancer, the absolute risk could increase from 150 cases
of cancer/100,000 unexposed people to 225 cases of cancer/100,000
exposed people (150 cases x 1.5 = 225 cases). For a given relative
risk, the greater the magnitude of the absolute risk of the disease,
the greater the number of people that could be affected by the exposure.
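The same scaling is a one-line calculation; a minimal sketch using the hypothetical figures above (function name is mine):

```python
def exposed_rate(baseline_per_100k, relative_risk):
    """Absolute risk among the exposed, per 100,000 people."""
    return baseline_per_100k * relative_risk

RR = 1.5
rare = exposed_rate(2, RR)      # rare cancer: 2 -> 3 cases per 100,000
common = exposed_rate(150, RR)  # common cancer: 150 -> 225 cases per 100,000
print(rare, common)
```

The same relative risk adds 1 extra case per 100,000 in the rare-cancer scenario, but 75 extra cases per 100,000 in the common-cancer scenario.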
Criteria to consider when deciding if an exposure causes the
disease
After epidemiologists estimate the association between a factor/exposure
and a disease (i.e., the relative risk or odds ratio), they try to determine
the role of chance or random variation. Often they test for statistical
significance, which is a measure of how likely it is that the
observed association could simply have been due to chance. The conventional
method for evaluating the likelihood that the result is due to chance
is known as the P (probability) value. Statistically significant,
by convention, implies that P is .05 or less, meaning that there
is a 5% or less chance that the observed result could have happened
by chance.* If the P value is greater than 5%, the association is
thought to have too high a probability of having occurred due to
chance to claim that the finding is worthy of special note. The
5% P value is a completely arbitrary historical artifact. Many scientists
are unhappy with this evaluation criterion, but no satisfactory
alternative has been widely adopted.
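As a concrete illustration of a P value for a 2x2 table, the sketch below computes a one-sided Fisher exact test using only the standard library, reusing the tonsillectomy counts from the odds-ratio example. The choice of test is my assumption; the source does not name one, and in practice a two-sided test or chi-square test might be used instead.

```python
from math import comb

def fisher_one_sided(a, b, c, d):
    """One-sided Fisher exact P value for the 2x2 table [[a, b], [c, d]]:
    the probability, under the null hypothesis of no association, of a
    table at least as extreme (a or larger) with the same margins."""
    row1, row2 = a + b, c + d      # exposed / non-exposed totals
    col1 = a + c                   # total cases
    n = row1 + row2
    denom = comb(n, col1)
    p = 0.0
    for k in range(a, min(row1, col1) + 1):
        if col1 - k > row2:        # table impossible with these margins
            continue
        p += comb(row1, k) * comb(row2, col1 - k) / denom
    return p

# Tonsillectomy / Hodgkin's disease counts from the example above
p = fisher_one_sided(67, 43, 34, 64)
print(f"one-sided P = {p:.4g}")    # well below the conventional .05 cutoff
```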
Another method of evaluating the role of chance is by reporting
a confidence interval. This method is preferred by epidemiologists.
In this case, epidemiologists report the relative risk or odds ratio
estimate along with bounds of specified probability, typically 95%
(the complement of the 5% used for significance). Statistical theory
shows that a 95% confidence interval should include the true value
of the quantity being estimated 95% of the time. Often, if the 95% confidence interval
includes 1.0, the relationship between the exposure and disease
is considered not significant. In a recent case-control study reported
in the June 2002 issue of New England Journal of Medicine,
scientists reported an odds ratio of 1.0 and a 95% confidence interval
of 0.8-1.2 to describe the breast cancer risk associated with oral
contraceptives for 35-44 year old women. Because the confidence
interval included 1.0, they concluded that oral contraceptives did
not significantly increase the risk of breast cancer for these women.
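A confidence interval for an odds ratio can be computed from the 2x2 counts. The sketch below uses Woolf's logit method, a standard approach that I am supplying (the article does not specify a method), reusing the tonsillectomy counts from the odds-ratio example:

```python
from math import log, exp, sqrt

def odds_ratio_ci(a, b, c, d, z=1.96):
    """Odds ratio with its 95% confidence interval (Woolf's method):
    exponentiate ln(OR) +/- z standard errors of ln(OR)."""
    or_ = (a * d) / (b * c)
    se = sqrt(1/a + 1/b + 1/c + 1/d)   # std. error of ln(OR)
    return or_, exp(log(or_) - z * se), exp(log(or_) + z * se)

or_, lo, hi = odds_ratio_ci(67, 43, 34, 64)
print(f"OR = {or_:.1f}, 95% CI = ({lo:.1f}, {hi:.1f})")
# Here the interval excludes 1.0, consistent with a significant association
```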
Other factors to consider:
If a statistically significant association between the agent/exposure
and disease is found, it is still not clear whether the association
is causal. (A common example of a strong association, but no causality,
is the rooster crowing and the rising of the sun.) Additional factors
are considered to determine whether the association is real. These
include:
- Magnitude of risk: Relative risks or odds ratios less than 2.00 are viewed with caution. Small relative risks are sometimes difficult to interpret.
- Dose-response: Scientists feel more confident of a causal association if increased exposure levels result in increased risk.
- Consistency across studies: When the same degree of risk is seen in similar kinds of studies, there is stronger support for a causal relationship.
- Time considerations: The appearance of the disease should occur at a biologically appropriate period of time after the exposure.
- Biological plausibility: There should be a biologically plausible mechanism to explain the occurrence of the disease after the exposure.
- Confounding factors: The presence of a confounding factor makes it falsely appear that there is a causal relationship between a risk factor and a disease. For example, smoking was a confounder in a study looking at whether coffee drinking was a risk factor for pancreatic cancer (MacMahon, 1981). It appeared that coffee drinking caused pancreatic cancer when, in fact, the two are only associated because both are linked to smoking. (Smoking is a confounder because it is a known risk factor for pancreatic cancer and is associated with coffee drinking.) Age, educational level, and socioeconomic status are common confounding variables because each is often related to both the exposure and the disease under study. One way to control for confounding is to select controls so that they are similar to the cases in specific characteristics. This is called "matching," and it enables the investigator to cancel out the effects of these variables. Age, sex, and race are the most common matching variables. Another method of controlling for potential confounding is to adjust the relative risk for the influence of the potential confounder.
- Biases: Biases are flaws in the study design that prejudice the results. They may involve the methods used to collect the data, to select the participants, or to interview the participants. One classic example of bias occurred with a poll taken in 1936 to predict the outcome of the Presidential election. Even though the Democrat Franklin Roosevelt won by a landslide, the pollsters incorrectly predicted a Republican victory. The poll was biased by the fact that the names were taken from telephone listings. Phone subscribers in those days were typically wealthier than non-subscribers, and more likely to vote Republican.
The Importance of Establishing the Cause of a Disease
The weight of all the factors listed above helps epidemiologists
decide if a measured risk is actually real. Establishing a cause
and effect relationship between agent/exposure and disease is very
important for public health, even though it is often difficult to
do. For example, a statistical association between elevated cholesterol
levels and a particular disease may exist but, unless cholesterol
causes the disease, lowering cholesterol levels may have no effect
on decreasing the incidence of the disease.
*To be more precise, if you assume there is no association (relative
risk = 1) between a factor and a disease, then the P value expresses
the probability of finding an association as strong as, or stronger
than, the one you actually found. So, for example, if you found a
relative risk of 3.0 with P = .0004, then, if the true relative risk
were 1, the probability of finding a relative risk of 3.0 or higher
would be 0.04%. Since 0.04% is very small, the relative risk is
probably not equal to one.
References:
Gordis, Leon. Epidemiology. W. B. Saunders Company, 1996.
Mausner, Judith S., and Kramer, Shira. Epidemiology: An Introductory Text. W. B. Saunders Company, 1985.