Statistical Guidance for Clinical Trials of Non-Diagnostic Medical Devices
PREFACE
The Office of Surveillance and Biometrics (OSB) of FDA's Center
for Devices and Radiological Health (CDRH) was established in
July, 1993 to consolidate and focus CDRH postmarket surveillance
programs. A major portion of the OSB mandate is to employ
significant clinical, technical and scientific skills to identify
and resolve public health problems. Towards this goal, the
Office provides statistical, epidemiological, and biometrics
services in support of the major operating programs of the
Center. Reviewing premarket approval applications (PMA) to
assure the safety and effectiveness of marketed medical devices
is a particularly vital part of that support.
The controlled clinical trial is the primary vehicle used to
advance new medical device technology through the PMA approval
process. These investigations provide the basis of valid
scientific evidence that FDA requires to evaluate new medical
device technology. As such, it is critical that a sponsor
correctly plan, conduct and analyze these trials.
The following guidance has been prepared by OSB's Division of
Biostatistics with help from the Center's Office of Device
Evaluation (ODE), academia, and the medical device industry. The
primary purpose of this document is to assist medical device
manufacturers in advancing their product through the premarket
approval process. The guidance is based on expertise and
experience in reviewing data from medical device clinical trials,
and a major FDA workshop on Medical Device Clinical Trials held
in September, 1993.
It is our hope that this document, along with the additional
information and references that have been cited will help
manufacturers save time, money, and human resources in the
planning, conduct, and analysis of medical device clinical
trials.
Larry G. Kessler, Sc.D.
Director,
Office of Surveillance and Biometrics
Your comments and suggestions are welcome. Please address any correspondence regarding this guidance to:
Division of Biostatistics - HFZ-542
Office of Surveillance and Biometrics
FDA/CDRH
9200 Corporate Blvd.
Rockville, MD 20850
Tel: 301-594-0616
FAX: 301-443-8559
This document is consistent with previously published clinical
study guidance (DHHS, 1987; DHHS, 1990; DHHS, 1992) but provides
a more comprehensive treatment of the clinical trial process from
a statistical perspective. An accompanying guidance covers
clinical aspects of device trials. This guidance describes how a
sponsor should proceed to properly design and conduct a clinical
trial in order to provide a meaningful evaluation and
interpretation of clinical data in support of medical device
Premarket Approval Applications (PMA).
The development of this clinical trial guidance resulted from a
concern about the quality of clinical trials submitted to the
Agency in support of medical device applications. This concern
applied to many critical elements of clinical trial design,
conduct, and analysis and was supported by the findings of the
Committee for Clinical Review, chaired by Dr. Robert Temple with
Ann Witt serving as co-chair, whose report became publicly
available in March 1993. The CDRH recognized the need for a separate
guidance document to address these concerns, and to clearly
document those elements needed for a well designed, conducted,
and analyzed device clinical trial.
The purpose of this document is to discuss important clinical
trial issues and not to describe the contents of a medical device
submission. It provides an explanation of each particular trial
element and discusses why it should be incorporated into the
clinical trial and what problems may be encountered if it is not
included in the investigation.
The goal of a good clinical trial is to provide the most
objective evaluation of the safety and effectiveness of the
medical device based on its intended claims. Anything in the
design, conduct, and analysis which impairs that objective
assessment lessens the ability of the Agency staff and their
advisory committees to make an informed decision concerning a
"reasonable assurance of safety and effectiveness" for a device.
The cost of any decision in the design, conduct, and analysis of
device clinical trials which may interfere with this objectivity
must be weighed against the cost of delays or disapprovals in the
review process encountered as a result of those decisions.
While this guidance serves as a road map and provides the key
elements of good clinical trial design, conduct, and analysis, it
is by no means exhaustive. Numerous books, only a few of which
have been referenced here, exist on the topic of clinical trial
design and the scientific literature is rich with papers on the
topic.
While the manufacturer may submit any evidence to convince the
Agency of the safety and effectiveness of its device, the Agency
may rely only on valid scientific evidence as defined in the PMA
regulation section entitled, "Determination of Safety and
Effectiveness" (21 CFR 860.7). A thorough reading of that
section is strongly recommended. It should be noted that while
the Agency does not prescribe specific statistical analyses for
given devices and/or situations, all statistical analyses used in
an investigation should be appropriate to the analytical purpose,
and thoroughly documented.
"Valid scientific evidence is evidence from well-controlled
investigations, partially controlled studies, studies and
objective trials without matched controls, well-documented case
histories conducted by qualified experts, and reports of
significant human experience with a marketed device, from which
it can fairly and responsibly be concluded by qualified experts
that there is a reasonable assurance of safety and effectiveness
of a device under its conditions of use" (GPO, 1993).
The regulation further states, "The valid scientific evidence
used to determine the effectiveness of a device shall consist
principally of well-controlled investigations as defined in
paragraph (f) of this section (860.7) unless the Commissioner
authorizes the reliance upon other valid scientific evidence
which the Commissioner has determined is sufficient evidence from
which to determine the effectiveness of the device even in the
absence of well-controlled investigations" (GPO, 1993). From
these passages it is clear the Agency intends to require
well-controlled clinical trials to provide the required reasonable
assurance of safety and effectiveness for medical devices.
"A clinical trial is defined as a prospective study comparing the
effect and value of intervention(s) against a control in human
subjects" (Friedman et al., 1985). In this definition,
intervention is used in the broadest sense to include
"prophylactic, diagnostic, or therapeutic agents, device
regimens, procedures, etc." (Friedman et al., 1985).
Additional insight into clinical trials is given in a definition
by Hill (1967), "The clinical trial is a carefully, and
ethically, designed experiment with the aim of answering some
precisely framed question." So, the clinical trial is an ethical
experiment in humans and as such requires informed consent and
Institutional Review Board (IRB) approval. Such considerations
require careful deliberation in the design and conduct of trials.
(This will be further addressed in the accompanying section on
clinical aspects of trials.)
A. The Trial Objective (The Research Question)
An effective and efficient design of a clinical investigation
cannot be accomplished without a clear and concise objective.
Usually the study objective is posed as a research question,
involving the medical claims for the device. This research
question should be formulated with extreme care and specificity.
A question such as "Is my device safe and effective?" is far too
general to be meaningful.
The question must be refined to effectively evaluate a particular
type of intervention. What is the proper way to evaluate
effectiveness in the target condition and population? What are
the unique safety concerns of the device intervention? Is the
device as effective as, or more effective than, another intervention?
If so, is it as safe or safer? Is the evaluation of safety and
effectiveness limited to a particular subgroup of patients? What
is the best clinical measure of safety and effectiveness?
The attempt to answer these and similar questions will provide an
essential focus to the trial and should provide the basis for
labeling indications. For example, if a new device has been
developed to treat a progressive, degenerative ophthalmic
disorder for which there currently exists an alternative therapy
using an approved device, how should effectiveness be determined?
Does the new device slow or halt degeneration? If so, does it
restore functions that had previously been lost? Does it reduce
pain or discomfort? Is it to be compared with the approved
device and is it thought to be as good as or better than the old
device for some purpose? Does it have fewer adverse reactions?
One can see that asking these questions will lead not only to a
focused study objective, but also will require the sponsor to
consider a number of other issues, such as a suitable endpoint or
outcome variable, a control population, the type of hypothesis
that might be tested and others.
These issues must be addressed prior to protocol development,
because one must determine if the stated research question can be
adequately addressed by designing a sound clinical trial. That
is, can we obtain specific and objective answer(s) to the
research question(s) by the collection, analysis, and
interpretation of data from the clinical trial?
B. Pilot or Feasibility Study
If a sponsor cannot answer the key questions necessary to focus
the trial because of insufficient experience with the device in
human populations, then the sponsor should design a limited human
study to gather essential information. The purpose of this
limited study (frequently called a pilot or feasibility study) is
to identify possible medical claims for the device, monitor
potential study variables for a suitable outcome variable, test
study procedures, refine the prototype device, and determine the
precision of those potential response variables. It may also
allow a limited evaluation of factors that may introduce bias. A
protocol for a pilot study should be submitted to the Agency,
usually as an Investigational Device Exemption (IDE) application.
Pilot studies are often used to field test the device. That is,
the sponsor has a good idea of the utility of the device and may
need a limited trial to test a theory or new technique, but the
pilot study should not be too broad, i.e., a "fishing
expedition". A number of issues related to the clinical trial
can be refined including device use, patient processing and
monitoring, data gathering and validation, and physician
capabilities and concerns. Care should be taken to refine the
measurements of critical variables, including potential outcome
variables and influencing variables including potential sources
of bias. However, it should be noted that in situations where
long-term endpoints are needed, these are usually not part of the
pilot study.
Pilot studies allow for limited hypothesis testing and are the
ideal place for exploratory data analyses, i.e., looking for
meaningful relationships between the device and outcome variables,
since exploratory methods will often yield research questions
that can be evaluated during the clinical trial.
C. Identification and Selection of Variables
The observations in a clinical study involve two types of
variables: outcome variables and influencing variables. Outcome
variables define and answer the research question and should have
direct impact on the claims for the device. These variables, also
known as response, endpoint, or dependent variables, should be
directly observable, objectively determined measures subject to
minimal bias and error. They should be directly related to
biological effects of the clinical condition and this
relationship itself may need validation. For example, it may be
necessary to perform preliminary laboratory, animal, or limited
human studies to determine that reducing a particular blood value
is in fact clinically meaningful before attempting to study a
device that claims to be safe and effective in decreasing this
value to specific levels.
Influencing variables, also known as baseline variables,
prognostic factors, confounding factors, or independent
variables, are any aspect of the study that can affect the
outcome variables (increase or decrease), or can affect the
relationship between treatment and outcome. Imbalances in
comparison or treatment groups in influencing variables at
baseline can lead to false conclusions by improperly attributing
an effect observed in the outcome variable to an intervention
when it was merely due to the imbalance.
For example, blood pressure generally increases with age. If the
individuals in the treatment group are significantly younger,
and therefore have lower mean pressures, than the subjects in
the control group, and the two groups are then compared using
blood pressure as the outcome variable, the investigators may
falsely conclude that the intervention was responsible for the
observed "reduction" in blood pressure. Appropriate statistical
testing of these baseline values should reveal any significant
imbalances between the two comparison groups before the trial begins.
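The baseline comparison described above can be sketched numerically. The following is a minimal illustration with hypothetical ages, using a large-sample z-approximation (rather than a formal t-test) to flag an age imbalance between arms:

```python
from statistics import NormalDist, mean, stdev

def baseline_z(group_a, group_b):
    """Two-sample z-statistic for a baseline variable (large-sample
    approximation); a large |z| flags an imbalance between arms."""
    na, nb = len(group_a), len(group_b)
    se = (stdev(group_a) ** 2 / na + stdev(group_b) ** 2 / nb) ** 0.5
    z = (mean(group_a) - mean(group_b)) / se
    p = 2 * (1 - NormalDist().cdf(abs(z)))  # two-sided p-value
    return z, p

# Illustrative ages: the treatment arm is noticeably younger.
treated_ages = [34, 36, 38, 40, 42, 44, 46, 48]
control_ages = [54, 56, 58, 60, 62, 64, 66, 68]
z, p = baseline_z(treated_ages, control_ages)
```

Here the small p-value would signal that the two arms differ in age at baseline, so any blood pressure difference could not safely be attributed to the intervention.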
In the development of a clinical trial design, extreme care
should be taken to identify those influencing variables that are
likely to affect the outcome. By taking such known or suspected
variables into consideration when designing the trial, the
sponsor minimizes the chance that conclusions drawn at the end of
the study will be spurious.
Once the variables or factors to be included in the trial have
been identified, the selection of measurement methods becomes
critical. The most informative and least subjective methods
should be used. Quantitative (continuous) variables are measures
of physical dimension (height, weight, circumference, area,
etc.). Qualitative or categorical (discrete) variables are
measures of distinct states usually represented by whole numbers
(alive or dead, healthy or diseased, tumor classes, etc.).
Quantitative data can contain more information than qualitative
data, and this generally allows for the use of more
mathematically sophisticated and statistically powerful
analytical methods. However, there may be situations where
qualitative data is most appropriate or the only information
available for a specific comparison, and there are many powerful
non-parametric or distribution-free techniques available for
these types of analyses. For example, quality of life
evaluations generally utilize these types of qualitative
analytical approaches.
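As one illustration of a distribution-free technique, the Mann-Whitney (Wilcoxon rank-sum) U statistic can be computed from ranks alone. This sketch uses only the Python standard library and average ranks to handle ties:

```python
def mann_whitney_u(x, y):
    """Mann-Whitney U statistic for sample x versus sample y,
    computed from the rank sum of x in the pooled data."""
    pooled = sorted(x + y)

    def rank(v):
        lo = pooled.index(v)        # first position of v (0-based)
        hi = lo + pooled.count(v)   # one past the last position of v
        return (lo + 1 + hi) / 2    # average of the 1-based ranks

    rx = sum(rank(v) for v in x)    # rank sum for sample x
    nx = len(x)
    return rx - nx * (nx + 1) / 2   # U statistic for x
```

Because U depends only on the ordering of the observations, it applies equally to ordered categorical data, where a comparison of means would be inappropriate.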
D. Study Population
The study population should be a representative subset of the
population targeted for the application of the medical device.
The study population should be defined before the trial by the
development of rigorous, unambiguous inclusion/exclusion
criteria. Clinical experts in the field of the device under
investigation should develop these criteria. These
inclusion/exclusion criteria will characterize the study
population and in this way help to define the intended use for
the device.
It is possible to narrowly define a study population such that it
is rather homogeneous in its composition. The advantage of using
a restrictive population is that it allows for a smaller sample
size in the clinical trial. That is, in homogeneous populations,
the variability in responses in general will be smaller than in a
more heterogeneous group, and this reduction in variability, (all
other critical factors being held constant), will result in a
corresponding decrease in the sample size required to observe a
specified significant difference between two groups.
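The variability/sample-size relationship described above can be made concrete with the standard two-sample formula for comparing means, n = 2 * sigma^2 * (z_(1-alpha/2) + z_power)^2 / delta^2 per group. This is a planning sketch only; the alpha and power values below are illustrative:

```python
from math import ceil
from statistics import NormalDist

def n_per_group(sigma, delta, alpha=0.05, power=0.80):
    """Per-group sample size for a two-sided, two-sample comparison
    of means, detecting a true difference delta with the given
    power when the common standard deviation is sigma."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)
    z_beta = z.inv_cdf(power)
    return ceil(2 * sigma ** 2 * (z_alpha + z_beta) ** 2 / delta ** 2)
```

Halving the standard deviation (e.g., by enrolling a more homogeneous population) cuts the required sample size to a quarter, which is the trade-off the passage above describes.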
The disadvantage is that it may limit generalization of the
approval to a narrow subset of the general population as defined
by the criteria. Thus, a sponsor should discuss how it intends
to define the study population with the reviewing division in the
Office of Device Evaluation before beginning the clinical trial.
Inclusion/exclusion criteria should include an assessment of
prognostic factors for the outcome variable(s), since one or more
of these variables may influence the effectiveness of the device.
For example, gender may be a prognostic factor for a particular
disease process. It seems reasonable then to assess what role,
if any, that gender might play in device assessment and then
determine inclusion/exclusion criteria, other design, and
analytical considerations accordingly. Consideration should also
be given to: patient age; concomitant disease, therapy or
condition (at both baseline and subsequent follow-up times);
severity of disease; and others.
E. Control Population
Every clinical trial intended to evaluate an intervention is
comparative, and a control exists either implicitly or
explicitly. The safety and effectiveness of a device is
evaluated through the comparison of differences in the outcomes
(or diagnosis) between the treated patients (the group on whom
the device was used) and the control patients (the group on whom
another intervention, including no intervention, was used). A
scientifically valid control population should be comparable to
the study population in important patient characteristics and
prognostic factors, i.e., it should be as alike as possible
except for the application of the device.
There are many types of control groups. For the purposes of this
document, four types are described: active concurrent controls
(a comparison group receiving another active intervention during
the same period), passive concurrent controls (a concurrent group
receiving no active intervention), self-controls (each patient
serving as his/her own control, as in a crossover design), and
historical controls (a comparison group observed at an earlier
time or a different place).
A washout period refers to allowing a period of time to
elapse between the end of one experimental condition
and the beginning of the next condition. The period of
time between the two interventions should be based on
current knowledge of how the device may affect any
anatomical or physiological processes, so that it may
be demonstrated that no residual effects of the first
treatment remain which may confound the results
obtained from the next scheduled treatment.
It should be noted that there will still be instances
where a patient may serve as his/her own control even
if a crossover design is not necessary or appropriate.
For example, a crossover design would not be necessary
when it can be clearly demonstrated that current
clinical consensus has determined that there are no
residual effects of a device beyond the immediate
treatment of the patient.
Concurrent controls and, where applicable, self-controls allow
the largest degree of opportunity for comparability. Passive
concurrent controls can provide comparability only if the
selection criteria are the same, the study variables are measured
in precisely the same way as those in the study sample, and
assuming there are no hidden biases.
The use of historical controls is the most difficult way to
assure comparability with the study population, especially if the
separation in time or place is large. The practice of medicine
and nutrition is dynamic - hygiene and other factors change as
well. Subtle differences (secular trends) in patient
identification, concurrent therapies, or other factors can lead
to differences in outcomes from a standard therapy or diagnostic
algorithm. Such differences in patient selection, therapy or
other factors may not be easily or adequately documented. These
differences in outcome may be mistakenly attributed to a new
intervention when compared to a historical control observed at a
significantly different time and/or place.
In addition, it is often difficult or impossible to ascertain
whether the measurement of critical study variables was
sufficiently similar to those used in the current trial to allow
comparison. It should not be assumed that the measurement
methods are equivalent. For these reasons, historical controls
will usually require much more work to validate comparability
with the study population than would concurrent controls.
F. Methods of Assigning Intervention
A method of assigning treatments or interventions to patients
must minimize the potential for selection bias to enter the
study. Selection bias occurs when patients possessing one or
more important prognostic factors appear more frequently in one
of the comparison groups than in the others. For example, if we
know that the mortality from a condition is twice as likely in
males than in females, and that one group had a two-to-one ratio
of males to females, and a second group had a two-to-one ratio of
females to males, then a difference in mortality will appear
between these two groups with no intervention effect. If an
intervention is assigned to one of these groups, its effect on
mortality will be confounded, i.e., inseparably mixed, by the
effect of gender.
Appropriate steps must be taken to assure that imbalances among
known or suspected prognostic factors are minimized. The
preferred method for protecting the trial against selection bias
is randomization. The process of randomization assigns patients
to intervention or control groups such that each patient has an
equal chance of being selected for each group. If the trial is
large with a limited number of comparison groups, randomization
tends to guard against imbalances of prognostic factors.
It also protects the trial from conscious or subconscious actions
on the part of the study investigators which could lead to
non-comparability, e.g., assigning (or selecting) the most seriously
ill patients to the therapy thought by the physician to be the
more aggressive treatment.
Finally, randomization provides a fundamental basis on which most
statistical procedures are founded. Generally, randomization
methods utilize random number tables, computer generated
programs, etc. Specific methods of randomization with examples
are discussed in textbooks on clinical trials and medical
statistics (Friedman et al, 1985; Fleiss, 1986; Hill, 1967;
Pocock, 1983). The method of randomization used in a trial
should be specified.
On occasion, when trial sizes are small and/or the number of
comparison groups is large, simple randomization may not provide
adequate balance among prognostic factors within comparison
groups. In such situations it may be reasonable to form
subgroups, called strata, by grouping subsets of selected
prognostic variables.
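One common scheme of this kind is permuted-block randomization carried out separately within each stratum. The sketch below (arbitrary block size and strata, chosen for illustration only) keeps the arms within each stratum from drifting far out of balance:

```python
import random

def permuted_blocks(n_patients, block_size=4, seed=0):
    """Assignment list alternating treatment 'T' and control 'C'
    in randomly permuted blocks, so within every completed block
    the two arms are exactly balanced."""
    rng = random.Random(seed)
    assignments = []
    while len(assignments) < n_patients:
        block = ['T'] * (block_size // 2) + ['C'] * (block_size // 2)
        rng.shuffle(block)              # random order within the block
        assignments.extend(block)
    return assignments[:n_patients]

# One independent assignment stream per stratum (e.g., gender):
schedule = {stratum: permuted_blocks(20, seed=i)
            for i, stratum in enumerate(['male', 'female'])}
```

In practice the seed would come from a documented randomization procedure, and the schedule would be concealed from investigators to avoid the predictability problem discussed below.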
Other methods of treatment assignment can be devised for active
concurrent controls but, unless a true randomization scheme is
used, it is difficult for the sponsor to assure that the
resulting assignments are free from systematic or other possible
biases. For example, assigning the intervention to patients in
some systematic order, say every other or every third patient,
seems random. However, such periodic assignments can sometimes
coincide with cyclical patterns of patient presentation at the
clinic such that imbalances can occur or can lead to selection
bias because the intervention assignment is predictable. Thus,
systematic or patterned intervention assignments are best
avoided.
The intervention assignment process should be routinely monitored
to assure crude balance in the important factors that are known
or suspected to affect outcome. There are grouped randomization
schemes which automatically preserve balance, while other methods
require monitoring and adjustment. Caution must be exercised in
adjusting randomization methods to assure that the random nature
is preserved. For example, some imbalance between intervention
and control group is tolerable because adjustment methods exist
in analysis which can be applied to make the groups comparable.
Large imbalances cannot be adequately adjusted by such techniques
and should be avoided by employing appropriate randomized
assignment.
G. Specific Trial Designs
There are numerous trial designs available to the sponsor. The
choice of a particular design depends on many factors including
the hypotheses to be tested, number and impact of baseline
characteristics on the outcome variable(s); number of study
sites; number of therapeutic or diagnostic categories to be
measured, etc. Some of the more elementary designs are discussed
in this section for reference. More complete discussions of
experimental designs can be found in Cox, (1958) and Cochran and
Cox (1957).
The simplest and most common trial design is the parallel design.
In this design, a patient series from the study population has
its baseline characteristics determined, is assigned one of two
or more interventions, receives the assigned intervention, and is
monitored at specified times after the intervention to determine
outcome. If balance is achieved in the prognostic factors and
follow-up is thorough, the analysis and interpretation from a
parallel design should be straightforward.
The crossover design is a modification of the parallel design
with the patient used as his/her own control. In this design,
each patient is assigned an order (presumably random) in which
two or more interventions are to be given, followed by a period
between interventions (or specimen collections) for a washout of
any carry over effect from the previous intervention. These
assignments should be made by randomization to protect against
hidden or unknown biases. The conduct of a crossover design is
somewhat more complicated than parallel designs and requires
closer monitoring.
Analyses for crossover designs are also more complicated because
the patient's response to any particular intervention is usually
correlated with the response to another intervention. This is
because more than one intervention is applied to the same patient
and the response is likely to be influenced heavily by that
patient's individual characteristics. However, patient-to-patient
variability is controlled by employing a crossover
design.
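How a crossover controls patient-to-patient variability can be seen in a toy calculation: within-patient differences cancel each patient's own level, so their spread is much smaller than the spread of the raw responses. The numbers below are hypothetical, and a real crossover analysis would also account for period and carryover effects:

```python
from statistics import mean, stdev

# Hypothetical paired outcomes (device response, control response)
# for the same patient under a two-period crossover.
paired = [(7.1, 6.0), (5.4, 5.0), (9.0, 7.8), (6.2, 5.9), (8.3, 7.1)]

diffs = [d - c for d, c in paired]   # within-patient differences
effect = mean(diffs)                 # estimated device-minus-control effect

# Each patient's overall level cancels in the difference, so the
# spread of the diffs is smaller than the spread of raw responses.
sd_within = stdev(diffs)
sd_raw = stdev([d for d, _ in paired])
```

The smaller within-patient standard deviation is precisely the efficiency gain a crossover design offers when it is appropriate.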
A third design that is applicable in medical device clinical
trials is the factorial design. In a simple version of a
factorial design, patients in the study population are assigned
to one of four groups: one of the two interventions under study
alone, a control intervention, or both interventions in
combination. Such a trial may be
used if a medical device was being tested against an alternate
therapy, say a drug, and the research question is to determine if
either intervention acting alone was effective, or if in
combination they "interacted" to produce a stronger beneficial or
detrimental effect.
The negative aspect of this design is that it is more complicated
to conduct and the sponsor must assure that investigators are
adhering to the study protocol.
A factorial design may require a larger sample size, but since
this type of design is essentially two clinical trials in one, it
offers an efficiency that should not be overlooked. If a drug
intervention is proposed for a factorial design, the sponsor will
have to adhere to the requirements of the Center for Drug
Evaluation and Research if the drug is not already approved for
the proposed claim.
Other aspects of experimental design, such as blocking or
stratification, may further complicate the evaluation. The
design chosen for a particular study must be the one that is most
applicable to the sponsor's objectives. These objectives may
appropriately result in complicated studies that need to be
developed, monitored, and evaluated carefully. Sometimes, less
complicated designs can be used by limiting the scope of the
trial. Such a move, however, should be very carefully considered
because it will nearly always result in a restriction on the
claims for the device.
H. Masking (or Blinding)
Three of the more serious biases that may occur in a clinical
trial are investigator bias, evaluator bias, and placebo or sham
effect. An investigator bias occurs when an investigator either
consciously or subconsciously favors one group at the expense of
others. For example, if the investigator knows which group
received the intervention, he/she may follow that group more
closely and thereby treat them differently from the control group
in a manner which could seriously affect the outcome of the
trial.
Evaluator bias can be a type of investigator bias in which the
person taking measurements of the outcome variable intentionally
or unintentionally shades the measurements to favor one
intervention over another. Studies that have subjective, or
quality of life, endpoints are particularly susceptible to this
form of bias.
The placebo or sham effect is a bias that occurs when a patient
is exposed to an inactive therapy mode but believes that he/she
is being treated with an intervention and subsequently shows or
reports improvement.
To protect the trial against these potential biases, masking
should be used. The degree of masking needed depends on the
strength and seriousness of the potential bias. Single mask
designs shield the patient from knowing what intervention has
been assigned. Double mask trials shield both the patient and
the study investigator.
Third party mask trials allow the patient and investigator to
know the intervention assignment but restrict the evaluator,
i.e., the third party, from knowing, such as in the reading of
imaging films or laboratory tests.
Masking is accomplished by coding the interventions and having an
individual who is not on the patient care team control the key to
breaking the code. The bias introduced by breaches in masking
can be very difficult to assess in the analysis, therefore it is
important not to break the code until the analysis is completed.
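The coding arrangement described above might look like the following sketch. The code format is hypothetical; in practice the key would be generated and held by an independent party, off the patient care team, and secured until the analysis is complete:

```python
import random

def coded_assignments(n, seed=42):
    """Randomly assign n patients (n even) to 'device' or 'control'
    and return (codes, key): the care team sees only opaque subject
    codes; the key mapping codes to arms is held separately."""
    rng = random.Random(seed)
    arms = ['device', 'control'] * (n // 2)
    rng.shuffle(arms)
    codes = [f"SUBJ-{i:03d}" for i in range(1, n + 1)]
    key = dict(zip(codes, arms))   # locked away until analysis ends
    return codes, key

codes, key = coded_assignments(10)
```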
The evolution of medical device evaluation has demonstrated that
it is often difficult or impossible to mask the patient or
investigator because a placebo or convincing sham treatment may
not be feasible. In such cases extra care must be exercised by
the study staff to assure that these biases are minimized by
assuring that the evaluator is blinded to the assignment of
patients to a particular intervention or control group.
I. Study Site and Investigator
Because pooling of data across study sites and investigators is
almost always necessary in order to attain the required sample
size, the selection of study sites and investigators is critical
in planning a clinical trial.
The sites that have been selected must have sufficient numbers of
eligible patients who are representative of the target population
for the device. Each site must have facilities that are capable
of processing patients in the manner prescribed by the protocol,
and must have staff who are qualified to conduct the trial. It
should be noted, however, that despite a common protocol and the
best efforts of the study monitor, site effects may be present
which can invalidate pooling the data. A careful analysis to
rule out potential bias due to site effects is an important part
of the investigational protocol.
The principal investigator at each site must be able to recruit
eligible patients to the trial and must be willing to abide by
the procedures established by the protocol. Potential
investigators may overestimate their capabilities to recruit and
process study patients, so a review of the demographics and
records of patients for a recent calendar period is advisable.
If the investigator consistently violates the protocol, the data
from that site cannot be used to establish the safety and
effectiveness of the sponsor's device.
Participating physicians have a primary responsibility to their
patients and must provide for individual patients what they
consider to be the best medical care. While there is no question
a physician must do what is best for the patient, if a specific
treatment regimen happens to violate the protocol, a patient
enrolled in the study becomes disqualified from the trial and
that patient's data cannot be used in the analysis.
The clinical trial is basically an experiment in a human
population and as such differs from the routine practice of
medicine. It should be noted that in many investigations, the
Center may require an intention-to-treat analysis, in which the
data of disqualified patients are recorded as failures. Clearly, a
relatively small number of patients that are disqualified in an
intention to treat model could have a substantial impact upon the
final analyses.
It should be clear, then, that deviations from the protocol by
particular investigators for individual patients may create
substantial problems for the trial analysis. Ultimately, it is
the sponsor's responsibility to assure investigator compliance
with the protocol. Potential investigators who for whatever
reasons indicate that they may not be willing to strictly adhere
to the protocol throughout the course of the investigation should
not be asked to participate in the clinical trial.
J. Sample Size and Statistical Power
A discussion of sample size and statistical power requires
knowledge of some elementary statistical principles which will be
briefly reviewed here.
The object of the clinical trial is to collect data concerning
the safety and effectiveness of a device in a sample of the
target population. Statistical analysis is then used to infer
relevant information concerning properties of the target
population from the observations of those same properties in the
trial sample. These inferences require that the research
questions be translated into numerical statements of
relationships of those population properties. Tests of the
stated hypotheses should provide unequivocal answers to the
research questions.
For example, suppose the research question is: "For some disease
A, is the mean value of a critical outcome variable after the
prescribed treatment greater in the device-treated group than in
the control group?" Two hypotheses would be formed: a null
hypothesis stating that the post-treatment mean in the treatment
group is equal to (or worse than) that in the controls, and an
alternative (or research) hypothesis stating that the
post-treatment mean in the treatment group is greater than that
in the controls.
There are two types of decision errors that can be made by
inferring results from a sample to the population. If the sample
indicates that the mean is greater in the device treated group
than in the controls (i.e., rejecting the null hypothesis) when
in the population there is no difference between means, a Type I
error (also called an alpha error) is made. If, on the other
hand, the sample indicates no difference between means, (i.e.,
accepting the null hypothesis), when the device mean is actually
greater, then a Type II error is made. The probability of making
a Type II error is also known as Beta error, and statistical
power is defined as 1 - Beta.
The probabilities of these two types of errors factor heavily
into all sample size calculations for hypothesis tests (see
Section VIII Appendix on Sample Size for a more thorough
discussion). Usually these probabilities are fixed in advance,
giving more weight to the error with the more serious
consequences.
For example, suppose the aim of the trial is to show that the
test device is "better than" the control. A Type I error occurs
if we falsely reject the null hypothesis and conclude that the
device is better than the comparison device, when in fact it is
equivalent to, or even worse than, the control. Conversely, if
the object of the trial is to show that the device mean survival
is "as good as" (really, "no worse than") that of the control,
then it would be more serious to accept a false null hypothesis
(a Type II error).
Additionally, clinical trial hypothesis tests should involve
clinically meaningful differences, that is, those differences in
the outcome variable(s) determined by experts in the medical
community to be clinically significant. The most common sample
size formulas place an estimate of the variability of the
outcome variable in the numerator and the clinically meaningful
difference to be detected in the denominator. Thus, for a given
outcome variable, the larger the variability, the larger the
sample size that will be required. Similarly, for a given
variability, the smaller the clinical difference to be detected,
the larger the sample size.
Meinert (1986) provides an excellent discussion of these
computations for both sample size and power.
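As a numerical illustration of these relationships, the sketch below computes an approximate per-group sample size for a one-sided comparison of two means using the normal approximation. This is illustrative only, not a method prescribed by this guidance; the values chosen for sigma (the outcome variability) and delta (the clinically meaningful difference) are hypothetical.

```python
from scipy.stats import norm

def n_per_group_means(sigma, delta, alpha=0.05, beta=0.20):
    """Approximate per-group sample size for a one-sided,
    two-sample comparison of means (normal approximation)."""
    z_alpha = norm.ppf(1 - alpha)  # deviate for the Type I error rate
    z_beta = norm.ppf(1 - beta)    # deviate for the Type II error rate
    return 2 * ((z_alpha + z_beta) * sigma / delta) ** 2

# Hypothetical values: outcome standard deviation of 10 units,
# clinically meaningful difference of 5 units.
n = n_per_group_means(sigma=10, delta=5)
# Doubling the variability, or halving the difference to be
# detected, quadruples the required sample size.
```

Note how the variability enters the numerator and the difference to be detected enters the denominator, exactly as described above.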
Each well-designed clinical trial should have a detailed
protocol, i.e., the comprehensive plan that precisely describes
how the trial is to be conducted and how the clinical data are to
be collected and analyzed.
The protocol may be submitted to the Agency as part of an IDE or
as an IDE supplement, but those study protocols not submitted as
part of an IDE must be included in the submission of the PMA.
The following points should be included in the protocol and
determined before initiating the trial:
If a detailed protocol is established that completely describes
the trial design, relevant methodologies, and the proposed
analysis, then conducting the trial should be straightforward.
However, it will not be simple or routine. It is imperative that
those charged with the oversight of the clinical trial have
contingency plans available for unforeseen problems that may
occur during the trial and have means to rapidly implement those
plans.
Contingency plans should be carefully crafted with the goal of
preserving the integrity of the established design. Any
modification of the protocol may reduce the efficiency of the
design. It is difficult to envision, however, any clinical trial
conducted precisely as it was designed. Therefore, it is wise to
anticipate possible problems and have plans to address them if
they occur.
A. Trial Monitoring
The primary concerns in conducting the clinical trial lie in
assuring that the study subjects are entered, the interventions
assigned, the relevant variables measured (at the appropriate
times), and the data accurately and completely recorded as
specified in the protocol. This requires extreme care by the
trial sponsor to closely monitor the conduct of the trial. A
designated trial monitor should assure compliance with the
protocol and identify potential weaknesses that may require
modification of the protocol.
Clinical trials generally incorporate multiple study sites with
one or more investigators at each location. It is critical to
the integrity of the trial that the monitor assure that each site
and investigator is executing the protocol just as it was
planned.
For example, if a modification of the protocol is thought to be
necessary by one or more investigators and the trial is not
closely monitored, it is possible that each site or investigator
will modify the protocol in his/her own way. This could result
in as many distinct protocol changes as there are sites or
investigators, thus jeopardizing the ability to pool the trial
results.
If the investigator consistently violates the protocol, the data
from that site cannot be used to establish the safety and
effectiveness of the sponsor's device. To avoid this
possibility, the sponsor should establish a mechanism to consider
protocol modification, and appoint a monitor or gatekeeper to
ensure that all sites and investigators make the same
modification at the appropriate time.
B. Baseline Evaluation
Whether or not the clinical trial will use randomization, the
baseline observations should be made on all prospective study
patients before assigning or applying an intervention. The
accurate determination of baseline information on all study
subjects is critical for a number of reasons. It allows:
The assessment of baseline data is instrumental in the
identification of prognostic factors which must be balanced among
intervention groups. That is, the patient's current disease
status; concomitant medication, therapy, or condition; age;
gender; socioeconomic status; prior disease history; and other
factors may affect the outcome variable. The assessment of
baseline data allows for the selection and implementation of
methods that minimize the impact of any potential bias on the
comparison of outcome measures. For example, for those
prognostic factors known to affect outcome, stratification or
balanced allocation can be used at the time interventions are
assigned.
If a prognostic factor is discovered during the course of the
trial and adequate baseline measurements exist, then adjustment
or standardization methods can be employed during data analysis
to minimize the effect of imbalance on comparisons.
C. Intervention
The assignment and application of the intervention should be done
with strict adherence to the protocol. A pre-specified regimen
should be followed on every subject. In so far as it is
possible, every procedure scheduled for the treatment group
should also be scheduled for the control group except for the
active application of the device. If the individual
administering the treatments is masked to the intervention group
assignment, it is more likely that all groups will be treated the
same way.
D. Follow-Up
The follow-up of subjects after intervention extends beyond the
simple scheduling of follow-up appointments for the study
subjects. Mechanisms should be in place to assure a high degree
of subject compliance with the follow-up schedule. Even moderate
deviations in follow-up between comparison groups can lead to
substantial biases in the analysis.
Two characteristics of follow-up are critical: completeness and
duration. Completeness is defined as the proportion of patients
entering the trial who come back for each and every follow-up
appointment. It is extremely important that this proportion be
as close to 100% as possible, because statistical power will
decrease as patients are lost to follow-up. Follow-up
percentages of less than 80% are generally considered poor and
these trials are labeled incomplete. It is also important for
the follow-up percentages to be similar across comparison groups
and across study sites.
Incomplete follow-up is a major concern in analysis. The trial
must have procedures available to trace subjects who fail to
appear for scheduled follow-up. Accounting for patients lost to
follow-up is a critical analytical issue because those patients
may provide the most important information from the clinical
trial, particularly if the outcome in such patients is poor. So,
it is essential to determine the health status of all patients
entered into the trial even for those who do not return to the
clinic for all follow-up appointments.
The duration of follow-up is that period of time after the
intervention during which the study subjects are scheduled to be
observed and evaluated. Follow-up duration must be consistent
with safety and effectiveness claims, i.e., it must equal the
duration of claimed effectiveness and must also be long enough to
accurately estimate the rate of known or suspected adverse
events. The duration of follow-up should also be the same across
comparison groups and study sites.
E. Collection and Validation of Data
Methods for obtaining and verifying the accuracy of all measured
variables in the trial must be in place before the trial begins
and must be monitored for compliance. Each study site must have
sufficient staff with suitable expertise to assure the collection
of valid data. Attention to detail is critical because it is
impossible to retrospectively assess data not taken at the
scheduled time or data taken without adequate precision.
These methods must include quality-control techniques for data
measurement, recording, transfer to electronic media, and
verification. The measurement of trial variables begins with an
unequivocal definition of each variable, condition, or
characteristic to be observed in the trial. Trial staff should
completely understand all defined terms, and care must be taken
to assure consistency across investigators and study sites.
Consistency of trial terminology is also essential for
comparisons with other trials or research studies in the
literature, and for use of historical controls, where
appropriate.
By the time the clinical trial reaches the analysis stage, the
analysis should already have been determined in the protocol,
except for the handling of deviations that may have unexpectedly
occurred during the trial. The protocol, as revised by any
alterations made during the trial, dictates what can or cannot be
done in the statistical analysis. In most cases, large biases
that have been introduced by any element of trial conduct and
that affect the observations of the outcome variables cannot be
satisfactorily rectified by statistical adjustment procedures.
A. Validations of Assumptions
Before beginning a detailed statistical analysis it is necessary
to validate the assumptions to be used in the proposed analysis.
Such assumptions include underlying characteristics of the
probability distribution used for hypothesis tests or estimation,
similarity of distribution of prognostic factors among study
sites and comparison groups, and validation of suspected
relationships (dependence) or lack of relationship (independence)
among variables.
It is quite important to validate the distribution and variance
assumptions of the statistical test to be used. A test statistic
possesses the properties of the test only if all assumptions are
valid. For example, if the normal (Gaussian) distribution is
assumed, the data should be tested by appropriate statistical
techniques to be certain that the sample does not deviate
substantially from that which would be predicted by the normal
distribution. If it does, then other more appropriate tests such
as non-parametric (distribution-free) procedures should be used.
Likewise if the test requires equal variance among comparison
groups, an appropriate procedure to detect unequal variances
should be used. If unequal variances are detected, either the
data will have to be adjusted or transformed to account for the
unequal variances, or the statistical test will have to be
modified.
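As one possible illustration of this kind of assumption checking, the sketch below screens for non-normality with the Shapiro-Wilk test and for unequal variances with Levene's test, then falls back to Welch's t-test or the distribution-free Mann-Whitney procedure as needed. The data are hypothetical, and the choice of tests and the 0.05 screening cutoff are illustrative, not requirements of this guidance.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
treated = rng.normal(5.0, 1.0, 40)  # hypothetical outcome data
control = rng.normal(4.5, 1.5, 40)

# Screen each sample for departure from normality (Shapiro-Wilk).
normal = all(stats.shapiro(x).pvalue > 0.05 for x in (treated, control))
# Screen for unequal variances between groups (Levene's test).
equal_var = stats.levene(treated, control).pvalue > 0.05

if not normal:
    # Distribution-free (non-parametric) fallback.
    result = stats.mannwhitneyu(treated, control, alternative="greater")
elif not equal_var:
    # Welch's t-test does not assume equal variances.
    result = stats.ttest_ind(treated, control, equal_var=False,
                             alternative="greater")
else:
    result = stats.ttest_ind(treated, control, alternative="greater")
```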
An evaluation of the balance of prognostic factors across
comparison groups and study sites is also necessary. Any
observed imbalances must be adjusted so that the ultimate
comparison is made between comparable samples. Analysis of
covariance is a powerful statistical adjustment tool if the
number of variables that require adjustment is small and the
variables are highly correlated to the response variable. If the
number of variables requiring adjustment is large, it is more
difficult to adequately account for all of them. It is critical
that extreme care be exercised in the conduct of the trial
because in the words of Hill (1967) "to start out without thought
and with all and sundry included, with the hope that the results
can somehow be sorted out statistically in the end, is to court
disaster."
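A minimal sketch of such a covariance adjustment with a single baseline prognostic factor, using simulated (hypothetical) data: the baseline-adjusted treatment effect is estimated as the coefficient of the group indicator in a linear model that also includes the baseline covariate.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 60
group = np.repeat([0, 1], n // 2)      # 0 = control, 1 = device
baseline = rng.normal(50.0, 10.0, n)   # hypothetical prognostic factor
# Simulated outcome: depends on baseline plus a treatment effect of 3.
outcome = 0.5 * baseline + 3.0 * group + rng.normal(0.0, 2.0, n)

# Analysis of covariance expressed as the linear model
#   outcome = b0 + b1 * group + b2 * baseline
X = np.column_stack([np.ones(n), group, baseline])
coef, *_ = np.linalg.lstsq(X, outcome, rcond=None)
adjusted_effect = coef[1]  # estimate of the treatment effect
```

With a small number of well-measured covariates that are strongly related to the response, this kind of adjustment recovers the treatment effect despite baseline imbalance; with many weakly related covariates it becomes unreliable, which is the point of the Hill quotation above.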
If the analysis assumes that certain prognostic or response
variables are unrelated to outcome, appropriate statistical tests
should be performed to confirm these assumptions. Performing
tests on variables that are assumed to be independent, but are in
fact related, or dependent, can lead to significant errors in
tests of hypotheses.
B. Hypotheses and Statistical Tests
In essence, all comparative analyses result in a hypothesis test.
The report of the analysis should clearly state the hypotheses to
be tested, the statistical tests to be used, and the assumptions
behind the tests. All procedures should be referenced so that
the Agency can validate the procedure.
References should be provided even for common procedures. If any
innovative analytical procedures are developed by the sponsor,
complete documentation of those procedures must accompany the
analysis.
In some instances it may be appropriate to use available
(historical) data to develop a mathematical model of the
progression or other characteristic of a disease or condition.
Data gathered in a clinical trial could be used to "validate" the
model by comparing the projected characteristics of the model
with results obtained during the investigation. These types of
comparisons can be used to form a hypothesis test of the model
characteristics.
C. Pooling
It is almost always necessary for the sponsor to pool study
subjects across investigational sites in order to obtain adequate
sample sizes. Pooling must be justified by testing balance among
prognostic factors and verifying that all clinical procedures
were conducted in the manner prescribed in the protocol. On
occasion, data from a given study site will exhibit
characteristics that make it stand out from the other locations.
The sponsor must investigate all relevant site effects, determine
why that particular site's results differed, and report these
findings.
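One simple screen for such a site effect on a binary outcome is a chi-square test of homogeneity of success rates across sites, sketched below with hypothetical counts. A significant result signals that pooling needs further justification and investigation, not that the data must be discarded.

```python
from scipy.stats import chi2_contingency

# Hypothetical success/failure counts at three study sites.
site_counts = [[45, 5],    # site 1: 90% success
               [42, 8],    # site 2: 84% success
               [30, 20]]   # site 3: 60% success -- stands out

chi2, pvalue, dof, expected = chi2_contingency(site_counts)
if pvalue < 0.05:
    print("Site-by-outcome association detected; "
          "investigate before pooling.")
```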
D. Accountability for Patients
The sponsor should be prepared to use extensive measures to
document the post-trial health status of every patient who was
enrolled in the trial. While it is often not possible to find
all patients, the sponsor must demonstrate that everything
possible was done to attempt to find patients lost to follow-up.
It is not appropriate to coerce patients to keep follow-up
appointments, but, at the very least, a reasonable assessment of
the morbidity or mortality of each patient should be made.
Sometimes a determination of safety and effectiveness will hinge
on the differences of a small subset of patients in the
comparison groups. If the number of patients lost-to-follow-up
is large relative to the subset that has been observed to be
different, then our ability to document safety and effectiveness
is substantially weakened.
The Agency will require an analysis of the data by
"intention-to-treat." This is an analysis method in which "the
primary tabulations and summaries of outcome data are by
assigned treatment" (Meinert, 1986). In such analyses, patients
lost to follow-up in the intervention and control groups must be
counted as though they actually completed the study in their
assigned group. Since there is no observation of the outcome
variable after the time the patient is lost to follow-up, the
observation cannot be counted as a success and is therefore
considered a failure.
The impact of intention-to-treat analyses on interventions that
may be effective but for which there is a large number of
patients lost-to-follow-up can be devastating. An observation of
effectiveness in the intervention trial patients who are followed
can be eclipsed entirely by a large number of patients lost to
follow-up whose outcomes are recorded as ineffective. It is
crucial, therefore, to keep the number of patients lost to
follow-up as small as possible.
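The arithmetic behind this concern can be made concrete with a hypothetical sketch: counting patients lost to follow-up as failures in their assigned group pulls the intention-to-treat success rate well below the rate observed among followed patients.

```python
def itt_success_rate(successes, failures, lost_to_follow_up):
    """Intention-to-treat: patients lost to follow-up are counted
    as failures in their assigned group."""
    enrolled = successes + failures + lost_to_follow_up
    return successes / enrolled

# Hypothetical intervention arm: 80 observed successes and 10
# observed failures among followed patients, 30 patients lost.
observed_rate = 80 / (80 + 10)           # rate among those followed
itt_rate = itt_success_rate(80, 10, 30)  # intention-to-treat rate
# The intention-to-treat rate is substantially lower.
```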
If we let pi be the proportion surviving two years in the
intervention group and pc be the proportion surviving two years
in the control group, then numerically the hypotheses are stated
as:
Ho: pi = pc
Ha: pi > pc.
In the study population, one of these two conditions is true.
If, based on the data, we reject Ho (and accept Ha) when Ho is
true, we make a Type I statistical error. When we accept Ho (and
reject Ha) based on the data when in fact Ha is true in the study
population, then we make a Type II statistical error. The object
of the sample size estimate is to minimize the chances of making
either of these types of errors.
Probability statements are used to determine the chances of
making Type I or Type II errors. The probability is based on the
distribution of possible values for the outcome variable, or in
the case of our example pi or pc.
In common statistical notation, alpha designates the probability
of making a Type I error, i.e., the probability of rejecting the
null hypothesis when it is true. The probability of the Type II
error, i.e., the probability of accepting the null hypothesis
when it is false, is denoted by beta. The statistical power of a
test method is the probability that the null hypothesis will be
rejected when it is false. The power is denoted 1 - beta.
The required sample size can be estimated from the usual
normal-approximation formula for comparing two proportions:

n = [ z(alpha) * sqrt(2pq) + z(beta) * sqrt(pi*qi + pc*qc) ]^2 / d^2

where n = the sample size for each comparison group (intervention
and control), z(alpha) and z(beta) are the standard normal
deviates corresponding to the chosen error probabilities, p is
the average of pi and pc, q = 1 - p, qi = 1 - pi, qc = 1 - pc,
and d = pi - pc is the difference to be detected.
If the claim is that the two-year survival of the intervention is
"as good as" (or no worse than) the control intervention, then
the object would be to make beta as small as feasible. When an
"as good as" hypothesis is being tested, the test is attempting
to "prove" the null hypothesis. The failure to reject the null
hypothesis can occur under two conditions: either the two
probabilities are truly not different, or they are different but
the sample is too small (too little power to detect the observed
difference). If beta is small, the power (1 - beta) to detect the
specified difference d is large. Under the "as good as"
hypothesis it is not unusual for beta to be 0.1 or even 0.05.
The difference, d, is also dependent on the claim. If the
hypothesis involves the claim of "better than," then d is that
increase in two year survival considered by the medical community
to be clinically meaningful. If the hypothesis involves the
claim of "as good as," then d is that decrease in the two-year
survival considered by the medical community to be clinically
significant.
Whenever possible, the determination of d should be based on
previous data. Where data are not available, it may be necessary
to convene a panel of medical experts to provide a value for d
which is considered by the panel to be reasonable. In either
situation, the sponsor should provide a detailed justification
for the choice of the d used in the calculation.
The final elements of the formula are estimates of the
variability of pi, pc, and p. The term 2pq estimates the
variability of the difference under the null hypothesis, i.e.,
pi = pc, where p is the average of pi and pc and q = 1 - p. The
term (piqi + pcqc) estimates the variability of the difference
under the alternative hypothesis, i.e., pi > pc.
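Putting these elements together, the two-proportion sample size calculation can be sketched as follows (normal approximation; the claim of 80% two-year survival for the device versus 70% for the control is hypothetical).

```python
from math import ceil, sqrt
from scipy.stats import norm

def n_per_group(p_i, p_c, alpha=0.05, beta=0.20):
    """Per-group sample size for a one-sided test of Ha: pi > pc,
    using the normal-approximation formula whose terms are
    described in the text."""
    d = p_i - p_c              # clinically meaningful difference
    p_bar = (p_i + p_c) / 2    # average proportion under Ho
    z_a = norm.ppf(1 - alpha)
    z_b = norm.ppf(1 - beta)
    num = (z_a * sqrt(2 * p_bar * (1 - p_bar))
           + z_b * sqrt(p_i * (1 - p_i) + p_c * (1 - p_c))) ** 2
    return ceil(num / d ** 2)

# Hypothetical two-year survival proportions: 80% vs 70%.
n = n_per_group(0.80, 0.70)
```

Shrinking the difference d to be detected sharply increases the required sample size, as the formula's denominator makes plain.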
Updated 1/23/96