SECTION 3
METHODS OF EVALUATION

Qualitative Methods
    Introduction
    Personal Interviews
    Focus Groups
    Participant-Observation
    General Information
Quantitative Methods
    Introduction
    Counting Systems
    Surveys
    Experimental and Quasi-Experimental Designs
    Factors To Be Eliminated as Contributors to Program Results
    Schematics for Experimental and Quasi-Experimental Designs
    Examples of Experimental Designs
    Examples of Quasi-Experimental Designs
    Converting Data on Behavior Change into Data on Morbidity and Mortality
    Converting Data on Behavior Change into Data on Cost Savings
    Summary of Quantitative Methods
Tables
    2. Qualitative Methods of Evaluation
    3. Advantages and Disadvantages of Methods of Administering Survey Instruments
    4. Relative Risk for Death or Moderate-to-Severe Injury in a Car Crash
    5. Quantitative Methods Used in Evaluation
METHODS OF EVALUATION

QUALITATIVE METHODS

INTRODUCTION

Because qualitative methods are open-ended, they are especially valuable at the formative stage of evaluation, when programs are pilot testing proposed procedures, activities, and materials. They allow the evaluator unlimited scope to probe the feelings, beliefs, and impressions of the people participating in the evaluation and to do so without prejudicing participants with the evaluator's own opinions. They also allow the evaluator to judge the intensity of people's preference for one item or another.

Qualitative methods are also useful for testing plans, procedures, and materials if a problem arises after they are in use. Using these methods, evaluators can usually determine the cause of any problem. Armed with knowledge about the cause, program staff can usually correct problems before major damage is done.

For example, let us say you put an advertisement in the local newspaper offering smoke detectors to low-income people. Not as many people respond as you expected, and you want to know why. Conducting formative evaluation using qualitative methods will usually reveal the reason. Perhaps the advertisement cannot be understood because the language is too complex, perhaps your target population seldom reads newspapers, perhaps most people in the target population cannot go to the distribution location because it is not on a public transportation line, or perhaps the problem is due to some other factor. Whatever the cause, once you learn what the problem is, you are in a position to remedy it.

In this section, we describe three methods of conducting qualitative research: personal interviews, focus groups, and participant-observation.
Each has advantages and disadvantages.

PERSONAL INTERVIEWS

In-depth personal interviews with broad, open-ended questions are especially useful when the evaluator wants to understand either 1) the strengths and weaknesses of a new or modified program before it is in effect or 2) the cause of a problem should one develop after the program is in effect. Relatively unstructured personal interviews with members of the target population allow interviewees to express their point of view about a program's good and bad points without being prejudiced by the evaluator's own beliefs. Open-ended questions allow interviewees to focus on points of importance to them, points that may not have occurred to the evaluator. Personal interviews are particularly important when the target population differs in age, ethnicity, culture, or social background from program staff and when the program staff has a different professional background from those directing the program. Through the interview, the interviewee becomes a partner in, rather than the object of, the evaluation.5

The interviewer's objective is to have as much of the conversation as possible generated spontaneously by the interviewee. For this reason, interviewers must avoid questions that can be answered briefly.

Personal interviews are the most appropriate form of qualitative evaluation when the subject is sensitive, when people are likely to be inhibited speaking about the topic in front of strangers, or when bringing a group of people together is difficult (e.g., in rural areas).

Personal interviews should be audiotaped and transcribed verbatim. Most commonly, evaluators analyze the results of personal interviews by looking through the transcripts for insightful comments and common themes. They then give a written report to program management.
Thus, the interviewees' words become the evaluation data, with direct quotes serving as useful supporting evidence of the evaluators' assessments.

Examples of open-ended questions to ask during personal interviews begin on page 76. See also the focus group questions (page 81), many of which are suitable for personal interviews.

FOCUS GROUPS

Focus groups serve much the same function as personal interviews. The main difference is that, with focus groups, the questions are asked of groups. Ideally these groups comprise four to eight people who are likely to regard each other as equals.6 A feeling of equality allows all members of the group to express their opinions freely. Focus groups have an advantage over individual interviews because the comments of one participant can stimulate the thoughts and ideas of another. You must conduct several focus groups because different combinations of people yield different perspectives. The more views expressed, the more likely you are to develop a good understanding of whatever situation you are investigating.

As with personal interviews, focus-group discussions should be audiotaped and transcribed verbatim. The evaluator looks for insightful comments and common threads both within groups and across groups and uses direct quotes as the evaluation data. Also as with personal interviews, evaluators analyze the data and prepare a written report for program management. Many of the same questions may be used for personal interviews and for focus groups.

On page 81 are examples of questions that might be used with focus groups during formative evaluation of a program.

PARTICIPANT-OBSERVATION

Evaluation by participant-observation involves having members of the evaluation team participate (to the degree possible) in the event being observed, look at events from the perspective of a participant, and make notes about their experiences and observations.
Aspects to observe include physical barriers for participants, smoothness of program operation, areas of success, and areas of weakness. Observers should be unobtrusive and ensure that their activities do not disrupt the program. They should be alert, trained in observational methods, and aware of the type of observations of greatest importance to the program evaluation.

Participant-observation is particularly valuable to the study of behavior for several reasons:
A major disadvantage of participant-observation is that it is time consuming for the evaluator. Examples of events to observe begin on page 89.

GENERAL INFORMATION

Who To Interview, Invite to Focus Groups, or Observe: If you are evaluating your program's methods, procedures, activities, or materials, select people similar to those your program is trying to reach. Indeed, you could even select members of the target population itself, if that is possible.

If you are conducting formative evaluation because a large group of people dropped out of the program or refused to join the program, then select people from that group to interview, observe, or invite to focus groups. They are the people most likely to provide information about aspects of the program that need correction.

Number of People To Interview, Focus Groups To Conduct, or Events To Observe: The number depends on the size and diversity of the target population.7 The larger and more diverse the target population, the more interviews, focus groups, or observations are needed. In all cases, the more interviews, observations, or focus groups you conduct, the more likely you are to get an accurate picture of the situation you are investigating.

Trained Evaluator: For several reasons, all qualitative evaluation must be conducted by people trained in the particular method (interview, focus group, or participant-observation) being used:
See Table 2 for a summary of qualitative methods of evaluation, including the advantages and disadvantages of each.
Table 2. Qualitative Methods of Evaluation

Personal Interviews
    Purpose: To have individual, open-ended discussion on a range of issues. To obtain in-depth information on an individual basis about perceptions and concerns.
    Number of People To Interview or Events To Observe: The larger and more diverse the target population, the more people must be interviewed.
    Resources Required: Trained interviewers; written guidelines for interviewer; recording equipment; a transcriber; a private room.
    Advantages: Can be used to discuss sensitive subjects that interviewee may be reluctant to discuss in a group. Can probe individual experience in depth. Can be done by telephone.
    Disadvantages: Time consuming to conduct interviews and analyze data. Transcription can be time-consuming and expensive. Participants are one-on-one with interviewer, which can lead to bias toward "socially acceptable" or "politically correct" responses.

Focus Groups
    Purpose: To have an open-ended group discussion on a range of issues. To obtain in-depth information about perceptions and concerns from a group.
    Number of People To Interview or Events To Observe: 4 to 8 interviewees per group.
    Resources Required: Trained moderator(s); appropriate meeting room; audio and visual recording equipment.
    Advantages: Can interview many people at once. Response from one group member can stimulate ideas of another.
    Disadvantages: Individual responses influenced by group. Transcription can be expensive. Participants choose to attend and may not be representative of target population. Because of group pressure, participants may give "politically correct" responses. Harder to coordinate than individual interviews.

Participant-Observation
    Purpose: To see firsthand how an activity operates.
    Number of People To Interview or Events To Observe: The number of events to observe depends on the purpose. To evaluate people's behavior during a meeting may require observation of only one event (meeting). But to see if products are installed correctly may require observation of many events (installations).
    Resources Required: Trained observers.
    Advantages: Provides firsthand knowledge of a situation. Can discover problems the parties involved are unaware of (e.g., that their own actions in particular situations cause others to react negatively). Can determine whether products are being used properly (e.g., whether an infant car seat is installed correctly). Can produce information from people who have difficulty verbalizing their points of view.
    Disadvantages: Can affect activity being observed. Can be time consuming. Can be labor intensive.
QUANTITATIVE METHODS

INTRODUCTION

Quantitative methods are ways of gathering objective data that can be expressed in numbers (e.g., a count of the people with whom a program had contact or the percentage of change in a particular behavior by the target population). Quantitative methods are used during process, impact, and outcome evaluation. Occasionally, they are used during formative evaluation to measure, for example, the level of participant satisfaction with the injury prevention program.

Unlike the results produced by qualitative methods, results produced by quantitative methods can be used to draw conclusions about the target population. For example, suppose we find that everyone in a focus group (randomly selected from bicyclists in the target population) wears a helmet while riding. We cannot then conclude that all bicyclists in the target population wear helmets. However, if instead of a focus group we conducted a valid survey (a quantitative method) and found that 90% of respondents wear helmets while bicycling, we could then estimate that the percentage of bicyclists who wear helmets in the target population is in the 85% to 95% range.

Next we will explain four quantitative methods: counting systems, surveys, experimental designs, and quasi-experimental designs. We will also describe a method for converting quantitative data on changes in behavior by the target population into estimates of changes in morbidity and mortality (page 64) and into estimates of financial savings per dollar spent on your program (page 66).

COUNTING SYSTEMS

A counting system is the simplest method of quantifying your program's results and merely involves keeping written records of all events pertinent to the program (e.g., each contact with a member of the target population or each item distributed during a product-distribution program). Counting systems are especially useful for process evaluation (see page 27).
Simply design and use forms on which you can record all pertinent information about each program event (see Appendix B for sample forms).

SURVEYS

Description: A survey is a systematic, nonexperimental method of collecting information that can be expressed numerically.

Conducting a Survey: Surveys may be conducted by interview (in person or on the telephone) or by having respondents complete, in private, survey instruments that are mailed or otherwise given to them. Which method to use is determined by the objectives of the survey. For example, if you want to survey businesses or public agencies, the telephone may be best because staff from those organizations are readily accessible by telephone. On the other hand, if you want to survey people who received a free smoke detector, personal visits to their homes may be best since many people in poor areas do not have telephones. In this example, personal visits also have the advantage of allowing you to observe whether the smoke detectors are installed and working properly.

Response rates are generally highest for personal interviews, but telephone and mail surveys allow more anonymity. Therefore, respondents are less likely to bias their responses toward what they believe to be socially acceptable or "politically correct." Telephone surveys are the quickest to conduct and are useful during the development of a program. However, households with telephones are not representative of all households. Indeed, the people we most want to reach with public health programs are often the people most likely not to have telephones.

Purpose of Surveys: While a program is under development, surveys have several uses:
Selecting the Survey Population: Whom to survey depends in part on the purpose of the survey. To evaluate the level of consumer satisfaction with the program, the survey population may be selected from among those who use the program. To learn about barriers that prevent people from using the program, select a survey population from among people who are eligible to use the program but do not. Before the program is in effect, select a representative sample of the entire target population to determine what they like or dislike about the program's proposed procedures, materials, activities, and methods.

In all cases, you will need a complete list of the people or households targeted by the program. Such a list is called a sampling frame. From the sampling frame, you may select the people to be surveyed using statistical techniques such as random sampling, systematic sampling, or stratified sampling. You must use stratified sampling if you want a representative sample of both those who participate in the program and those who do not. A full discussion of sampling techniques is outside the scope of this book. However, several textbooks (e.g., Measurement and Evaluation in Health Education and Health Promotion2) can provide you with information on sampling methods.

Survey Instruments: A survey instrument is the tool used to gather the survey data. The most common one is the questionnaire. Other instruments include checklists, interview schedules, and medical examination record forms.

Methods for Administering Survey Instruments: Before designing a survey instrument, you must decide on the method you will use to administer it, because the method will dictate certain factors about the instrument (length, complexity, and level of language).
For example, instruments designed to be completed by the respondent without an interviewer (i.e., self-administered) must be shorter and easier to follow than those to be administered by a trained interviewer.

There are three methods for administering survey instruments: personal interview, telephone interview, or distribution (e.g., through the mail) to people who complete and return the questionnaire to the program. The advantages and disadvantages of each method are laid out in Table 3.

The best method to use depends on the purpose of the evaluation and the proposed respondents to the survey. Let's say, for example, you want to evaluate a training program. If class participants have a moderate level of education, having them complete and return a questionnaire before they leave the classroom is clearly the least expensive and most efficient method. On the other hand, if class participants have problems reading, a questionnaire to be completed in class would not be useful, and you may need to conduct personal interviews.

Likewise, if you are evaluating a program to distribute smoke detectors in a well-defined, low-income housing area, you may need to interview. In this case, face-to-face interviews would be better than telephone interviews, since income is an issue and some poor people do not have telephones.
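Once a sampling frame is in hand, the simple and stratified sampling techniques described above can be sketched in a few lines of code. The following Python sketch is illustrative only; the sampling frame, household identifiers, and stratum definition (program participant versus nonparticipant) are hypothetical, not taken from this text:

```python
import random

def simple_random_sample(frame, n, seed=0):
    """Draw n units at random from the sampling frame."""
    rng = random.Random(seed)
    return rng.sample(frame, n)

def stratified_sample(frame, stratum_of, n_per_stratum, seed=0):
    """Draw a fixed number of units from each stratum, e.g., to represent
    both program participants and nonparticipants."""
    rng = random.Random(seed)
    strata = {}
    for unit in frame:
        strata.setdefault(stratum_of(unit), []).append(unit)
    return {name: rng.sample(units, min(n_per_stratum, len(units)))
            for name, units in strata.items()}

# Hypothetical sampling frame: (household id, participated in program?)
frame = [(i, i % 4 == 0) for i in range(200)]

print(len(simple_random_sample(frame, 25)))             # 25 households drawn
sample = stratified_sample(frame, lambda h: h[1], 10)
print({name: len(units) for name, units in sample.items()})
```

Systematic sampling (taking every kth unit after a random start) could be added in the same style; the key point is that every draw comes from the complete sampling frame, not from whoever is easiest to reach.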
Table 3. Advantages and Disadvantages of Methods of Administering Survey Instruments

Personal interviews
    Advantages: Least selection bias: can interview people without telephones, even homeless people. Greatest response rate: people are most likely to agree to be surveyed when asked face-to-face.8 Visual materials may be used.
    Disadvantages: Most costly: requires trained interviewers and travel time and costs. Least anonymity: therefore, most likely that respondents will shade their responses toward what they believe is socially acceptable.

Telephone interviews
    Advantages: Most rapid method. Most potential to control the quality of the interview: interviewers remain in one place, so supervisors can oversee their work. Easy to select telephone numbers at random. Less expensive than personal interviews. Better response rate than for mailed surveys.
    Disadvantages: Most selection bias: omits homeless people and people without telephones. Less anonymity for respondents than for those completing instruments in private. As with personal interviews, requires a trained interviewer.

Instruments to be completed by respondent
    Advantages: Most anonymity: therefore, least bias toward socially acceptable responses. Cost per respondent varies with response rate: the higher the response rate, the lower the cost per respondent. Less selection bias than with telephone interviews.
    Disadvantages: Least control over quality of data. Dependent on respondent's reading level. Mailed instruments have lowest response rate. Surveys using mailed instruments take the most time to complete because such instruments require time in the mail and time for respondent to complete.
General Guidelines for Survey Instruments: When designing a survey instrument, keep in mind that it must appeal as much as possible to the people you hope will respond:
Steps Involved in Designing Survey Instruments: Instrument design is a multistep process, and the steps need to be done in order.

1. Clearly define the population you want to survey. (See page 15, A Description of the Target Population.)

2. Choose the method you will use to administer the survey. (See page 46 for more information.)

3. Develop the survey items meticulously. Survey items are the questions or statements in the survey. Items that are closed-ended are easiest for respondents to complete and least subject to error. Closed-ended items are multiple-choice, scaled, or questions answerable by yes or no or by true or false. (See page 104 for examples.)

4. Put items in correct order. Begin with the least sensitive items and gradually build to the most sensitive. Respondents will not answer sensitive questions until they are convinced of the survey's purpose and have developed a rapport with the "person behind the survey" (the person or group they believe is requesting the information).

Demographic questions such as those about age, education, ethnicity, marital status, and income can be sensitive. For this reason, these questions should be at the end. Not only are they more likely to be answered then, but when a survey has solicited intimate or emotional information, the demographic questions draw respondents' attention away from the survey's subject matter and back to everyday activities.

Survey items should progress from general to specific, which eases respondents into a subject and therefore increases the likelihood that they will answer and do so accurately and truthfully. If the survey instrument covers several subjects (e.g., seatbelt use, speeding, and driving while intoxicated), the survey items for each subject should be grouped together, again progressing from general to specific within each group. Put the least sensitive subject first and the most sensitive last.

5. Give the survey instrument an appropriate title.
This step is particularly important for survey instruments to be completed by the respondent, since the title is the respondent's first impression of the group collecting the information. To increase the number of responses you get, emphasize the importance of the survey in the title and show any relationship between your injury prevention program and the people you want to respond to the questionnaire. Examples of good titles are "Survey of the Health Needs of Our Community" and "Survey of Your Level of Satisfaction with Our Services."

6. Assess the reliability of the survey instrument. This step involves measuring the degree to which the results obtained by the survey instrument can be reproduced. Assess reliability by one of three methods: 1) determine the stability of the responses given by a respondent, 2) determine the equivalence of responses by one respondent to two different forms of the questionnaire, or 3) determine the internal consistency of the instrument, which is the degree to which all questions in the questionnaire are measuring the same thing.

Following are details on the three methods:
7. Assess the validity of the survey instrument. Validity is the degree to which the instrument measures what it purports to measure. For example, how well data on seatbelt use gathered from questionnaires completed by respondents agree with actual seatbelt use reflects the questionnaire’s degree of validity. Clearly, if data produced by responses to a questionnaire—in this example, the extent of self-reported seatbelt use—cannot be reproduced using a more direct method of gathering data (e.g., counting the number of people who are actually wearing seatbelts), then the questionnaire is not valid. There are three main types of validity: face validity, content validity, and construct validity.
If no related survey instruments exist, establish construct validity through hypothesis testing. For example, if you developed a survey instrument to determine how often people exceed the speed limit, you could hypothesize that people who most frequently exceed the speed limit are likely to have more traffic citations than people who do not often exceed the speed limit. You could then gather traffic citation data and determine whether the people identified by the survey instrument as the most frequent speeders had more citations, as hypothesized.

8. Pilot test the survey instrument. Before an instrument can be used on the entire target population, you must pilot test it on a group of people similar to the target population or, preferably, on a small group within the target population. The purpose is to determine whether the survey instrument is effective for use with the people who are potential respondents. The evaluator's job is to find out if any survey items are confusing, ambiguous, or phrased in language unfamiliar to the intended audience. The evaluator will also determine if certain words differ in meaning from one ethnic group to the next and if certain questions are insensitive to the feelings of many people in the target population.

If the survey instrument is not significantly modified as a result of the pilot test (a rare event), the information gathered from the people who participated in the pilot test can be added to the information obtained from the people in the full survey.

9. Modify. At each step of the design, modify survey items and the survey instrument itself on the basis of information gathered at that step, particularly information gathered during the pilot test.

Many good references are available on the design of survey instruments (see "Bibliography," page 117).

EXPERIMENTAL AND QUASI-EXPERIMENTAL DESIGNS

Introduction: In this section, we discuss research designs that you can use during several stages of evaluation:
How you operate your program will be influenced by how you plan to evaluate it. If you use an experimental or quasi-experimental design, impact and outcome evaluation will be a breeze because, in effect, you will be operating and evaluating the program at the same time.

Experimental Designs: The best designs for impact and outcome evaluation are experimental designs. Evaluation with an experimental design produces the strongest evidence that a program contributed to a change in the knowledge, attitudes, beliefs, behaviors, or injury rates of the target population. The key factor in experimental design is randomization: evaluation participants are randomly assigned to one of two or more groups. One or more groups will receive an injury intervention, and the other group(s) will receive either no intervention or a placebo intervention. The effects of the program are measured by comparing the changes in the various groups' knowledge, attitudes, beliefs, behaviors, or injury rates. Randomization ensures that the various groups are as similar as possible, thus allowing evaluators of the program's impact and outcome to eliminate factors outside the program as reasons for changes in program participants' knowledge, attitudes, beliefs, behavior, or injury rates. See "Factors To Be Eliminated as Contributors to Program Results" (page 54) for a full discussion.

Difficulties with Experimental Designs: Although experimental designs are ideal for program evaluation, they are often difficult (sometimes impossible) to set up. The difficulty may be due to logistical problems, budgetary limitations, or political circumstances.

To demonstrate the difficulties, let us consider the example of introducing a curriculum on bicycle safety for third graders at a certain school. Selecting children at random to participate in the program would cause many problems, including the following:
In addition, evaluation of the program's effectiveness would be compromised if children in the safety class shared information with the children who were not in the safety class.

Another difficulty with experimental designs is that participants must give their informed consent. People who willingly agree to participate in a program in which they may not receive the injury intervention are probably different from people in the general population. Therefore, program effects shown through evaluation involving randomized studies may not be generalizable (i.e., they may not reflect the probable effects for all people).

For example, let us suppose you want to test how effective a bicycle rodeo is at getting bicyclists to wear helmets. You ask a random sample of 500 children who do not own bicycle helmets to attend a bicycle rodeo you have organized for the following Saturday morning. Let's say 300 agree to go. The 200 who do not agree are probably different from the 300 who do agree: perhaps the 200 who do not agree have other activities on Saturday morning (if they are poor, they may work; if they are rich, they may go horseback riding), or they may be rebellious and refuse to listen to adults, or they may believe bicycle helmets and bicycle rodeos are not "cool," or they may have some other reason. Whatever the reason, it makes those who refuse to participate in the study different from those who agree. And because of that difference, the results of your study will not be generalizable to the whole population of children who do not wear bicycle helmets.

Quasi-Experimental Designs: Because of the difficulties with experimental designs, programs sometimes use quasi-experimental designs. Such designs do not require that participants be randomly assigned to one or another group.
Instead, the evaluator selects a whole group (e.g., a third-grade class in one school) to receive the injury intervention and another group (e.g., the third-grade class in a different school) as the comparison or control group.

As an alternative, if a suitable comparison group cannot be found, the evaluator could take multiple measurements of the intervention group before providing the intervention.

When using quasi-experimental designs with comparison groups, evaluators must take extra care to ensure that the intervention group is similar to the comparison group, and they must be able to describe the ways in which the groups are not similar.

FACTORS TO BE ELIMINATED AS CONTRIBUTORS TO PROGRAM RESULTS

Events aside from the program can produce changes in the knowledge, attitudes, beliefs, and behaviors of your program's target population, thus making your program seem more successful than it actually was. Therefore, anyone evaluating an injury prevention program's success must guard against assuming that all change was produced by the program. Experimental designs minimize (i.e., decrease to the least possible amount) the effects of outside influences on program results; quasi-experimental designs reduce those effects.

The two main factors evaluators must guard against are history and maturation.

History: What may seem like an effect produced by your program, an apparent impact, may often be more accurately attributed to history if the people who participate in your program are different from those who do not. For example, suppose you measured bicycle-helmet use among students at a school that had just participated in your injury prevention program and also at a school that did not participate. Let us say that more students wore helmets at the school with your program.
You have not demonstrated that your program was the reason for the difference in helmet use unless you can show that the students at the school with the program did not wear helmets any more frequently before the bicycle-helmet program began than did the students at the school without the program. In other words, you must show that the students at the school with the program did not have a history of wearing helmets more often than did the students at the school without the program.

Maturation: Sometimes events outside your program cause program participants to change their knowledge, attitudes, beliefs, or behavior while the program is under way. Such a change would be due to maturation, not to the program itself. For example, suppose you measured occupant-restraint use by the 4- and 5-year-olds who attended a year-long Saturday safety seminar, both when they began the seminar and when they completed it. Let us say that the children used their seatbelts more frequently after attending the program. You have not demonstrated that the program was effective unless you can also show that seatbelt use by a similar group of 4- and 5-year-olds did not increase just as much simply as a result of other events (e.g., the children's increased manual dexterity due to development, exposure to a children's television series about using seatbelts that ran at the same time as the seminar, or a new seatbelt law that went into effect during the course of the seminar).

SCHEMATICS FOR EXPERIMENTAL AND QUASI-EXPERIMENTAL DESIGNS

Introduction: The steps involved in the various experimental and quasi-experimental designs are presented verbally and then in schematic form. In each schematic, we use the same symbols:

R  = Randomization
O1 = The first, or baseline, observation (e.g., results of a survey to measure the knowledge, attitudes, beliefs, behaviors, or injury rates of the target population)
O2 = The second observation (O3 = the third, etc.)
X = Intervention
P = Placebo (usually in parentheses to indicate that a placebo may or may not be used)

The schematic for each intervention and comparison group is shown on a separate line. For example,

O1 X O2

means that there is only one group (one line), that the group is observed for a baseline measurement (O1), provided with the intervention (X), and observed again (O2) to measure any changes.

Another example:

R O1 X O2
R O1 (P) O2

means that people are randomly assigned [R] to one of two groups [two lines]. Both are observed for baseline measurements [O1]. One is provided with the injury intervention [X]; the other may or may not get a placebo intervention [(P)]. Both groups are observed again [O2] for any change.

A placebo is a service, activity, or program material (e.g., a brochure) that is similar to the intervention service, activity, or material but without the characteristic of the intervention that is being evaluated. For example, to test the effectiveness of the content of a brochure about the value of installing smoke detectors, the intervention group would be given the brochure to read and discuss with the evaluator, and the comparison group might be given a brochure on bicycle helmets to read and discuss with the evaluator. To ensure that the placebo conditions are comparable with those of the intervention, evaluators should give the same amount of time and attention to the comparison group as they give to the intervention group.

EXAMPLES OF EXPERIMENTAL DESIGNS

Pretest-Posttest-Control Group Design: Scientists often call this design a true experiment or a clinical trial. These are the steps involved:

1. Recruit people for the evaluation.
2. Randomly assign each person [R] to one of two groups: one group will receive the injury intervention [X] and the other will not [(P)]. To select at random, use a computer-generated list of random numbers, a table of random numbers (found at the back of most books on basic statistics), or the toss of a coin.
3. Observe (measure) each group's knowledge, attitudes, beliefs, behaviors, injury rate, or any other characteristic of interest [O1]. You could use a survey (page 44), for example, to make this measurement.
4. Provide the program service (the intervention) [X] to one group and no service or a placebo service [(P)] to the other group.
5. Again, observe (measure) each group's knowledge, attitudes, beliefs, behaviors, injury rates, or whatever other characteristic you measured before providing the program service [O2].

The schematic for the pretest-posttest-control group design is as follows:

R O1 X O2
R O1 (P) O2

The effect of the program is the difference between the change from pretest [O1] to posttest [O2] for the intervention [X] group and the change from pretest [O1] to posttest [O2] for the comparison [(P)] group.

To clarify, let's take a hypothetical example of a study you might conduct during formative evaluation. Suppose you want to pilot test a proposed brochure designed to increase people's awareness that working smoke detectors save lives.

1. Select a group of people at random from the target population. This group is your study [evaluation] population.
2. Randomly assign each person in the study population either to the intervention group or to the comparison group.
3. Test each group to see what the members know about smoke detectors.
4. Decide whether to give a placebo to the comparison group.
5. Show the proposed brochure on smoke detectors only to intervention group members and allow them time to study it. If a placebo is used, show a brochure, perhaps on bicycle helmets, to the comparison group members and allow them to study it. Give the same amount of time and attention to each group.
6. To see if their awareness has increased, test each group again to measure how much they now know about smoke detectors.

Unless the proposed brochure is a dud, the intervention group's awareness of the benefits of smoke detectors will increase.
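As a sketch of how the random assignment and the program-effect calculation above might be carried out (the group sizes and awareness scores below are hypothetical, not from this manual):

```python
import random

def program_effect(interv_pre, interv_post, comp_pre, comp_post):
    """Pretest-posttest-control group design: program effect is the change
    in the intervention group minus the change in the comparison group."""
    mean = lambda xs: sum(xs) / len(xs)
    return (mean(interv_post) - mean(interv_pre)) - (mean(comp_post) - mean(comp_pre))

# Step 2: randomly assign recruits [R] to the two groups.
recruits = [f"person_{i}" for i in range(20)]
random.shuffle(recruits)
intervention_group = recruits[:10]   # will receive the brochure [X]
comparison_group = recruits[10:]     # will receive nothing or a placebo [(P)]

# Hypothetical awareness scores (0-10) at pretest [O1] and posttest [O2].
effect = program_effect(
    interv_pre=[4, 5, 3, 4], interv_post=[8, 9, 7, 8],   # rose by 4 points
    comp_pre=[4, 4, 5, 3], comp_post=[5, 5, 6, 4],       # rose by 1 point
)
print(effect)  # → 3.0
```

The subtraction of the comparison group's change is what removes gains that both groups would have made anyway, such as the placebo effect discussed next.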
However, the comparison group's test scores might also increase because of the placebo effect. For example, the comparison group might develop a rapport with the evaluators and want to please them, thus causing group members to put more thought into their responses during the second observation than they did during the first. In addition, just completing the survey at the first observation may cause them to think or learn more about smoke detectors and give better answers during the second observation.

The effect of the brochure is the difference between the change (usually an increase) in the intervention group's awareness and the change (if any) in the comparison group's awareness.

Variations on the Pretest-Posttest-Control Group Design: There are several variations on the pretest-posttest-control group design.

The pretest-posttest-control group-follow-up design is used to determine whether the effect of the program is maintained over time (e.g., whether people continue to wear seatbelts months or years after a program to increase seatbelt use is over). This design involves repeating the posttest at scheduled intervals. The schematic for this design is as follows:

R O1 X O2 O3 O4
R O1 (P) O2 O3 O4

For example, suppose you want to test the effectiveness of counseling parents about infant car seats when parents bring their infants to a pediatrician for well-child care. First, select a target population for the evaluation (e.g., all the parents who seek well-child care during a given week). Then, observe (measure) the target population's use of safety seats [O1]. Next, randomly assign some parents to receive counseling about car safety seats [X] and the remaining parents to receive a placebo (e.g., counseling on crib safety) [(P)].
At regular intervals after the counseling sessions, observe each group's use of infant car seats to see how well the effect of the program is maintained over time (let's say, 3 months [O2], 6 months [O3], and 9 months [O4]).

The cross-over design is used when everyone eligible to participate in a program must receive the intervention. Again, participants are randomly divided into two groups. Both groups are tested, but only one receives the intervention. At regular intervals, both groups are observed to see what changes (if any) have occurred in each group. After several observations, the second group receives the intervention, and both groups continue to be observed at regular intervals. Below is an example schematic for this design:

R O1 X O2 O3 O4 O5 O6 O7
R O1 O2 O3 O4 X O5 O6 O7

A program is effective if the effect being measured (e.g., an increase in knowledge) changes for Group 1 after the first observation and for Group 2 after the fourth observation.

For example, suppose you wanted to evaluate whether children who took a fire-safety class presented by the fire department had better fire-safety skills than children who did not take the class. To conduct such an evaluation you could, for example, test the fire-safety skills of all the children in the third grade of the local elementary school, then randomly select half of the children (Group 1) to attend the fire-safety class on September 15. You would test the fire-safety skills of all the children again on, say, October 15, November 15, and December 15. In January the other half of the class (Group 2) would attend the fire-safety class. You would again test the fire-safety skills of all the children on January 15, February 15, and March 15. If the class were to increase the children's fire-safety skills, the results of the evaluation might look something like this.
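The original figure of results is not reproduced here, but the hypothetical scores below sketch the pattern the cross-over design looks for: each group's skills jump only after that group attends the class.

```python
# Hypothetical mean fire-safety test scores at each observation.
# Group 1 attends the class after the September baseline [O1];
# Group 2 attends in January, after the December observation [O4].
months = ["Sep", "Oct", "Nov", "Dec", "Jan", "Feb", "Mar"]
group1 = [50, 80, 79, 81, 80, 82, 81]   # rises after the September class
group2 = [50, 51, 49, 52, 78, 80, 79]   # flat until the January class

for month, g1, g2 in zip(months, group1, group2):
    print(f"{month} 15: Group 1 = {g1}, Group 2 = {g2}")
```

The program's effect shows up twice: once when Group 1's scores rise and again, months later, when Group 2's scores rise, which is what makes the design convincing without a permanent control group.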
The Solomon four-group design is useful when the act of measuring people's pre-program knowledge, attitudes, beliefs, or behaviors (getting baseline measurements) may affect the program's goals in one or both of the following ways: the baseline measurement itself may change participants' knowledge, attitudes, beliefs, or behaviors, or it may prime participants to be more receptive to the program's information.
To compensate for those possibilities, this design expands the pretest-posttest-control group design from two groups (one intervention and one control) to four groups (two intervention and two control). To separate the effect of getting a baseline measurement from the effect produced by the program, the evaluator takes baseline measurements of only one intervention group and one control group. The four groups are distinguished from one another as shown below:

Group 1: Provides baseline measurement and receives the intervention. [R O1 X O2]
Group 2: Provides baseline measurement and receives nothing or a placebo. [R O1 (P) O2]
Group 3: Provides no baseline measurement and receives the intervention. [R X O2]
Group 4: Provides no baseline measurement and receives nothing or a placebo. [R (P) O2]

Since the only difference between Groups 2 and 4 is that Group 2 provided a baseline measurement and Group 4 did not, the evaluator can compare the posttest results (O2) of Group 2 with those of Group 4 to determine the effect of taking a baseline observation (O1). Similarly, since the only difference between Group 1 and Group 3 is whether they provided a baseline measurement, evaluators can compare their posttest results (O2) to determine whether providing a baseline measurement primed program participants to be more interested in the program's information, thus increasing the program's effectiveness.

The schematic for the Solomon four-group design is as follows:

R O1 X O2
R O1 (P) O2
R X O2
R (P) O2

Unfortunately, however, since this variation increases the number of people required for the study, it also increases the study's cost, time, and complexity. As a result, people who are willing to participate in an evaluation with this design may be even less representative of the general population than people who would participate in an evaluation with a less complex, randomized design.

EXAMPLES OF QUASI-EXPERIMENTAL DESIGNS

Here are some examples of quasi-experimental designs.
These designs are useful when a randomized (experimental) design is not possible.

Nonequivalent Control Group Design: Sometimes it is difficult to introduce an injury-prevention program to some people and not to others (e.g., it is impossible to be sure that a radio campaign will reach only certain people in a town and not others). In such a case, the nonequivalent control group design is useful. It is similar to the pretest-posttest-control group design except that individual participants are not randomly assigned to separate groups. Instead, an entire group is selected to receive the program service and another group not to receive it. For example, a radio campaign could be run in one town but not in a similar town some distance away.

For this example, it is important to select two groups that are well separated geographically in order to reduce the likelihood that the effect of the injury intervention will spill over to the people who are not to receive the intervention. As the name of the design indicates, without randomization the groups will never be equivalent; however, they should be as similar as possible with respect to factors that could affect the impact of the program.

As with the pretest-posttest-control group design, pretest each group [O1]; the result of the pretest shows the degree to which the two groups are not equivalent. Next, provide the intervention to one group [X] and a placebo or nothing [(P)] to the other. Then posttest each group [O2]. The evaluator must look at history, in particular, as a possible way in which the two groups are not equivalent. See page 54 for a discussion of history as an explanation for change. The schematic for this design is as follows:

O1 X O2
O1 (P) O2

Time Series Design: Sometimes it is impossible to have a control group that is even marginally similar to the intervention group (e.g., when a state program wants to evaluate the effect of a new state law).
Although other states may be willing to act as comparison groups, finding a willing state that is similar with respect to legislation, population demographics, and geography is not easy. Furthermore, it is difficult to control the collection of evaluation data by a voluntary collaborator, and even more difficult to provide funding to the other state.

The time series design attempts to control for the effects of maturation when a comparison group cannot be found. Maturation is the effect that events outside the program have on program participants while the program is under way. See page 55 for a full discussion of maturation.

To minimize the effect of maturation on program results, take multiple measurements (e.g., O1 through O4) of program participants' knowledge, attitudes, beliefs, or behaviors before an injury-prevention program begins and enter those measurements into a computer. Then, using special software, you can predict the future trend of those measurements were the program not to go into effect. After the program is over, again take multiple measurements (e.g., O5 through O8) of program participants' knowledge, attitudes, beliefs, or behaviors to determine how much the actual post-program trend differs from the trend predicted by the computer. If the actual trend in participants' knowledge, attitudes, beliefs, or behaviors during the course of the program is statistically different from the computer-predicted trend, then you can conclude that the program had an effect.

The major disadvantage to this design is that it does not completely rule out the effect of outside events that occur while the program is under way. For example, this design would not separate the effect of a new law requiring bicyclists to wear helmets from the effect of increased marketing by helmet manufacturers. Although this design cannot eliminate the effects of outside events, it does limit them to those that are introduced simultaneously with the injury-prevention program.
The schematic for this design is as follows:

O1 O2 O3 O4 X O5 O6 O7 O8

Multiple Time Series Design: This design combines the advantages of the nonequivalent control group design (page 61) with those of the time series design (page 62): the effects of history on program results are reduced by taking multiple baseline measurements, and the effects of maturation are reduced by the combined use of 1) a comparison group and 2) predicted trends in baseline measurements. As with the nonequivalent control group design, a disadvantage of this design is that the groups are not strictly equivalent and may be exposed to different events that could affect results. The schematic for this design is as follows:

O1 O2 O3 O4 X O5 O6 O7 O8
O1 O2 O3 O4 O5 O6 O7 O8
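The trend prediction that the time series design relies on can be sketched with a simple least-squares line fitted to the baseline observations (the manual's "special software" would add a statistical test; the helmet-use percentages below are hypothetical):

```python
def fit_trend(baseline):
    """Fit a least-squares line through pre-program observations O1..On
    and return a function that predicts the value at observation t."""
    n = len(baseline)
    xs = range(1, n + 1)
    x_bar = sum(xs) / n
    y_bar = sum(baseline) / n
    slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, baseline))
             / sum((x - x_bar) ** 2 for x in xs))
    intercept = y_bar - slope * x_bar
    return lambda t: intercept + slope * t

# Hypothetical % of cyclists wearing helmets at O1..O4, before the program.
baseline = [10.0, 12.0, 14.0, 16.0]
predict = fit_trend(baseline)

# Predicted values at O5..O8 had the program not taken place:
predicted = [predict(t) for t in range(5, 9)]   # [18.0, 20.0, 22.0, 24.0]

# Actual post-program observations O5..O8 (hypothetical):
actual = [25.0, 29.0, 33.0, 36.0]
gaps = [a - p for a, p in zip(actual, predicted)]
print(gaps)  # consistently positive gaps suggest a program effect
```

A real evaluation would also test whether the gap between the actual and predicted trends is statistically significant, rather than judging it by eye.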
CONVERTING DATA ON BEHAVIOR CHANGE INTO DATA ON MORBIDITY AND MORTALITY

You can convert data on changes in the behavior your program was designed to modify into estimates of changes in morbidity and mortality if you know the effectiveness of the behavior in reducing morbidity and mortality.

As an example, let us suppose your program was designed to increase seatbelt use. Let us also suppose that you counted the number of people wearing seatbelts at a random selection of locations around your city both before and after the program. You found that 20% more people in large cars and 30% more people in small cars are wearing seatbelts after the program than before.

To convert that 20% increase in seatbelt use (for people in large cars) to a decrease in deaths and injuries, you will need two sets of information: the relative risk for death or moderate-to-severe injury for people with and without seatbelts (Table 4), and the number of people who died or were severely injured in car crashes during the year before the program began. In our example, both sets of information are available.
Table 4. Relative Risk for Death or Moderate-to-Severe Injury in a Car Crash [10]

Car Size            | Seatbelt Buckled | Seatbelt Unbuckled
Large (>3,000 lbs)  | 1.0              | 2.3
Small (<3,000 lbs)  | 2.1              | 5.0
Let's say, for our example, that 125 people were severely injured or died in large cars and 500 in small cars during the year before the program began. Now the calculation:

1. Subtract the risk ratio for people wearing seatbelts in large cars (1.0) from the risk ratio for people not wearing seatbelts in large cars (2.3): 2.3 - 1.0 = 1.3. The result (1.3) is the amount of the risk ratio that is attributable to not wearing seatbelts.
2. Divide this difference (1.3) by the total risk ratio for people not wearing seatbelts (2.3): 1.3 ÷ 2.3 = 0.565.
3. Express the result as a percentage: 0.565 x 100 = 56.5%. This calculation tells us that, when riding in a large car, people reduce their risk for injury or death by 56.5% if they buckle their seatbelts.
4. Multiply the percentage of decreased risk (56.5%) by the increase in the percentage of people wearing seatbelts in large cars (in our example, 20%): 56.5% x 20% = 0.565 x 0.20 = 0.113 = 11.3%. This calculation shows that injuries and deaths are reduced by 11.3% among people in large cars when 20% more of them buckle their seatbelts.
5. Multiply the percentage of decrease in injuries and deaths in large cars (11.3%) by the number of injuries and deaths in large cars (in our example, 125): 11.3% x 125 = 0.113 x 125 = 14.125. This calculation shows that about 14 fewer people will die or be seriously injured as a result of a 20% increase in seatbelt use by people traveling in large cars.
6. Repeat the same series of calculations for people traveling in small cars.
7. Add the numbers for large cars and for small cars to determine the total number of deaths and serious injuries prevented.

CONVERTING DATA ON BEHAVIOR CHANGE INTO DATA ON COST SAVINGS

To convert data on behavior change (e.g., increased seatbelt use) into estimates of financial savings per dollar spent on your program, you can do the same set of calculations as those used to convert data on behavior change into estimates of changes in morbidity and mortality (page 64).
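Steps 1 through 7, together with the cost-savings conversion they feed into, can be sketched as follows, using Table 4's risk ratios and the example's casualty counts; the cost-per-casualty and program-cost figures are hypothetical, not from this manual:

```python
def casualties_prevented(rr_unbelted, rr_belted, belt_use_increase, casualties):
    """Steps 1-5: fraction of risk removed by buckling up, scaled by the
    increase in belt use and by the prior year's casualty count."""
    risk_reduction = (rr_unbelted - rr_belted) / rr_unbelted   # steps 1-3
    return risk_reduction * belt_use_increase * casualties     # steps 4-5

large = casualties_prevented(2.3, 1.0, 0.20, 125)   # ≈ 14.1
small = casualties_prevented(5.0, 2.1, 0.30, 500)   # = 87.0
total = large + small                               # steps 6-7: ≈ 101

# Cost-savings conversion: multiply the casualties prevented by the average
# cost per death or serious injury, then divide by the total program cost.
cost_per_casualty = 50_000.0   # hypothetical figure
program_cost = 200_000.0       # hypothetical figure
savings_per_dollar = total * cost_per_casualty / program_cost
print(round(total), round(savings_per_dollar, 2))
```

With these hypothetical cost figures, every program dollar would return about $25 in avoided injury costs; the structure of the calculation, not the dollar amounts, is the point.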
Then multiply the number of deaths and injuries prevented by the cost associated with those deaths and injuries, and divide by the total cost of the program. For example, if your program to increase seatbelt use produces an estimate that it saved 14 lives during the previous year, multiply 14 by the average cost-per-person associated with a death due to injuries sustained in a car crash, then divide the result by the total cost of the program.

SUMMARY OF QUANTITATIVE METHODS

Quantitative methods of evaluation allow you to express the results of your activities or program in numbers. Such results can be used to draw conclusions about the effectiveness of the program's materials, plans, activities, and target population. Table 5 lists the quantitative methods we have discussed in this chapter and the purpose of each one.
Table 5. Quantitative Methods Used in Evaluation

Method                                                               | Purpose
Counting systems                                                     | Count events, services, or behaviors (e.g., the number of people wearing seatbelts) before and after a program
Surveys                                                              | Measure the knowledge, attitudes, beliefs, behaviors, or injury rates of the target population
Experimental studies                                                 | Measure a program's effect while minimizing the influence of outside events
Quasi-experimental studies                                           | Measure a program's effect, with reduced influence of outside events, when randomization is not possible
Converting data on behavior change into data on morbidity and mortality | Estimate the decrease in deaths and injuries produced by a change in behavior
Converting data on behavior change into data on cost savings         | Estimate the financial savings per dollar spent on the program
This page last reviewed April 1, 2005. Centers for Disease Control and Prevention