Wired for Health and Well-Being: The Emergence of Interactive Health Communication

Editors: Thomas R. Eng, David H. Gustafson

Suggested citation: Science Panel on Interactive Communication and Health. Wired for Health and Well-Being: The Emergence of Interactive Health Communication. Washington, DC: US Department of Health and Human Services, US Government Printing Office, April 1999.

Chapter IV. Evaluation of IHC Applications

The Panel considers widespread evaluation to be the primary mechanism for improving the quality of IHC. Evaluation is the examination of evidence in a way that provides a full perspective on the expected quality, nature, experience, and outcomes of a particular intervention. Its purpose is to systematically obtain information that can be used to improve the design, implementation, adoption, redesign, and overall quality of an intervention or program. This chapter provides fundamental background on evaluation for developers, purchasers, and others who may need to conduct evaluations or interpret evaluation results.

Types of Evaluation

The design and implementation of an evaluation typically depend on its purpose, the status of the intervention, and the type of decision the evaluation is intended to address (Rossi and Freeman, 1993). The process of evaluation can be defined in the following stages.
The formative, process, and outcome evaluation model might be amplified by another perspective derived from training evaluation. Five levels or facets of evaluation for IHC can be conceptualized (Figure IV-1).
From the perspective of many stakeholders, particularly purchasers and users, evaluation of a proposed health intervention may focus on the central question, "Does this intervention provide enough measurable positive outcomes to justify the cost?" There are no widely accepted standards for measuring outcomes and costs associated with IHC applications. The Panel on Cost-Effectiveness in Health and Medicine, however, recently developed a framework for cost-effectiveness analyses that is applicable to the assessment of any health intervention (PCEHM, 1996; Russell et al., 1996). Their technical guidance may be helpful for developers and evaluators of IHC applications. Outcomes measured for any intervention should include both the benefits and the harms associated with it. In assessing the total costs of an application, it is appropriate to include costs associated with any change in both health- and nonhealth-related resource use. For IHC evaluations, both the actual costs of pilot projects and the projected costs of large-scale implementation should be considered.

Distinction Between Evaluation and Research

Research and evaluation are components of a continuum of disciplined inquiry that are driven by different goals. Research generally has two types of goals: theoretical and empirical. Research with theoretical goals is intended to explain phenomena through the logical analysis and synthesis of the results of scientific investigations, along with theories and principles from other fields and original insights, to develop new theory or refine existing theory. Research with empirical goals is intended to determine how and why phenomena occur by testing hypotheses related to theories, eventually leading to increased capacity to describe, predict, and control phenomena. Evaluation, in contrast, generally has two different types of goals: formative and summative.
Evaluation with formative goals is intended to support the development and improvement of innovative solutions to problems. Evaluation with summative goals focuses on estimating the effectiveness and worth of a particular program, product, or method for the purpose of making a decision about it in an applied setting. Typical decisions might be selection, purchase, certification, extension, or elimination. With a continuum of such goals in mind, the need to make sharp distinctions between research and evaluation is reduced. One point that must be clarified is that rigor and discipline are not distinguishing features between research and evaluation. The research-to-evaluation continuum represents a shift from theoretical goals to goals that are more action-oriented. Research is generally focused on adding to the body of knowledge about phenomena, whereas evaluation is usually focused on solving particular problems; rigor and discipline are important aspects of both.

Qualitative Methods and Statistical Process Control

Evaluation methods often focus on the need to prove rather than explain an effect. Hence, resources are allocated toward large sample sizes and one- or two-time assessments of effect. Such strategies are appropriate for stable applications whose effects need to be demonstrated beyond doubt. However, the field of IHC is evolving, and the content and even the structure of applications will change to keep up with new knowledge. Moreover, because IHC is in its infancy, the goals of evaluation should be not only to determine effectiveness but also to guide improvements. This implies the need for evaluative efforts that explain effect, offer guidance for improvement, and monitor the changing nature of IHC over time. Toward that end, it may be more valuable to monitor impact over an extended period of time on a smaller sample of users and to invest resources in understanding why things happened as they did.
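As a concrete sketch of such extended monitoring of a small user panel, a simple control chart can flag weeks in which outcomes drift outside the range established during a stable baseline period. The metric (a weekly mean symptom-management score), the data, and the three-sigma limits below are hypothetical illustrations, not drawn from the Panel's work:

```python
# Hypothetical illustration: monitoring a small user panel over time with a
# simple Shewhart-style control chart. Data and metric are assumptions.
from statistics import mean, stdev

def control_limits(baseline, k=3.0):
    """Center line and +/- k-sigma limits estimated from a baseline period."""
    center = mean(baseline)
    sigma = stdev(baseline)
    return center - k * sigma, center, center + k * sigma

def out_of_control(observations, lcl, ucl):
    """Indices of weeks whose score falls outside the control limits."""
    return [i for i, x in enumerate(observations) if not (lcl <= x <= ucl)]

baseline_weeks = [72, 70, 74, 71, 73, 69, 72, 75]  # stable early period
later_weeks = [71, 73, 70, 58, 72, 74]             # week 3 dips sharply
lcl, center, ucl = control_limits(baseline_weeks)
flagged = out_of_control(later_weeks, lcl, ucl)
print(f"center={center:.1f}, limits=({lcl:.1f}, {ucl:.1f}), flagged weeks={flagged}")
```

A flagged week does not by itself explain why scores dropped; it signals where the qualitative follow-up (observation, interviews) should concentrate.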
Qualitative research methods and statistical process control may be important resources for such evaluation strategies. Qualitative research relies on observation and interviews with stakeholders to better understand the underlying causes of success or failure. This understanding can be very important as ongoing improvements to the application are made. Statistical process control provides a strategy for monitoring application performance over time and identifying when the application is moving out of control. Such techniques could help monitor the dynamic nature of electronic support groups and identify whether discussions are having detrimental effects. They could also be useful for detecting significant changes in use patterns that may warrant further examination or even intervention. However, these strategies could also be used to assess the effectiveness of IHC applications. Because the applications are so dynamic and their impact may be cumulative, the goal may be to conclude not beyond a doubt at one point in time but beyond a reasonable doubt across the life span of the application.

Potential Benefits of Evaluation

From the perspective of potential stakeholders of IHC applications, the potential benefits of widespread evaluation include the following (Eng et al., 1999):
Developers of IHC applications also may benefit substantially from adopting a norm of evaluation (Henderson et al., 1999). From their perspective, evaluation may improve their chances of success in the following ways:
Psychosocial Theories and Models and Evaluation of IHC

The psychosocial theories and models summarized in Chapter III can be utilized in evaluations of IHC applications. For example, researchers have examined whether appropriate matching (tailoring) of psychosocial concepts to individuals influences behavior change and informed decisionmaking more than providing unmatched concepts (Curry et al., 1991; Velicer et al., 1993; Campbell et al., 1994; Skinner et al., 1994; Strecher et al., 1994; Brug et al., 1996, in press; Shiffman et al., 1997) or deliberately mismatched communications (Dijkstra et al., in press). These outcome evaluations provide information about whether the overall approach was successful. It should be possible to determine whether IHC applications are influencing targeted psychosocial concepts and whether these applications are moving individuals through the maps laid out by theory-builders. Assessing the concepts either before or as part of the IHC application, followed by post-treatment assessment of the same concepts, allows evaluators to examine changes in the concepts targeted by the application. For example, if perception of one's risk is viewed as an important factor in health-related behavior change, then it should be possible to determine whether the application is influencing this concept. In turn, it should be possible to determine whether changes in risk perception influence changes in the targeted behavior. Many current psychosocial theories are sufficiently organized to hypothesize the relevance of a construct based on the specific state of the individual. For example, in a number of models, risk perception would be more relevant to an individual not interested in changing a health-related behavior than to an individual ready to make the change.
Other concepts, such as self-efficacy, become more relevant once the person is interested in making a change (Velicer et al., 1985; Bandura, 1986; Weinstein, 1988; Prochaska et al., 1992; Strecher and Rosenstock, 1998; Dijkstra et al., in press). Assessment of motivational stage has been an important method of framing a broad spectrum of behavior change interventions (Velicer et al., 1985; Prochaska et al., 1992). Evaluative efforts could, in turn, determine whether the individual moves through these stages of change as a result of the IHC application. Standard evaluations of outcomes determine whether an application works. Evaluations that examine intermediate psychosocial concepts linked with a conceptual framework of the IHC application determine why an intervention did or did not work. Both are important as more powerful, relevant applications are developed. Understanding how and when to measure intermediate psychosocial processes requires an understanding of the relevant theories and the psychometric properties of the concepts within these theories. For this reason, it is important that individuals with expertise in behavior change and decisionmaking theories become involved at the earliest stages of IHC application development. Explicit development of the conceptual frameworks guiding the content of a program may lead to stronger applications and improve the quality of evaluations for IHC applications.

Link Between Application Development and Evaluation

Evaluation of IHC is an ongoing process that begins during the product development cycle and continues for the life of the product. Given the highly dynamic state of IHC, development and accompanying evaluations would never really end because content will become outdated and new technology-based approaches and delivery methods will emerge. In addition, there is a role for evaluation even after an evaluated product has been in the marketplace for a period of time.
As with drugs and medical devices, post-marketing surveillance data can alert developers and policymakers to potential harm associated with product use that may not have been detected in initial evaluations among limited study populations. It is helpful for developers to understand the relationship between development and evaluation activities during the product development cycle. An inventory of potential application development and evaluation activities is presented in Table IV-1. At each stage of application development, from conceptualization and design to assessment and refinement, there is a series of evaluation activities that are relevant and should be considered. An array of evaluation methods and tools can be used to implement these evaluation activities. As illustrated by Table IV-1, there may be some overlap between development and evaluation activities. Ideally, an evaluation plan should be formulated at the conception of an application. User needs and the objectives of the application should be clearly specified prior to implementation. Identifying intended effects helps define the outcomes of interest and the appropriate evaluation design to measure outcomes. Needs assessment is one of the initial stages of evaluation and the results of this analysis help determine product specifications. Evaluations during product development include component testing to ensure that all aspects of the system perform accurately and meet design specifications. Iterative usability testing to ensure that the product meets the needs of potential users with regard to usability and the facilitation of workflow or tasks is critical. Experience has shown that several 1- to 2-hour sessions where individual learners are observed as they use an IHC application, and then are personally interviewed, can provide accurate usability feedback. Just four or five participants can provide sufficient information to complete a study of an application. 
Because of the small number of participants, this approach is more easily arranged than those with larger groups, and can be completed in one to three days depending on the facility and personnel available. If there is sufficient funding, IHC designers should utilize the services of professional usability testers. If funding is modest, designers may choose to conduct their own usability testing using portable usability lab equipment. When conducting one-on-one usability studies, it is very helpful to maintain a relaxed and informal atmosphere that encourages both negative and positive participant feedback. Without proper rapport, participants will likely be less open and may unintentionally invalidate the study. Developers should realize, however, that a usability lab may be much more of a controlled environment than the home. With experience, any developer can learn the skills necessary to conduct usability testing at the minimum level of formality required to obtain strong evidence that can be used to improve an application. The next stage of evaluation is to measure outcomes during system use. At this stage, conducting a pilot evaluation to work out the implementation details of the evaluation and assessment tools is often helpful. Quite often, there are obvious misunderstandings of terms or unanticipated barriers that can be corrected before beginning the larger, more complete study. Because evaluation of IHC applications should be a continuous process, there is no "final" stage of evaluation. For many IHC applications, a long-term commitment to a process of updating and revision with ongoing quality-assurance evaluations is required.
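The claim above that four or five observed participants can suffice has a well-known rationale in the usability-engineering literature (Nielsen and Landauer), which is an outside reference rather than part of the Panel's report: if each participant independently uncovers a given problem with probability p, the expected share of problems found by n participants is 1 - (1 - p)^n. A short sketch using the commonly cited average discovery rate of about 0.31 per user:

```python
# Expected fraction of usability problems uncovered by n test participants,
# per the Nielsen-Landauer problem-discovery model (an outside assumption,
# not the Panel's own analysis). p is the per-user discovery probability.
def problems_found(n_participants, p=0.31):
    """Expected share of problems found: 1 - (1 - p)^n."""
    return 1 - (1 - p) ** n_participants

for n in (1, 3, 5, 10):
    print(f"{n:2d} participants -> {problems_found(n):.0%} of problems")
```

With p = 0.31, five participants are expected to surface roughly 84 percent of the problems, which is why iterative rounds of small tests tend to be more cost-effective than one large study.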
Partially adapted from: National Cancer Institute. Making Health Communication Programs Work. Bethesda, MD: National Institutes of Health, US Department of Health and Human Services. NIH Publication No. 89-1493, April 1989. Original version published in: Eng TR, Gustafson D, Henderson J, Jimison H, Patrick K, for the Science Panel on Interactive Communication and Health. Introduction to evaluation of interactive health communication applications. Am J Prev Med 1999;16:10-15.

Challenges of Evaluating IHC Applications

It would be misleading to suggest that high-quality evaluations of applications will be conducted if only developers would simply decide to do so. Indeed, there are several challenges to the evaluation of IHC applications, some technical and some related to external forces, that will need to be addressed. High-quality evaluations will require careful planning and implementation, along with consideration of the following factors:
Evaluation Criteria

A number of organizations and individuals have published, and in some cases implemented, criteria for evaluating the appropriateness or quality of health-related and other Web sites (Jadad and Gagliardi, 1998; Pealer and Dorman, 1997). Some of these criteria are the basis for tools used to produce a summary rating or grade to help potential users assess a site. There are literally dozens of criteria proposed in the literature (Kim et al., 1999), many of which are closely related. In selecting and prioritizing criteria for evaluating IHC applications, developers and other evaluators often will consider many factors, including the objectives of the application and the preferences and values of the evaluator and potential users.6 After identifying relevant criteria, evaluators may assign each criterion a relative weight that varies depending on the application. For example, for an application that provides information about clinical trials to the general public, accuracy and appropriateness of content may receive relatively heavy weighting. In contrast, evaluators of an application that focuses on enhancing peer support for a chronic health condition among a disabled population may choose to emphasize the usability of the program. For general purposes, key criteria that can be applied to most programs include (Henderson et al., 1999):
Standards of Evidence

Much of the controversy in the field of evaluation has to do with standards of evidence. An understanding of this concept is helpful in interpreting evaluation results. Two central concepts are the reliability and validity of the evaluation.

Reliability and Validity

Reliability can be seen as repeatability: If one asks the same question of the same people repeatedly, would he or she get the same answer? Poor reliability makes it much more difficult to measure the effect of an intervention. Thus, it is very important in evaluations to be certain that what one is asking is understood fully by those who are being asked, and that they can provide dependable or reliable answers. The validity of evaluation findings can be viewed as the truthfulness of the findings. Do the measures really reflect what is intended to be measured? Are the findings correct, or are they an aberration? Are they meaningful in this context? There are two types of validity: internal and external. Internal validity is the validity of the findings within the study itself. External validity is the validity of applying the findings to other situations. External validity often is referred to as "generalizability." If the people who tested a program liked it, will everyone else who uses it have the same overall reaction? Can the results obtained with the study sample be generalized to other groups? Generalizability can be critically important because, in some situations, developers rely on the findings or results obtained by others. For example, if tailoring improves message impact in similar settings, it may be more appropriate for a developer simply to adopt a proven approach rather than to conduct additional evaluations.

Judging Effect: Statistical Significance and Effect Size

Many evaluators emphasize the statistical significance of outcome findings, and some may conduct statistical tests on a variety of outcomes hoping to find a statistically significant result.
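The hazard in that practice, testing many outcomes in the hope that one comes out significant, can be quantified. A small sketch with illustrative numbers, assuming independent tests of outcomes that have no true effect:

```python
# With m independent null outcomes each tested at significance level alpha,
# the chance of at least one spurious "significant" finding grows quickly.
# Illustrative calculation only; real outcomes are rarely fully independent.
def chance_of_false_positive(m_outcomes, alpha=0.05):
    """Probability that at least one of m null tests appears significant."""
    return 1 - (1 - alpha) ** m_outcomes

for m in (1, 5, 10, 20):
    print(f"{m:2d} outcomes tested -> "
          f"{chance_of_false_positive(m):.0%} chance of a false positive")
```

At twenty outcomes the chance of at least one false positive exceeds 60 percent, which is why an evaluation plan should specify its primary outcomes in advance.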
Although statistical significance is an important measure of intended effect, it can be over-emphasized. The key concepts underlying statistical significance are as follows: To what degree are we confident that the results did not occur by chance? Is there really a connection between use of the program and the outcomes? What are the chances that the outcomes really are due to the intervention rather than to chance alone? The traditional metric of scientific studies is a p-value less than 0.05, which means that if the program truly had no effect, a result at least this large would be expected no more than 5 percent of the time, or 1 in 20 times, by chance. Reporting absolute probabilities often may be helpful. Statistical significance depends greatly on the size of the study sample (i.e., the number of participants in the evaluation). A larger sample size and/or a larger effect size both contribute to greater statistical significance. When judging the usefulness of an IHC application, effect size often is a more important concern. Effect size describes the magnitude of the impact the intervention has on its users. For example, for a program that encourages diabetics to monitor their blood sugar more carefully, just how much more (or less) carefully do they do it after using the program? If an application is designed to decrease utilization of a service, to what extent do users of the program utilize that service less (or more) than people who did not use the program? While the statistical significance of results is important, it may be more meaningful to know how strongly the program affected its users. Therefore, effect size should be considered along with statistical significance in evaluating outcomes.

What is a reasonable standard of evidence for IHC applications?
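The distinction drawn above between significance and magnitude can be illustrated with hypothetical numbers: the same small improvement (here an assumed 0.2 extra blood-sugar checks per week) produces an ever-larger test statistic as the sample grows, while the effect size (Cohen's d, a standard magnitude measure) stays constant:

```python
# Effect size versus statistical significance, with hypothetical numbers.
# Cohen's d measures magnitude of impact; the two-sample z statistic grows
# with sample size even though the underlying effect does not change.
from math import sqrt

def cohens_d(mean_a, mean_b, pooled_sd):
    """Standardized effect size: difference in means, in SD units."""
    return (mean_a - mean_b) / pooled_sd

def z_statistic(mean_a, mean_b, sd, n_per_group):
    """Two-sample z for equal-sized groups with a common SD."""
    return (mean_a - mean_b) / (sd * sqrt(2.0 / n_per_group))

d = cohens_d(5.2, 5.0, pooled_sd=1.0)  # d = 0.2: a small effect
for n in (25, 100, 2500):
    print(f"n={n:4d} per group: d={d:.2f}, "
          f"z={z_statistic(5.2, 5.0, 1.0, n):.2f}")
```

Only at the largest sample does z clear the conventional 1.96 threshold, yet d remains 0.2 throughout; a "significant" result from a huge sample may still describe an impact too small to matter to users.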
Subjecting all IHC applications to randomized controlled trials is neither practical nor appropriate. Although such trials produce the strongest evidence, they are not suitable for all interventions or for all stages of product development and dissemination. Developers face the challenge of balancing the need to conserve limited resources with protecting the safety of users and ensuring that the program is effective. One reasonable approach is to match the level of evaluation to the intended purposes of the application and the resources it consumes. That is, in the case of applications that have substantial potential risk or require a large investment, it seems appropriate to demand a higher level of evidence, such as an appropriately designed and implemented randomized controlled trial. The level of confidence in the evidence of safety and efficacy for such interventions (e.g., shared decision support applications for serious illnesses) should be "beyond a reasonable doubt." However, for interventions that have minimal potential risk and require few resources (e.g., Web sites that provide general information from trusted and reliable sources), formative and process evaluations may be sufficient to provide a "preponderance of evidence" indicating that the application will be beneficial to users. In addition, evaluation methods such as interviews and focus groups often may provide insights into how an application may benefit users that are as important as those obtained from randomized controlled trials.

Standardized Reporting of Evaluation Results

Prior to the Panel's work, there were no models for standardized reporting of evaluation results for IHC applications.
As a first step toward promoting appropriate evaluation and disclosure about IHC applications, the Panel developed an "evaluation reporting template" (Appendix A) and a "disclosure statement" (Appendix B) to serve as a guide for reporting essential information and the results of any evaluations about a specific IHC application (Robinson et al., 1998). The template is based on the rationale that all applications should undergo some level of evaluation, and that the results of such evaluations should be available to potential users and purchasers of the application. Disclosure of such information may enable potential users, purchasers, and others to judge the appropriateness of a given IHC application for their needs and to compare one application with another. The notion of disclosure of information about IHC applications is similar to the common practice of disclosing information about the use of a potential intervention or consumer product. Examples of this practice include health professionals informing patients about the risks and benefits of potential treatment options or experimental trials (Rodwin, 1989), and manufacturers disclosing product information (e.g., automobile specifications, nutritional content analyses) that may be critical to a potential buyer's decision. In developing the template, the Panel identified a critical set of information that would help inform decisions about use and purchase and also would apply to essentially all IHC applications, regardless of the specific technologies or communication strategies employed or the goals of the program. Some developers may find addressing all the elements of the template somewhat overwhelming, but not all IHC applications need to be evaluated in all of the categories specified in the template. To the contrary, evaluation targets should reflect the specific needs of the target audience and the objectives of the developer.
The Panel believes that all IHC stakeholders can benefit from a voluntary standard of reporting evaluation results. This template and its future versions can: 1) assist developers in planning, conducting, and reporting the results of their evaluations; 2) help users determine which applications are most likely to benefit them given their particular needs; 3) assist clinicians in selecting relevant applications for their patients; and 4) help purchasers, investors, and policymakers focus on the most promising applications and strategies for investment and dissemination. Will developers of IHC applications voluntarily disclose information about their products? As mentioned previously, there are several benefits to developers who conduct evaluations. With increased awareness among users and purchasers about the possibility of harmful effects or no effect from IHC applications, these groups will increasingly seek information about an application before using or purchasing it. If the current leaders in IHC development begin the process of public disclosure of information about their products, market forces may pressure other developers to follow. Although version 1.0 of the template arose from an extensive multiyear development effort, additional refinement is necessary, and the template will need to be updated as it is used and the field evolves. As with all instruments of this type, deficiencies will be identified and improvements can be made as the template and disclosure statement are circulated to, and used by, wider audiences.

5. There is limited scientific research on the impact that public release of evaluation results of goods and services has on subsequent sales, but anecdotal reports suggest that products rated highly by Consumer Reports tend to sell better and low-rated products decrease in sales (Shapiro, 1992; Kelly, 1994; Eldridge, 1997).