Estimating the Proportion of Health-Related Websites Disclosing Information That Can Be Used to Assess Their Quality

Final Report - May 30, 2006


Methodology (continued)


Collecting and Analyzing Data for the Baseline Analysis

Selecting Health Content for Review

Because our protocols call for a review of three separate items of health-related content to answer specific questions, the review supervisor randomly selected three items of health content for review within each website deemed eligible. Our aim in selecting the items for review ahead of time was to minimize selection bias that might result from a given reviewer's particular interests or from website sponsors' efforts to direct users to featured content. For each site, we used random numbers to select three items of health content from available options, starting with hyperlinks on the home page. Any content thus reached that was consistent with the eHealth Code of Ethics definition of health information was sampled, including content reached through hyperlinks to other websites and stand-alone documents in .pdf format. However, we did not include health content that was in audio or video format.
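As a rough illustration of this step, the sketch below (in Python, with hypothetical function and variable names) shows one way a reproducible random draw of three items from a list of candidate home-page hyperlinks could be implemented. It is not the selection tool actually used in the study.

    import random

    def select_content_items(candidate_links, n_items=3, seed=None):
        """Randomly pick n_items health-content hyperlinks from a candidate list.

        candidate_links: list of (anchor_text, url) tuples gathered from the
        home page (hypothetical input format for this sketch).
        """
        rng = random.Random(seed)  # seeded so the selection can be reproduced
        if len(candidate_links) <= n_items:
            return list(candidate_links)
        return rng.sample(candidate_links, n_items)

    # Example: three items drawn from five candidate links on a home page
    links = [("Diabetes basics", "/diabetes"),
             ("Flu FAQ", "/flu-faq"),
             ("Heart health", "/heart"),
             ("Nutrition guide", "/nutrition.pdf"),
             ("Asthma overview", "/asthma")]
    print(select_content_items(links, seed=2006))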

Collecting and Validating Baseline Data

We transferred the revised protocols to an Access database to facilitate data input, scoring, and analysis. We trained two MPR reviewers on the use of the protocols and briefed them on the nuances of interpretation that arose during the pretest. We then set up two computer screens for each reviewer to allow them to view and navigate both the website under review and the protocol at the same time.

Each reviewer then independently reviewed the same websites from the first wave of five drawn from the stratified random sample. We assessed inter-rater reliability, identified and resolved discrepancies, and revised the protocols or clarified definitions where indicated. We repeated this process of double-reviewing on two successive waves of five websites drawn from the full sample, achieving on both waves a raw Kappa score of 0.80 (a score generally accepted as demonstrating an acceptable degree of inter-rater reliability on survey protocols). Thereafter, each reviewer conducted separate reviews of the remaining sites, alternating between waves drawn from the stratum of most frequently visited websites and the remainder stratum, such that each reviewer reviewed an equal number of websites from both strata. The review supervisor was available throughout the data collection period to answer questions and to establish and clarify decision rules. In addition, one website randomly selected from every two waves was reviewed independently by both reviewers to ensure that inter-rater reliability remained high. Again, discrepancies were identified and resolved in order to arrive at a single score for each doubly reviewed site. In total, 24 websites from the final sample were doubly reviewed and 78 were singly reviewed. The raw Kappa score for inter-rater reliability across all doubly reviewed sites was 0.79 for all response items. The adjusted Kappa coefficient for disclosure elements that counted toward scoring was 0.81.11
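For readers unfamiliar with the statistic, the following minimal Python sketch shows how an unweighted (raw) Cohen's Kappa can be computed from two reviewers' paired responses. The responses shown are illustrative only, not data from this study.

    from collections import Counter

    def cohens_kappa(ratings_a, ratings_b):
        """Unweighted (raw) Cohen's Kappa for two reviewers' paired responses."""
        assert len(ratings_a) == len(ratings_b)
        n = len(ratings_a)
        observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
        # Expected agreement under chance, from each reviewer's marginal rates
        counts_a, counts_b = Counter(ratings_a), Counter(ratings_b)
        categories = set(counts_a) | set(counts_b)
        expected = sum((counts_a[c] / n) * (counts_b[c] / n) for c in categories)
        return (observed - expected) / (1 - expected)

    # Example: two reviewers' yes/no calls on the same disclosure items
    a = ["yes", "yes", "no", "no", "yes", "no", "yes", "yes"]
    b = ["yes", "no",  "no", "no", "yes", "no", "yes", "yes"]
    print(round(cohens_kappa(a, b), 2))  # 0.75 for this illustrative data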

Once reviewers had completed the initial baseline data collection, the MPR review supervisor cleaned and validated the data by (1) reviewing all responses that the reviewers had flagged with comments, (2) reviewing all "other" responses and reassigning them to specific response categories, (3) reviewing and validating all "not applicable" responses, (4) flagging missing responses and returning items to the reviewer for completion, (5) reviewing all items with a "no" response where a URL was indicated, and (6) reviewing and validating a subset of all items with a "yes" response where no URL was indicated. Questions of interpretation that arose during this review were discussed with reviewers and with the MPR project director, and adjustments were made to the data, as appropriate.
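Checks (4) and (5) in particular are simple rule-based flags. The following Python sketch, which assumes a hypothetical record format, illustrates how such responses could be flagged automatically ahead of the supervisor's manual review; it does not represent the procedure actually used.

    def flag_for_review(records):
        """Flag response records for supervisor follow-up (hypothetical format).

        Each record is a dict with keys such as 'response' ('yes', 'no', 'other',
        'not applicable', or None) and 'url' (a supporting URL, or None).
        Returns a list of (record, reason) pairs mirroring checks (4) and (5).
        """
        flags = []
        for rec in records:
            if rec.get("response") is None:
                flags.append((rec, "missing response"))           # check (4)
            elif rec["response"] == "no" and rec.get("url"):
                flags.append((rec, "'no' response with a URL"))   # check (5)
        return flags

    # Example: one missing response and one inconsistent 'no' response
    records = [
        {"response": None, "url": None},
        {"response": "no", "url": "http://example.org/privacy"},
        {"response": "yes", "url": "http://example.org/about"},
    ]
    for rec, reason in flag_for_review(records):
        print(reason)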


Scoring and Analyzing Data

Once the baseline data were cleaned and validated, we coded all responses for scoring and analysis. Because the pretest revealed a lack of consistency in the way websites describe some elements, the protocol includes multiple response options for some items, any one or combination of which may count as disclosure. We assigned one point for any response option that would count as disclosure of a required element. For disclosure elements that were associated with selected items of health information content, we assigned one point per item of health content. We then determined compliance at the criterion level: if the total score for a criterion equaled the number of required disclosure elements subsumed under that criterion, the site was determined to be in compliance on that criterion. If the total score for the criterion was less than the number of required disclosure elements, the site was designated as noncompliant on that criterion, even if some of the elements were present. The number of points needed for compliance varied by criterion, from one point for the User Feedback criterion to six points for the Content Updating criterion (which required two disclosure elements on each of the three selected items of health content).12 To be fully compliant with all six criteria, a website needed to disclose 20 separate elements. Some optional elements of interest to ODPHP were also tracked but did not count toward disclosure and were not included in scoring. Table 4 shows the criteria, the required (and optional) disclosure elements, and the points assigned to each.
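The criterion-level compliance rule described above can be summarized in a short sketch. The Python below uses the point totals from Table 4 and a hypothetical input format, and is illustrative only.

    # Points required for compliance with each criterion (from Table 4)
    REQUIRED_POINTS = {
        "Identity": 3,
        "Purpose": 3,
        "Content": 5,
        "Privacy": 2,
        "User Feedback": 1,
        "Content Updating": 6,
    }

    def criterion_compliance(points_earned):
        """Return True/False compliance per criterion.

        points_earned: dict mapping criterion name to the total points a site
        earned for that criterion's required disclosure elements (hypothetical
        input format). A site complies only if it earned every required point.
        """
        return {c: points_earned.get(c, 0) >= needed
                for c, needed in REQUIRED_POINTS.items()}

    # Example: a site missing one Purpose and one Content Updating element
    site_points = {"Identity": 3, "Purpose": 2, "Content": 5,
                   "Privacy": 2, "User Feedback": 1, "Content Updating": 5}
    compliance = criterion_compliance(site_points)
    fully_compliant = all(compliance.values())  # True only with all 20 points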

We then weighted the baseline data to account for the disproportionate sampling across the two strata (most frequently visited websites and remainder websites) and for ineligible sites within each stratum that were eliminated from the final sample. We analyzed the weighted data using SUDAAN to produce estimates of the percentages of health websites in compliance with the criteria (and with the disclosure elements associated with each criterion), as well as estimates of compliance among the most frequently visited websites and the remainder websites. We also calculated weighted standard errors, relative standard errors, and 95 percent upper and lower confidence limits for all of the estimated percentages. Finally, we tested for statistically significant differences in compliance percentages across the two strata, by criterion, at the 0.05 level of significance.
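The weighting logic is conceptually similar to the simplified Python sketch below, which computes a design-weighted compliance percentage for two strata using illustrative counts. The actual analysis was performed in SUDAAN, which also produces design-based standard errors and confidence limits that this sketch omits.

    def stratified_estimate(strata):
        """Weighted compliance percentage for a stratified sample.

        strata: list of dicts with hypothetical keys:
          'population' - number of eligible websites the stratum represents
          'sampled'    - number of sites reviewed in the stratum
          'compliant'  - number of reviewed sites meeting a criterion
        Each reviewed site carries a design weight of population / sampled.
        """
        weighted_compliant = sum(s["population"] * s["compliant"] / s["sampled"]
                                 for s in strata)
        total_population = sum(s["population"] for s in strata)
        return 100 * weighted_compliant / total_population

    # Illustrative figures only: the most frequently visited stratum is
    # oversampled relative to the remainder stratum
    strata = [
        {"population": 100,  "sampled": 40, "compliant": 20},  # most visited
        {"population": 2000, "sampled": 60, "compliant": 15},  # remainder
    ]
    print(round(stratified_estimate(strata), 1))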

Table 4. Required Elements for Scoring and Optional Elements

Identity
  Required: Name of person or organization responsible for website (1 point)
  Required: Street address for person or organization responsible for website (1 point)
  Required: Identified source of funding for website (1 point)
  Optional: Other contact information for person or organization responsible for website
  Subtotal: 3 points

Purpose
  Required: Statement of purpose or mission for website (1 point)
  Required: Uses and limitations of services provided (1 point)
  Required: Association with commercial products or services (1 point)
  Subtotal: 3 points

Content
  Required: Differentiating advertising from non-advertising content (1 point)
  Required: Medical, editorial, or quality review practices or policies (1 point)
  Required: Authorship of health content (per page of health content) (3 points)
  Optional: Names/credentials of reviewers
  Subtotal: 5 points

Privacy
  Required: Privacy policy (1 point)
  Required: How personal information is protected (1 point)
  Subtotal: 2 points

User Feedback
  Required: Feedback form or mechanism (1 point)
  Optional: How information from users is used
  Subtotal: 1 point

Content Updating
  Required: Date content created (per page of health content) (3 points)
  Required: Date content reviewed, updated, modified, or revised (per page of health content) (3 points)
  Optional: Copyright date
  Subtotal: 6 points

Total: 20 points




11 The difference between raw and adjusted Kappa coefficients reflects the fact that multiple response options could count as compliance for particular items. Thus, in a given case, reviewers might disagree on which response option applied but still agree that the item was in compliance. Although the Kappa coefficient is commonly used to assess inter-rater reliability, some statisticians have identified problems with it, including a tendency to produce low scores even when agreement is high. We also calculated inter-rater reliability using Lin's concordance correlation coefficient and found a sample concordance correlation coefficient (pc) of 0.8037, which similarly suggests moderate to substantial agreement.
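For reference, a minimal Python sketch of the sample concordance correlation coefficient, using Lin's standard formula with biased (1/n) variance and covariance estimates; the scores shown are illustrative only, not the study's data.

    def lins_ccc(x, y):
        """Sample concordance correlation coefficient (Lin, 1989).

        pc = 2*s_xy / (s_x^2 + s_y^2 + (mean_x - mean_y)^2),
        with variances and covariance computed using 1/n.
        """
        n = len(x)
        mean_x, mean_y = sum(x) / n, sum(y) / n
        s_x2 = sum((v - mean_x) ** 2 for v in x) / n
        s_y2 = sum((v - mean_y) ** 2 for v in y) / n
        s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
        return 2 * s_xy / (s_x2 + s_y2 + (mean_x - mean_y) ** 2)

    # Example: two reviewers' item-level scores for the same sites
    scores_a = [1, 0, 1, 1, 0, 1, 1, 0]
    scores_b = [1, 0, 1, 0, 0, 1, 1, 1]
    print(round(lins_ccc(scores_a, scores_b), 2))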

12 The Technical Expert Workgroup initially recommended a disclosure criterion labeled Evaluation, which encompassed two elements: whether websites solicited user feedback and whether they disclosed how they used information from users to improve services or operations (i.e., evaluation of user feedback). Preliminary findings from the pretest suggested that although websites often solicited feedback, they rarely described how this feedback was used to improve services, making it difficult to determine compliance with the second element. Although we tracked both elements, the Project Officer and MPR team determined that only the first would be required for compliance with this criterion, and the criterion was subsequently renamed User Feedback. An Evaluation criterion may be added in the future, as websites' evaluation practices become more clearly defined.


