Methodology (continued)
Collecting and Validating Baseline Data
Each reviewer then independently reviewed the same first wave of five websites drawn from the stratified random sample. We assessed inter-rater reliability, identified and resolved discrepancies, and revised the protocols or clarified definitions where indicated. We repeated this double-review process on two successive waves of five websites drawn from the full sample, achieving on both waves a raw Kappa score of 0.80 (a level generally accepted as demonstrating acceptable inter-rater reliability on survey protocols). Thereafter, each reviewer conducted separate reviews of the remaining sites, alternating between waves drawn from the stratum of most frequently visited websites and the remainder, so that each reviewer reviewed an equal number of websites from both strata. The review supervisor was available throughout the data collection period to answer questions and to establish and clarify decision rules. In addition, one website randomly selected from every two waves was reviewed independently by both reviewers to ensure that inter-rater reliability remained high; again, discrepancies were identified and resolved to arrive at a single score for each doubly reviewed site. In total, 24 websites from the final sample were doubly reviewed and 78 were singly reviewed. The raw Kappa score for all doubly reviewed sites was 0.79 across all response items; the adjusted Kappa coefficient for disclosure elements that counted toward scoring was 0.81.[11]
Once reviewers had completed the initial baseline data collection, the MPR review supervisor cleaned and validated the data by (1) reviewing all responses that reviewers had flagged with comments, (2) reviewing all "other" responses and reassigning them to specific response categories, (3) reviewing and validating all "not applicable" responses, (4) flagging missing responses and returning items to the reviewer for completion, (5) reviewing all items with a "no" response where a URL was indicated, and (6) reviewing and validating a subset of items with a "yes" response where no URL was indicated. Questions of interpretation that arose during this review were discussed with the reviewers and the MPR project director, and the data were adjusted as appropriate.
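To make the reliability calculation concrete, the sketch below shows how a raw Kappa score of the kind reported above can be computed from two reviewers' codes for the same items. This is a minimal Python illustration, not the study's actual tooling; the function name and the sample responses are hypothetical.

from collections import Counter

def cohens_kappa(rater_a, rater_b):
    # Raw (Cohen's) kappa: chance-corrected agreement between two raters.
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected chance agreement from each rater's marginal frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[c] * freq_b.get(c, 0) for c in freq_a) / n ** 2
    return (observed - expected) / (1 - expected)

# Hypothetical compliance codes for one doubly reviewed website.
reviewer_1 = ["yes", "yes", "no", "n/a", "yes", "no"]
reviewer_2 = ["yes", "no",  "no", "n/a", "yes", "no"]
print(round(cohens_kappa(reviewer_1, reviewer_2), 2))

The adjusted Kappa described in footnote 11 would first collapse the response options that all count as compliance into a single category before applying the same calculation.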
We then weighted the baseline data to account for the disproportionate sampling across the two strata (most frequently visited websites and remainder websites) and for ineligible sites within each stratum that were eliminated from the final sample. Next, we analyzed the data using SUDAAN to produce weighted estimates of the percentages of health websites in compliance with the criteria (and with the disclosure elements associated with each criterion), as well as weighted estimates of compliance among the most frequently visited websites and the remainder websites. We also calculated weighted standard errors, relative standard errors, and 95 percent upper and lower confidence limits for all of the estimated percentages. Finally, we tested for statistically significant differences in compliance percentages across the two strata by criterion, at the 0.05 level of significance.
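As an illustration of the weighting step, the sketch below computes a weighted compliance percentage and an approximate standard error from stratum-level data. The stratum counts and compliance flags are invented for the example, and the simple variance formula only approximates the design-based (Taylor-linearized) estimates that SUDAAN produces.

import math

# Hypothetical stratum data: population count (N) and 0/1 compliance
# flags for the sampled sites. All numbers are illustrative only.
strata = {
    "most_visited": {"N": 100, "flags": [1, 1, 0, 1, 1]},
    "remainder":    {"N": 900, "flags": [0, 1, 0, 0, 1, 0]},
}

def weighted_compliance(strata):
    # Weighted percent compliant, with each stratum weighted by its
    # share of the population rather than its share of the sample.
    total_n = sum(s["N"] for s in strata.values())
    p_hat, var = 0.0, 0.0
    for s in strata.values():
        n_h = len(s["flags"])
        w_h = s["N"] / total_n            # stratum population share
        p_h = sum(s["flags"]) / n_h       # within-stratum compliance rate
        p_hat += w_h * p_h
        # Simple within-stratum variance; a full design-based estimate
        # would also apply finite-population corrections.
        var += w_h ** 2 * p_h * (1 - p_h) / max(n_h - 1, 1)
    return 100 * p_hat, 100 * math.sqrt(var)

estimate, se = weighted_compliance(strata)
lower, upper = estimate - 1.96 * se, estimate + 1.96 * se  # approx. 95% CI
print(f"{estimate:.1f}% compliant (95% CI {lower:.1f} to {upper:.1f})")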
Table 4. Required Elements for Scoring and Optional Elements
[11] The difference between the raw and adjusted Kappa coefficients reflects the fact that multiple response options could count as compliance for particular items. Thus, in a given case, reviewers might disagree on which response option applied but still agree that the item was in compliance. Although the Kappa coefficient is commonly used to assess inter-rater reliability, some statisticians have identified problems with it, including a tendency to produce low scores even when agreement is high. We therefore also calculated inter-rater reliability using Lin's concordance correlation coefficient and found a sample concordance correlation coefficient (ρc) of 0.8037, which similarly suggests moderate to substantial correlation (see the sketch following these footnotes).
[12] The Technical Expert Workgroup initially recommended a disclosure criterion labeled Evaluation, which encompassed two elements: whether websites solicited user feedback and whether they disclosed how they used information from users to improve services or operations (i.e., evaluation of user feedback). Preliminary findings from the pretest suggested that although websites often solicited feedback, they rarely described how this feedback was used to improve services, making it difficult to determine compliance with the second element. Although we tracked both elements, the Project Officer and MPR team determined that only the first would be required for compliance with this criterion, and the criterion was subsequently renamed User Feedback. An Evaluation criterion may be added in the future, as websites' evaluation practices become more clearly defined.
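For reference, Lin's concordance correlation coefficient cited in footnote 11 can be computed as in this minimal sketch; the per-site reviewer scores shown are hypothetical.

def lins_ccc(x, y):
    # Sample concordance correlation coefficient (Lin, 1989):
    # rho_c = 2*s_xy / (s_x^2 + s_y^2 + (mean_x - mean_y)^2)
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sx = sum((v - mx) ** 2 for v in x) / n
    sy = sum((v - my) ** 2 for v in y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    return 2 * sxy / (sx + sy + (mx - my) ** 2)

# Hypothetical per-site scores from the two reviewers.
scores_1 = [8, 6, 7, 9, 5, 7]
scores_2 = [7, 6, 8, 9, 5, 6]
print(round(lins_ccc(scores_1, scores_2), 4))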