HINTS: Health Information National Trends Survey

ANALYTIC METHODS TO EXAMINE CHANGES ACROSS YEARS USING HINTS 2003 & 2005 DATA

Division of Cancer Control and Population Sciences

Lou Rizzo, Ph.D.^{1}
^{1}Westat Inc.

1. INTRODUCTION

The Health Information National Trends Survey (HINTS) is a national, biennial survey designed to collect nationally representative data on the American public’s need for, access to, and use of cancer-related information. The primary task of HINTS is to monitor changes in the rapidly evolving field of health communication. This survey is sponsored and directed by the National Cancer Institute’s Division of Cancer Control and Population Sciences. The baseline year is 2003, and data from the first follow-up sample in 2005 are also available (see http://hints.cancer.gov). A second follow-up sample (for 2007) is currently being implemented.

Each biennial sample is drawn using a random-digit-dial (RDD) sample design to produce a representative sample of telephone households in the country. Exchanges with high percentages of Blacks and Hispanics were oversampled in 2003, in order to provide a larger yield of these important subgroups. In a second stage of selection, one adult was randomly selected among all adults living in the sampled household. This adult was recruited to complete the main survey instrument by telephone interview^{4}.

Weights are assigned to account for all of the stages of selection (from the RDD sampling frame and within the household), and for attrition from noncontacts, screener nonresponse, and interview nonresponse. These weights are designed to provide approximately unbiased estimators of population totals using a modified Horvitz-Thompson estimator (see for example Cochran 1977, Section 9A.7)^{5}. Replicate weights are also provided to allow for consistent variance estimation.
The replicate weights for all of the biennial HINTS surveys are based on the jackknife replication method, with R = 50 replicate weights for each survey year. The replicate weights are formed by deleting a carefully selected portion of the original sample (roughly 1/50 of the original sample), and reweighting the remaining sample as if the complement set were the full sample. Estimates are computed using each set of replicate weights, generating a set of replicate estimates parallel to the estimate of interest. The sum of squares of the deviations between the replicate estimates and the ‘full-sample’ estimate, with appropriate adjustment, provides a consistent estimator of the variance. For example, suppose \(\hat{\theta}\) is an estimator (a percentage within a subgroup, for example) using the ‘full-sample’ weights. We generate replicate estimators \(\hat{\theta}_{r}\) in parallel, doing the calculation in the same way, but using each set of replicate weights in place of the original full-sample weights. The jackknife variance estimator of \(\hat{\theta}\) is

\[ v(\hat{\theta}) = \frac{R-1}{R} \sum_{r=1}^{R} \left( \hat{\theta}_{r} - \hat{\theta} \right)^{2}, \qquad R = 50. \]

Final methodology reports are available for both HINTS 2003 and HINTS 2005 and are accessible online at no cost on http://hints.cancer.gov. These reports provide details of the sampling and weighting for the respective surveys. This methodology paper is closely based on a similar methodology paper (Lee, et al. 2007) for the California Health Interview Survey (CHIS).

^{4}In HINTS 2005, a small number of persons completed interviews via the Internet, as part of an experimental study nested within the main HINTS survey.

2. THREE TYPES OF ANALYSES USING MULTIPLE BIENNIAL HINTS SURVEYS

Throughout this document, we will provide examples of HINTS analyses, using as our primary outcome for each example an estimate from HINTS of the percentage of respondents who ever looked for cancer information using the Internet^{6}.
Table 2 below presents the estimates from HINTS 2003 and HINTS 2005 for the overall population and for sociodemographic subgroups of general interest, as well as standard errors (the square roots of the jackknife variance estimates).

Research based on a series of cross-sectional surveys often emphasizes the results of the new survey but also includes testing for changes between survey iterations, i.e., examining trends in responses to a given survey item over time. This document focuses on three general goals and provides SAS/SUDAAN and STATA syntax examples for each when making inferences from multiple cross-sectional surveys:

Goal 1: Estimating changes without controlling for other factors (Section 3).
Goal 2: Estimating changes controlling for other factors (Section 5).
Goal 3: Estimating averages by combining 2003 and 2005 data (Section 6).

Table 2 Estimates of percentages of adults who have ever looked for cancer information online.
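The standard errors in Table 2 come from the jackknife variance computation described in Section 1. A minimal Python sketch of that computation follows. It assumes the usual delete-a-group multiplier (R - 1) / R for the "appropriate adjustment"; the function name and all numbers are hypothetical illustrations, not HINTS results.

```python
import math


def jackknife_variance(full_estimate, replicate_estimates):
    """Jackknife variance from a full-sample estimate and R replicate estimates.

    Assumption: the adjustment factor is (R - 1) / R, the usual choice for
    delete-a-group jackknife weights; confirm against the methodology report.
    """
    r = len(replicate_estimates)
    sum_sq = sum((rep - full_estimate) ** 2 for rep in replicate_estimates)
    return (r - 1) / r * sum_sq


# Hypothetical inputs: one full-sample proportion and 50 replicate proportions.
full = 0.30
reps = [0.30 + 0.001 * ((i % 5) - 2) for i in range(50)]

variance = jackknife_variance(full, reps)
std_error = math.sqrt(variance)  # the quantity reported as a standard error
```

In practice the replicate estimates would come from re-running the same weighted calculation once per replicate weight.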
Note that Goals 1 and 2 are relevant when testing for differences or change in responses to survey items that are identical (or comparable) across years, while Goal 3 would be used to combine across years to obtain one larger sample.

^{6}The exact derivation of the example percentage from the HINTS questionnaire items is given in Appendix A.

3. GOAL 1–ESTIMATING CHANGES WITHOUT CONTROLLING FOR OTHER FACTORS

It is easy to produce an estimate of change in characteristics between 2003 and 2005 and its corresponding variance estimate, because HINTS samples are drawn independently. Here we will label HINTS 2003 "year 1" and HINTS 2005 "year 2," and consider estimating a characteristic \(\theta\) (e.g., a mean, percentage, regression coefficient, population standard deviation) in year s. We label the true value in year s as \(\theta_{s}\), the estimated value as \(\hat{\theta}_{s}\), and the estimated variance (the square of the standard error) as \(v(\hat{\theta}_{s})\). The true change between years is \(\Delta = \theta_{2} - \theta_{1}\), with consistent estimator

\[ \hat{\Delta} = \hat{\theta}_{2} - \hat{\theta}_{1}. \]

Because the samples are independent, the variance is the sum of the two variances, and a consistent variance estimator is

\[ v(\hat{\Delta}) = v(\hat{\theta}_{1}) + v(\hat{\theta}_{2}). \]
Table 3-1 provides a summary of this information.

Table 3-1 Summary of Estimating Changes Using Two Independent Surveys.
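As a worked illustration of the summary above, the following Python sketch computes the change estimate and its standard error from two separately computed yearly estimates. The function name and the input values are hypothetical; only the arithmetic (difference of estimates, sum of variances) is taken from the text.

```python
import math


def change_between_years(est1, var1, est2, var2):
    """Return (difference, variance, standard error) for year 2 minus year 1."""
    diff = est2 - est1
    var_diff = var1 + var2  # valid because the two samples are independent
    return diff, var_diff, math.sqrt(var_diff)


# Hypothetical yearly estimates and standard errors (not HINTS results).
est_2003, se_2003 = 0.25, 0.010
est_2005, se_2005 = 0.32, 0.012

diff, var_diff, se_diff = change_between_years(
    est_2003, se_2003 ** 2, est_2005, se_2005 ** 2
)
```

Note that the standard errors themselves are not added; they are squared, summed, and the square root is taken at the end.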
The null hypothesis of no change (\(\theta_{1} = \theta_{2}\)) can be tested against a one-sided (\(\theta_{1} < \theta_{2}\)) or two-sided (\(\theta_{1} \neq \theta_{2}\)) alternative. The one-sided alternative may be more appropriate when any change that occurs is expected to be positive (such as in the degree of Internet usage). The test statistic is

\[ t = \frac{\hat{\Delta}}{\sqrt{v(\hat{\Delta})}} = \frac{\hat{\theta}_{2} - \hat{\theta}_{1}}{\sqrt{v(\hat{\theta}_{1}) + v(\hat{\theta}_{2})}}. \]

For national estimates (in contrast to subgroups) this can be referred to a t-distribution, using either the one-sided cutoff \(t_{\alpha,df}\) or the two-sided cutoff \(t_{\alpha/2,df}\). Finding the correct number of degrees of freedom is not a trivial task. Appendix C provides a method (Welch’s method) for approximating the number of degrees of freedom, and shows why the t-distribution on 49 degrees of freedom will be the most conservative (i.e., giving the widest confidence intervals), thereby reducing the likelihood of committing a Type I error. Using Welch’s method, the number of degrees of freedom will be between 49 and 98. It should be noted that all of these t-distributions are close to each other, and close to the standard normal distribution (i.e., the corresponding percentiles are nearly equal). For most applications for HINTS, the Welch approximation assuming 49 degrees of freedom for each year will be reasonable.

The degrees of freedom for the chi-square distribution can be no larger than the number of independent nonzero squares that underlie the variance estimator. Suppose for example that a particular estimate is restricted to a limited subgroup of the sample, so that many of the replicate squared deviations are negligibly close to zero (see the equation for the jackknife variance estimator at the end of Section 1). In this case, a smaller number of degrees of freedom should be used^{7}. SAS/SUDAAN does allow the user to specify degrees of freedom if the user wishes to overrule the software’s choice.
It should be noted that without manual specification the SAS/SUDAAN program uses the total number of replicates as degrees of freedom, and the STATA software uses the total number of replicates minus 1. STATA does not appear to allow for any respecification of degrees of freedom. These degrees of freedom are ‘liberal’ (just beyond the high end of the ‘acceptable’ range as per the Welch method).

Table 3-2 on the next page presents one-sided and two-sided p-values for the null hypothesis of no change between 2003 and 2005 in percentages of adults who had ever looked for cancer information online, both for all adults and for a number of socioeconomic subgroups. Table 3-3 presents corresponding confidence intervals. The Table 3-2 and 3-3 values were computed separately from the two HINTS data sets (using STATA and SAS/SUDAAN to do these separate-year computations), with differences, standard errors, p-values, and confidence intervals computed in Excel, using a t-distribution on 98 degrees of freedom. If the p-value percentage in the table is more than 5% (for example), one would not reject the hypothesis at the 5% significance level. The table shows that for all but four groups (less than high school, Hispanic, non-Hispanic other, and $50,000–$74,999) we would reject the two-sided test of no change at the 5% significance level. Note that the results for ‘all’ and for ‘non-Hispanic Black’ can be used to test the hypotheses for Goal 1: Examples 1 and 2 respectively.

The rows of the table allow the test of 19 hypotheses. If we wish to control the Type I error to 5% over all these hypotheses, we should use a significance level smaller than 5% for each individual test. The most conservative approach is the Bonferroni approach, in which the cutoff p-value is 5% / 19, or 0.26%. Many of the p-values in Table 3-2 pass this most conservative test. These can be confidently viewed as significant results.
There are many other multiple comparisons tests that are less conservative than the Bonferroni approach; these are available in the current versions of both SAS and STATA, for example.

Table 3-2 Estimates of differences of percentages of adults who have ever looked for cancer information online, between 2003 and 2005.
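The Bonferroni cutoff discussed above is simple enough to check by hand, and a short Python sketch makes the rule explicit. The helper names and the three example p-values are hypothetical; only the 5% level and the count of 19 hypotheses come from the text.

```python
def bonferroni_cutoff(alpha, n_tests):
    """Per-test significance level that holds the overall Type I error at alpha."""
    return alpha / n_tests


def significant_after_bonferroni(p_values, alpha=0.05):
    """Flag each labelled p-value against the Bonferroni cutoff."""
    cutoff = bonferroni_cutoff(alpha, len(p_values))
    return {label: p < cutoff for label, p in p_values.items()}


# Cutoff for the 19 hypotheses in the table: about 0.0026, i.e. 0.26%.
cutoff_19 = bonferroni_cutoff(0.05, 19)

# Hypothetical p-values for three groups (not taken from Table 3-2).
example = {"group A": 0.0004, "group B": 0.0200, "group C": 0.3000}
flags = significant_after_bonferroni(example)
```

With three tests the cutoff is 0.05 / 3, so "group B" is significant at the unadjusted 5% level but not after the adjustment.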
One can compute one-sided or two-sided confidence intervals of the difference using similar considerations. The two-sided confidence interval will be

\[ \hat{\Delta} \pm t_{\alpha/2,df} \sqrt{v(\hat{\Delta})}, \]

where \(t_{\alpha/2,df}\) is the two-sided cutoff point using a t-distribution on df degrees of freedom. Checking whether this confidence interval contains zero is equivalent to the two-sided test of the null hypothesis of no change using the corresponding t-distribution. Table 3-3 presents two-sided confidence intervals using the t-distribution for the change in percentage of adults who have ever looked for cancer information online (note that the first two columns of Table 3-3 give the same difference estimates as Table 3-2: they are repeated here because they are the center values of the confidence intervals from the two-sided test). Again, the table shows that for all but four groups (less than high school, Hispanic, non-Hispanic other, and $50,000–$74,999) we would reject the two-sided test of no change at the 5% significance level; for these four groups the confidence intervals include zero.

Table 3-3 Confidence intervals for differences in percentages of adults who have ever looked for cancer information online, between 2003 and 2005.
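The test statistic, p-values, and confidence interval of this section can be sketched in Python using only the standard library. The t-distribution CDF is obtained here by Simpson integration and the cutoff by bisection; these numerical shortcuts, the function names, and the input values are assumptions made for illustration (a statistics package would normally supply the t-distribution). The 98 degrees of freedom match the choice reported for Tables 3-2 and 3-3.

```python
import math


def t_pdf(x, df):
    """Density of Student's t-distribution with df degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))


def t_cdf(x, df, steps=2000):
    """CDF by Simpson's rule on [0, |x|]; accurate enough for illustration."""
    a = abs(x)
    h = a / steps
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    half = total * h / 3
    return 0.5 + half if x >= 0 else 0.5 - half


def t_quantile(p, df):
    """Upper-tail cutoff by bisection; assumes 0.5 <= p < 1."""
    lo, hi = 0.0, 50.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if t_cdf(mid, df) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def no_change_test(diff, se_diff, df, alpha=0.05):
    """Return t statistic, one-sided p, two-sided p, and two-sided interval."""
    t_stat = diff / se_diff
    p_one = 1 - t_cdf(t_stat, df)            # alternative: theta_1 < theta_2
    p_two = 2 * (1 - t_cdf(abs(t_stat), df))
    crit = t_quantile(1 - alpha / 2, df)
    return t_stat, p_one, p_two, (diff - crit * se_diff, diff + crit * se_diff)


# Hypothetical difference and standard error (not HINTS results).
t_stat, p_one, p_two, ci = no_change_test(0.07, 0.0156, df=98)
```

The interval excludes zero exactly when the two-sided p-value is below alpha, which is the equivalence noted in the text.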
^{7}A procedure recommended here is to consider as ‘negligible’ any replicate square in the set of replicate squares that is less than 1% of the median square, which eliminates spurious squares that are essentially zero. The software packages do not currently do this or anything similar to it, so the interested user will need to do this manually.

4. COMBINING THE DATA FILES

For Goal 1, it is only necessary to have the separate 2003 and 2005 data sets, compute the estimates and standard errors, compute differences by subtracting the two sets of estimates, and compute standard errors for those differences by adding the two variances and taking the square root. For Goals 2 and 3 and any more sophisticated analyses, combining the data files will be necessary. It turns out that if the data files are combined properly, the analyses of Goal 1 can also be easily reproduced using the combined data set. The main purpose of Goal 3 is to allow an augmented sample size: both years can be combined, virtually doubling the sample size. This will considerably improve precision for those characteristics which do not change much between the years.

To create the combined data file, one can concatenate the 2003 and 2005 public use files so that the number of respondents in the combined data file is the sum of the respondents from the two individual data files. Two main tasks are required to combine the data files. First, variables used in the analyses should have the same name and values or categories in both data files. Appendix A describes how variables are redefined for the tasks in this document. Second, create a set of new statistical weights as shown in Table 4. There will be 101 weights in the combined data file: 1 final weight and 100 replicate weights. We label them NFWGT and NFWGT1–NFWGT100. The final weight (NFWGT) in the combined file is the final weight (FWGT) from the respective survey.
For the first 50 replicate weights (NFWGT1, …, NFWGT50), we use replicate weights FWGT1, …, FWGT50 for the sample persons from the HINTS 2003 survey, and we use the final weight FWGT (for all 50 replicates) for sample persons from the HINTS 2005 survey. Replicate weights equal to the final weight contribute essentially zero sums of squares to the variance estimator from those replicates. For the first 50 replicate weights, only the HINTS 2003 survey contributes variance. For the remaining 50 replicate weights (NFWGT51, …, NFWGT100), we use replicate weights FWGT1, …, FWGT50 for the sample persons from the HINTS 2005 survey, and we use the final weight FWGT (for all 50 replicates) for sample persons from the HINTS 2003 survey. For replicate weights 51 through 100, only the HINTS 2005 survey contributes variance. When the sums of squares for all 100 replicates are put together, the result is the sum of the HINTS 2003 and HINTS 2005 variances, as desired (as the surveys are in fact independent).

It is also necessary to define a YEAR field equal to 2003 (or 1) for HINTS 2003 sample members, and equal to 2005 (or 2) for HINTS 2005 sample members. The Goal 1 estimate \(\hat{\Delta} = \hat{\theta}_{2} - \hat{\theta}_{1}\), with corresponding standard errors, test statistics, and confidence intervals, can be easily (and correctly) estimated from this combined data set using a contrast with the YEAR field (+1 for HINTS 2005 records and -1 for HINTS 2003 records). Appendix A provides SAS syntax for computing the new replicate weights^{9} and SUDAAN syntax for calculating the estimate of the difference^{10}. Appendix B provides corresponding STATA code^{11}.

Table 4 Construction of statistical weights for the combined data file.
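The weight construction summarized in Table 4 can be illustrated with a small Python sketch on toy data. Everything here is hypothetical: the record layout, the toy delete-a-group replicate weights, and the outcomes. The sketch also assumes the jackknife multiplier stays at 49/50 for each of the 100 combined replicates, which is what makes the combined variance equal the sum of the two yearly variances.

```python
def weighted_mean(ys, ws):
    return sum(y * w for y, w in zip(ys, ws)) / sum(ws)


def jk_variance(full, reps, multiplier):
    return multiplier * sum((r - full) ** 2 for r in reps)


def make_year(outcomes, n_rep=50):
    """Toy sample: person i is dropped (weight 0) in replicate i % n_rep."""
    people = []
    for i, y in enumerate(outcomes):
        reps = [0.0 if i % n_rep == r else n_rep / (n_rep - 1)
                for r in range(n_rep)]
        people.append({"y": y, "fwgt": 1.0, "rep": reps})
    return people


def combine(year1, year2, n_rep=50):
    """Concatenate two years and build NFWGT plus 100 replicate weights."""
    rows = []
    for year, people in ((1, year1), (2, year2)):
        for p in people:
            own, flat = p["rep"], [p["fwgt"]] * n_rep
            # Year 1 varies in replicates 1-50, year 2 in replicates 51-100.
            nrep = own + flat if year == 1 else flat + own
            rows.append({"year": year, "y": p["y"],
                         "nfwgt": p["fwgt"], "nrep": nrep})
    return rows


def year_diff(rows, weight_of):
    """Year 2 minus year 1 weighted mean (the +1 / -1 contrast on YEAR)."""
    means = {}
    for year in (1, 2):
        sub = [r for r in rows if r["year"] == year]
        means[year] = weighted_mean([r["y"] for r in sub],
                                    [weight_of(r) for r in sub])
    return means[2] - means[1]


def separate_variance(people, n_rep=50):
    ys = [p["y"] for p in people]
    full = weighted_mean(ys, [p["fwgt"] for p in people])
    reps = [weighted_mean(ys, [p["rep"][k] for p in people])
            for k in range(n_rep)]
    return jk_variance(full, reps, (n_rep - 1) / n_rep)


y2003 = [1 if i % 4 == 0 else 0 for i in range(100)]  # hypothetical outcomes
y2005 = [1 if i % 3 == 0 else 0 for i in range(100)]
rows = combine(make_year(y2003), make_year(y2005))

full_diff = year_diff(rows, lambda r: r["nfwgt"])
rep_diffs = [year_diff(rows, lambda r, k=k: r["nrep"][k]) for k in range(100)]
var_combined = jk_variance(full_diff, rep_diffs, 49 / 50)
var_sum = separate_variance(make_year(y2003)) + separate_variance(make_year(y2005))
```

On this toy data `var_combined` and `var_sum` agree up to floating-point error, which is the property the construction is designed to deliver.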
^{9}Under the title "Adjust replicate weights for the combined dataset".

5. GOAL 2–ESTIMATING CHANGES CONTROLLING FOR OTHER FACTORS

The change estimates presented in Section 3 are marginal changes: they are composites of changes in Internet usage within specified subgroups, and changes in the population shares of those subgroups. For example, suppose there is a change in Internet usage, but it arises entirely because one group with higher Internet usage is now a larger percentage of the population (no group changed its own Internet usage). In general, analysts want to be able to distinguish these compositional changes from actual trends in the characteristic of interest. In this section, we explore how to conduct analyses that search for ‘true’ non-compositional changes in HINTS responses between 2003 and 2005.

For example, Table 5-1 presents results from checking for 2003 to 2005 differences using logistic regression (with the binary dependent variable equal to 1 if the respondent ever searched the Internet for cancer information, and 0 otherwise). The beta coefficients represent effects on a log-odds^{12} scale; the estimated odds ratios (the exponentiated beta coefficients) are also given. Age, education level, and gender are also main effects in this model, so the year coefficient can be interpreted as a year-to-year change adjusting for changes in composition by age group, education level, and gender between the two years. The odds ratio for the 2005 to 2003 difference is 1.66: holding constant these other factors, the odds of ever having used the Internet to search for cancer information are 66% higher in 2005 than in 2003 (with a 95% confidence interval ranging from 48% to 87% higher). Since the 95% confidence interval for the odds ratio does not include 1, we would reject the hypothesis of no change for Goal 2 example 1.
The table shows higher odds ratios for the younger age categories compared to the oldest category (65+) and lower odds ratios for the lower education groups compared to the highest education level group (‘college graduate or more’). The SAS/SUDAAN and STATA code for carrying out this calculation is given in Appendices A and B respectively.

Table 5-1 Changes in percentages of adults who have ever looked for cancer information online between 2003 and 2005 controlling for age, education level, and gender.
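To make the log-odds scale concrete, the following Python sketch converts a beta coefficient to an odds ratio and a percent change in odds, and converts between odds and probabilities. The function names are hypothetical, and the coefficient is back-calculated from the odds ratio of 1.66 quoted in the text rather than read from the model output.

```python
import math


def odds_ratio(beta):
    """Odds ratio implied by a logistic regression coefficient."""
    return math.exp(beta)


def percent_change_in_odds(ratio):
    return (ratio - 1) * 100


def odds_from_prob(p):
    return p / (1 - p)


def prob_from_odds(odds):
    return odds / (1 + odds)


# Assumption: beta is recovered from the reported odds ratio of 1.66.
beta_year = math.log(1.66)
pct_higher = percent_change_in_odds(odds_ratio(beta_year))

# An odds ratio of 1.6 applied to a baseline probability of 1/3 (odds 1/2).
new_prob = prob_from_odds(1.6 * odds_from_prob(1 / 3))
```

The same odds ratio moves the probability by different amounts depending on the baseline, which is why the text reports changes in odds rather than in percentage points.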
To summarize, the model underlying Table 5-1 imposes a structure in which year-to-year differences affect only the intercept, and not the slopes for the other covariates. An interaction model can be used to test whether this assumption about the structure is correct. For example, there could have been more gain in ever having looked for cancer information online in the higher education groups than the lower education groups between 2003 and 2005. Table 5-2 presents the results of a model in which education level is interacted with year. The ‘Education Level 2003’ parameters represent differences between each education level and the baseline education level (‘college graduate or more’) for the baseline year 2003. These would be the estimates for the main effects for education level in a traditionally structured table (see for example Korn and Graubard [1999], Table 8.4.4) which puts main effects first. The ‘Education Level 2005 vs. 2003’ estimates are the differences in education level parameter estimates between 2003 and 2005: the interaction between year (2005 to 2003) and education level. Note that the confidence intervals for the odds ratios for the three interaction terms contain 1, which indicates that there is not a strong interaction between education and survey year in this case. More formal tests of the hypothesis of no interaction between education and survey year, such as the Wald test, are available using both SAS/SUDAAN and STATA. If the ‘Education Level 2003’ beta coefficient estimates and the ‘Education Level 2005 vs. 2003’ beta coefficient estimates are added together, the resulting sums for each education level are estimates for that education level (as against the baseline education level) for the year 2005.

Table 5-2 Changes in percentages of adults who have ever looked for cancer information online between 2003 and 2005 controlling for age, education level, and gender, with a year vs. education level interaction.
For example, the odds ratio of 1.60 for 2005 vs. 2003 should be read in this case as a ratio of odds for 2005 college graduates to 2003 college graduates (college graduates are the referent category). The corresponding 2005 to 2003 ratio for ‘some college’ is 1.6 * (1.09) ≈ 1.75, and for ‘less than high school’ it is 1.6 * (0.6) = 0.96. Table 5-2 allows one to ‘answer’ the Example 2 question under Goal 2 in Section 2.

One can also extend the interactions between education level and the other predictors by doing separate analyses using education level as a subgroup. The slope coefficients are then specific to each education level subgroup. Tables 5-3-1 through 5-3-4 present these results.

Table 5-3-1 Changes in percentages of adults who have ever looked for cancer information online between 2003 and 2005 controlling for age and gender, subsetted to the education level subgroup ‘less than high school’.
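The products quoted for the interaction model follow from adding coefficients on the log-odds scale and exponentiating. A brief Python sketch, using the rounded odds ratios given in the text (1.60, 1.09, and 0.6) as inputs; the function and dictionary names are hypothetical, and the coefficients are back-calculated from those rounded ratios.

```python
import math


def year_odds_ratio_for_level(beta_year, beta_interaction=0.0):
    """2005 vs. 2003 odds ratio for one education level.

    beta_year is the year effect for the referent level; beta_interaction is
    the year-by-education term for the level of interest (0 for the referent).
    """
    return math.exp(beta_year + beta_interaction)


# Coefficients recovered from the rounded odds ratios quoted in the text.
beta_year = math.log(1.60)
interaction_betas = {
    "some college": math.log(1.09),
    "less than high school": math.log(0.60),
}

ratios = {level: year_odds_ratio_for_level(beta_year, b)
          for level, b in interaction_betas.items()}
ratios["college graduate or more"] = year_odds_ratio_for_level(beta_year)
```

Because the inputs are rounded, the product for ‘some college’ computed this way can differ in the second decimal from a value computed from unrounded model output.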
Table 5-3-2 Changes in percentages of adults who have ever looked for cancer information online between 2003 and 2005 controlling for age and gender, subsetted to the education level subgroup ‘high school graduate’.
Table 5-3-3 Changes in percentages of adults who have ever looked for cancer information online between 2003 and 2005 controlling for age and gender, subsetted to the education level subgroup ‘some college’.
Table 5-3-4 Changes in percentages of adults who have ever looked for cancer information online between 2003 and 2005 controlling for age and gender, subsetted to the education level subgroup ‘college graduate or more’.
The survey year row of each of Tables 5-3-1 through 5-3-4 can be used to test the null hypothesis of no change in ever looking for cancer information online for the corresponding education group (Goal 2: Example 2); we reject the hypothesis at the 5% significance level if the 95% confidence interval for the odds ratio (for 2005) does not include 1. In this case, we reject the hypothesis of no change in ever looking for cancer information online for three of the four education groups (all but the ‘less than high school’ group).

In summary, the analyses shown in Tables 5-3-1 through 5-3-4 are all useful. Table 5-2 provides a more concise summary of parameter estimates than Tables 5-3-1 through 5-3-4 under stronger assumptions, which may or may not be correct. Tables 5-3-1 through 5-3-4 show different beta coefficient estimates for survey year, age, and gender for each education group, while Table 5-2 shows a single estimate for each. Appendix A has SAS/SUDAAN code for carrying out these steps (indicated by table number), and Appendix B has STATA code for carrying out these steps (also indicated by table number).

^{12}The odds of an event is the probability of the event divided by the complement of that probability, or p / (1 - p): e.g., an event probability of 1/2 corresponds to the event occurring with odds 1; an event probability of 2/3 corresponds to the event occurring with odds 2. An odds ratio of 1.6 between Events A and B means the following. Suppose Event A has an event probability of 1/3 (odds of 1/2). Then Event B will have odds 1.6 times higher, or 0.8, which corresponds to an event probability of 44.4%. If Event A has an event probability of 1/2 (odds of 1), then Event B will have odds of 1.6 (1.6 times 1), which corresponds to an event probability of 61.5%. Note also that the probability p can be computed from the odds O as p = O / (1 + O). The log-odds is the logarithm of the odds (putting the naturally multiplicative odds scale onto an additive scale).

6. GOAL 3–ESTIMATING AVERAGES BY COMBINING 2003 AND 2005 DATA

With two distinct surveys, we can report separate values for the two surveys or one value summarizing the entire time period. The one value for HINTS would be an average of the 2003 value and the 2005 value. If the distinct estimates from the two years are quite different, then reporting their average may not be a good idea, since a single average would mask two distinct values. But in those cases when estimates from the two years do not differ much, combining the data sets will allow a considerable increase in precision (the sample size is roughly twice as large). This may be very useful for population subgroups in which the one-year sample sizes are not very large.

The average of two survey years may be estimated in one of two ways: 1) using two separate data files, or 2) using the combined data file. In the first approach, we use the mean value \(\theta_{m} = 0.5 (\theta_{1} + \theta_{2})\) as the parameter of interest. Table 6-1 shows how we would compute the mean and its variance. The second method estimates the mean of the two years using the combined data with the new weights described in Section 4. The mean over the two years using these weights is implicitly estimating the parameter \(\theta_{w} = (N_{1}\theta_{1} + N_{2}\theta_{2}) / (N_{1} + N_{2})\), where \(N_{1}\) and \(N_{2}\) are the population sizes in the two surveys. When the population sizes in the two surveys are equal, the weighted mean reduces to the unweighted mean \(\theta_{m}\). Over a short period of time, the population size of most groups would change very little so that the two parameters should be similar; however, there may be subgroups increasing or decreasing in size rapidly through immigration. One advantage of using the combined data set with the new weights is that it takes into account change in population size.

Table 6-1 Summary of estimating changes using two independent surveys.
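The two averaging approaches described in this section can be sketched in Python as follows. The variance of the simple average, 0.25 times the sum of the two yearly variances, follows from the independence of the two surveys; the function names, estimates, and population sizes are hypothetical.

```python
import math


def simple_average(est1, var1, est2, var2):
    """Estimate of theta_m = 0.5 * (theta_1 + theta_2) and its variance."""
    est = 0.5 * (est1 + est2)
    var = 0.25 * (var1 + var2)  # independent surveys, so no covariance term
    return est, var


def population_weighted_average(est1, n1, est2, n2):
    """Estimate of theta_w, weighting each year by its population size."""
    return (n1 * est1 + n2 * est2) / (n1 + n2)


# Hypothetical yearly estimates and standard errors (not HINTS results).
est_m, var_m = simple_average(0.25, 0.010 ** 2, 0.32, 0.012 ** 2)
se_m = math.sqrt(var_m)

# Equal population sizes reproduce the simple average.
est_w_equal = population_weighted_average(0.25, 1000, 0.32, 1000)

# A growing population pulls the weighted average toward the later year.
est_w_growing = population_weighted_average(0.25, 1000, 0.32, 3000)
```

The standard error of the average is smaller than either yearly standard error here, which is the precision gain that motivates combining the years.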
Table 6-2 presents averages of the separate-year estimates^{13} for the percentage of adults who ever looked for cancer information online (\(\theta_{m}\)). It should be noted that the confidence intervals in Table 6-2 are computed using a symmetric t-distribution with 98 degrees of freedom^{14}.
Table 6-3 presents results for estimating \(\theta_{w}\), the weighted parameter. These calculations are all taken directly from the SAS/SUDAAN and STATA listings, and the 95% confidence intervals shown are those produced by the SAS/SUDAAN package. Note that these confidence intervals are asymmetric, as the endpoints are reverse logistic (inverse logit) transformations of symmetric confidence intervals on the logit scale. The STATA code provides similar results with slightly different degrees of freedom. Note that the STATA software provides a number of commands for confidence interval formation^{15}.

As mentioned above, between HINTS 2003 and 2005, we would not expect large differences between the estimates and confidence intervals for the two parameters, \(\theta_{m}\) and \(\theta_{w}\). Comparison of the results from Tables 6-2 and 6-3 shows this to be the case; the upper and lower bounds differ by less than one percentage point for every subgroup.

Table 6-3 Percentages of adults who have ever looked for cancer information online using the combined 2003/2005 data file.
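The asymmetry of logit-based intervals can be seen in a short Python sketch. It assumes the standard error on the logit scale is obtained by the delta method, SE(p) / (p (1 - p)); the source does not state exactly how the software forms these intervals, so this is an illustration of the idea rather than a reproduction of the package output. The function name, inputs, and the cutoff of 1.984 (approximately the 97.5th percentile of a t-distribution on 99 degrees of freedom) are hypothetical choices.

```python
import math


def logit_interval(p, se, t_crit):
    """Confidence interval for a proportion built on the logit scale.

    Assumption: delta-method standard error on the logit scale,
    se_logit = se / (p * (1 - p)).
    """
    logit = math.log(p / (1 - p))
    se_logit = se / (p * (1 - p))
    lo = logit - t_crit * se_logit
    hi = logit + t_crit * se_logit

    def expit(x):
        return 1 / (1 + math.exp(-x))

    return expit(lo), expit(hi)


# Hypothetical estimate and standard error (not HINTS results).
p_hat, se_hat = 0.285, 0.0078
lower, upper = logit_interval(p_hat, se_hat, t_crit=1.984)
```

For an estimate below 50%, the upper endpoint lies slightly farther from the estimate than the lower endpoint does, and both endpoints stay inside (0, 1).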
^{13}These separate-year estimates were computed using SAS/SUDAAN and STATA (both programs giving the same answer). The averaging was done in Excel.

7. OTHER ANALYSES

The previous sections concerned estimation and testing for a prevalence (mean) using one or two of the HINTS survey years. Although the prevalence is often the parameter of interest in public health, other characteristics such as a total may be of interest. Continuing the example considered in the first six sections, a researcher might be interested in the estimated total number of people in the population (or a subgroup) who had ever looked for cancer information using the Internet. The total number of users can be expressed as the product of the prevalence and the population size. Thus, the programs that were used to estimate prevalence can also be used to estimate the total by modification of the option statements in the program; for example, we could obtain estimates of the total in SAS/SUDAAN using PROC DESCRIPT. When using the data from two years, we need to distinguish between the total over both years (the sum of the two yearly totals) and the average total, which is half of the total over both years. The average total is more easily interpreted in most cases.

The logistic regression analyses described in this user's guide can easily be extended to ordinal logistic regression and linear regression models. In SUDAAN the appropriate command for ordinal/nominal multinomial logistic regression is PROC MULTILOG. In STATA, the corresponding command for ordered logistic regression is SVY:OLOGIT. REGRESS (SVY:REGRESS) is the proper command for linear regression in SAS/SUDAAN (STATA).

REFERENCES

Bickel, P., and Doksum, K. A. (1977). Mathematical Statistics. Oakland, CA: Holden-Day.
Cochran, W. G. (1977). Sampling Techniques, 3^{rd} ed. New York: John Wiley & Sons.
Korn, E. L., and Graubard, B. I. (1999). Analysis of Health Surveys. New York: John Wiley & Sons.
Lee, S., Davis, W. D., Nguyen, H. A., McNeel, T. S., Brick, J. M., and Flores-Cervantes, I. (2007). Examining trends and averages using combined cross-sectional survey data from multiple years. Available as a methodology paper on www.chis.ucla.edu.
Oh, H. L., and Scheuren, F. S. (1983). Weighting adjustments for unit nonresponse, in Incomplete Data in Sample Surveys, Vol. II: Theory and Annotated Bibliography (W. G. Madow, I. Olkin, and D. B. Rubin, eds.). New York: Academic Press.
Research Triangle Institute (2004). SUDAAN Example Manual: Release 9.0. Research Triangle Park, NC: Research Triangle Institute.
StataCorp (2007). Stata Statistical Software: Release 10. College Station, TX: StataCorp LP.

APPENDIX A. SAS/SUDAAN CODE FOR CARRYING OUT THE CALCULATIONS

/*HINTS Data - SAS Transport Files & Format Files*/
proc format;
value agef
value racef
value educf
value sexf
value incomef
value yesno
run;

VARIABLE RECODES

data combined;
label srvyYear="Survey Year";
/*Demographic Characteristics*/
label age='Age Group';
label race='Race/Ethnicity';
label income='Household Income';
/*InternetForCancer Recode - All Respondents*/
else if srvyYear=2 then do; **2005 Recode;
/*Adjust Replicate Weights for the combined dataset*/
else if srvyYear=2 then do; ***2005;
run;

SUDAAN COMPUTATIONS

/*SUDAAN users are given the option to select the denominator degrees of freedom within each procedure. The default degrees of freedom is not optimal for computations involving differences in percentages and averages over years using combined data sets. More precise results may be obtained by using the Welch approximation (see Appendix C). Once computed, the approximation can be entered into SUDAAN using the DDF= option. In order to mirror the STATA figures, the denominator degrees of freedom have been set to 99. */

GOAL 1–Estimating Changes Without Controlling for Other Factors. (See section 3.)
/*Test for total difference across years using combined dataset.*/
/*View percentages by specified years using combined dataset.*/
/*Test for differences across years for a subset of demographic variables using combined dataset.*/

GOAL 2–Estimating Changes Controlling for Other Factors. (See section 5.)

/*Assess differences across years while controlling for covariates (education, age, and gender) using the combined dataset. See Table 5-1.*/
/*Assess differences across years while controlling for covariates (education, age, and gender) using the combined dataset. Includes an interaction term to test for differential change by levels of education. See Table 5-2.*/
/*Assess differences across years for each level of education while controlling for age and gender.*/
proc rlogist data= combined design=jackknife ddf=99;
proc rlogist data= combined design=jackknife ddf=99;
proc rlogist data= combined design=jackknife ddf=99;

GOAL 3–Estimating Averages by Combining 2003 and 2005 Data. (See section 6.)

/*Obtain weighted percentages by demographic subgroup using combined dataset. See Table 6-3.*/

APPENDIX B. STATA CODE FOR CARRYING OUT THE CALCULATIONS

MANIPULATE 2003 DATA

log using "<insert file path name>\data step.log", replace
*** Create the demographic variables
recode spage (18/34=1 "18-34") (35/49=2 "35-49") (50/64=3 "50-64") (65/96=4 "65 +") (nonmissing=.),
recode raceethn (1=3 "Hispanic") (2=1 "NH White") (3=2 "NH Black") (4/7=4 "NH Other") (nonmissing=.),
recode hhincb (1=1 "<$25K") (2 3=2 "$25K<$50K") (4=3 "$50K<$75K") (5=4 "$75K ") (nonmissing=.),
recode educa (1=1 "Less than High School Grad") (2=2 "High School Grad") (3=3 "Some College") (4=4 "College Grad") (nonmissing=.), generate(educ)
* Create the variable internetforcancer
* Create the replicate weights for the combined data
foreach i of numlist 1/50 {
foreach i of numlist 51/100 {
save hints, replace

MANIPULATE 2005 DATA

use "<insert file path name>\hints2005.d2006_06_02.public.dta", clear
keep spgender spage raceethn hhincb educa fwgt fwgt1-fwgt50 bmi ca12wherelookcancerinfo ca08seekcancerinfo ga1useinternet ca15internetforcancer
generate srvyyear = 2
* Create the demographic variables
* Create the variable internetforcancer
* Respondents whose last search for cancer information was online
* Respondents who never looked for health information online
* Respondents who have used the internet for general health information
replace internetforcancer = 2 - ca15internetforcancer if missing(internetforcancer) & (ca15internetforcancer == 1 | ca15internetforcancer == 2)
* Create the replicate weights for the combined data
foreach i of numlist 1/50 {
foreach i of numlist 51/100 {

COMBINE 2003 and 2005 DATASETS

append using hints

STATA COMPUTATIONS

* In Stata 10, the user cannot specify the design degrees of freedom.

GOAL 1–Estimating Changes Without Controlling for Other Factors. (See section 3.)
*** The following code recreates the yearly percentages, differences, standard errors, and two-sided p-values.

* Test for differences across years using combined data - by age group
* among those with age 18-34
* among those with age 35-49
* among those with age 50-64
* among those with age 65+

* Test for differences across years using combined data - by education group
* among those less than high school
* among those high school graduate
* among those some college
* among those college graduate

* Test for differences across years using combined data - by sex
* among males
* among females

* Test for differences across years using combined data - by income group
* among those < $25K
* among those $25K < $50K
* among those $50K < $75K
* among those $75K+

* Test for differences across years using combined data - by race group
* among NH white
* among NH black
* among Hispanic
* among NH other

GOAL 2—Estimating Changes Controlling for Other Factors. (See section 5.)

*** Logistic Regression – adjusted by education, age and sex. (Table 5-1)
*** Logistic Regression – adjusted by education, age, sex and i.srvyyear*i.educ. (Table 5-2)
*** Logistic Regression – adjusted by age and sex, stratified by education. (Table 5-3)
xi: svy, or subpop(selectedgroup): logit internetforcancer i.srvyyear i.age i.sex
* among those high school graduate
* among those some college
* among those college graduate

GOAL 3—Estimating Averages by Combining 2003 and 2005 Data. (See section 6.)

*** Obtain weighted percentages using combined dataset. (Table 6-3)
* Estimate using the combined data by education group
* Estimate using the combined data by sex group
* Estimate using the combined data by income group
* Estimate using the combined data by race group

APPENDIX C.
COMPUTING DEGREES OF FREEDOM

For purposes of computing appropriate degrees of freedom for the estimator of the difference between HINTS 2003 and HINTS 2005 (and for combinations in general, such as averages across years), we can assume as an approximation that both samples are simple random samples of size 50 (corresponding to the 50 replicates: each replicate provides a ‘pseudo sample unit’) from a normal distribution^{16}. We have independent estimates $\hat{\theta}_1$ and $\hat{\theta}_2$ with means $\theta_1$ and $\theta_2$ and variances $\mathrm{Var}(\hat{\theta}_1)$ and $\mathrm{Var}(\hat{\theta}_2)$. The estimator of the difference $\Delta = \theta_2 - \theta_1$ is $\hat{\Delta} = \hat{\theta}_2 - \hat{\theta}_1$, with estimator of variance $v(\hat{\Delta}) = v(\hat{\theta}_1) + v(\hat{\theta}_2)$. The variance estimators $v(\hat{\theta}_1)$ and $v(\hat{\theta}_2)$ have $n_1 - 1$ and $n_2 - 1$ degrees of freedom respectively^{17}, where $n_1$ and $n_2$ are the number of replicates for year 1 and year 2 respectively. The estimating equation referred to the t-distribution in this case is

$$t = \frac{\hat{\Delta} - \Delta}{\sqrt{v(\hat{\Delta})}}.$$

The method for computing the degrees of freedom of the difference of normally distributed simple random sample estimators with unequal variances from independent surveys is taken from Bickel and Doksum (1977). Section 6.4C recommends the Welch approximation, which computes the degrees of freedom $k$ for the estimating equation as

$$k = \left[ \frac{c^2}{n_1 - 1} + \frac{(1 - c)^2}{n_2 - 1} \right]^{-1}, \qquad c = \frac{v(\hat{\theta}_1)}{v(\hat{\theta}_1) + v(\hat{\theta}_2)}.$$

In our application $n_1$ and $n_2$ are both 50. If $v(\hat{\theta}_1)$ and $v(\hat{\theta}_2)$ are also equal, then $c = 1/2$ and

$$k = \left[ \frac{1/4}{49} + \frac{1/4}{49} \right]^{-1} = 98.$$

That is the maximum value of $k$. If $v(\hat{\theta}_1)$ is much smaller, or much larger, than $v(\hat{\theta}_2)$, then $c$ approaches 0 or 1 and $k$ approaches 49, its minimum value. Thus 49 is the ‘conservative’ approximation for the degrees of freedom: it gives the widest confidence intervals (using the t-distribution on 49 degrees of freedom). If $v(\hat{\theta}_1)$ and $v(\hat{\theta}_2)$ are unequal but of the same order of magnitude, then Welch’s approximation can be used to generate an appropriate $k$, which will be in the range [49, 98].
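The Welch calculation described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `welch_df` and the example variance values are hypothetical and do not come from the HINTS code or data; the defaults of 50 replicates per year follow the text.

```python
def welch_df(v1: float, v2: float, n1: int = 50, n2: int = 50) -> float:
    """Welch approximation to the degrees of freedom of v1 + v2.

    v1, v2 : jackknife variance estimates for year 1 and year 2
             (illustrative inputs, not HINTS results).
    n1, n2 : number of replicates per year (50 each in HINTS 2003 and 2005).
    """
    if v1 < 0 or v2 < 0 or (v1 + v2) == 0:
        raise ValueError("variance estimates must be non-negative and not both zero")
    # Share of the total estimated variance contributed by year 1.
    c = v1 / (v1 + v2)
    # k = [c^2 / (n1 - 1) + (1 - c)^2 / (n2 - 1)]^(-1)
    return 1.0 / (c ** 2 / (n1 - 1) + (1.0 - c) ** 2 / (n2 - 1))


if __name__ == "__main__":
    # Hypothetical variance pairs: equal, moderately unequal, very unequal.
    for v1, v2 in [(4.0, 4.0), (1.0, 4.0), (1e-6, 4.0)]:
        print(f"v1={v1:g}, v2={v2:g}: k = {welch_df(v1, v2):.2f}")
```

With equal variances the function returns the maximum of 98, and as one variance dominates the other the result falls toward the conservative value of 49, matching the range given in the text.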
^{16}The pseudo-values may not necessarily have a normal distribution: it is good practice to check this assumption and make sure there is no excess kurtosis that may reduce the effective degrees of freedom.

CANCER INFORMATION AND RESOURCES

PATIENT-ORIENTED INFORMATION
NCI’s Cancer Information Service (CIS)
Other NCI or DHHS Sources of Cancer Information
American Cancer Society (ACS)

FEDERALLY SPONSORED PROGRAM PLANNING RESOURCES
Cancer Control P.L.A.N.E.T.
Research-tested Intervention Programs (RTIPs)
Guide to Community Preventive Services

RESEARCH TOOLS AND RESOURCES
Behavioral Risk Factor Surveillance System (BRFSS)
National Health Interview Survey (NHIS)
Current Population Survey (CPS)
Surveillance, Epidemiology, and End Results (SEER)
Pew Internet and American Life Project
Health Information National Trends Survey (HINTS)

U.S. Department of Health and Human Services (DHHS)
NIH Publication No. 08-6435
