Lately this blog has had a lot of descriptive posts about my life but not much that stretches my R abilities. Later this year my organization is holding its research day, so I decided to start practicing on a dataset that I last used about 7 years ago. The California Health Interview Survey is a survey of households that is weighted to be representative of the non-institutionalized population of California. This dataset has a few wrinkles in it that require a higher level of R programming so I thought I would try to stretch my rusty skills a little.

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(survey)

## Loading required package: grid

## Loading required package: Matrix

## Loading required package: survival

## 
## Attaching package: 'survey'

## The following object is masked from 'package:graphics':
## 
##     dotchart

library(foreign)

Fortunately when I last used this dataset in 2013, I had commented my code pretty well so I was able to remember how to load in the Stata file.

chis <- read.dta("/Users/mching/Dropbox/Mike/CHIS18/CHILD.dta", 
                 convert.dates = TRUE, convert.factors = TRUE, 
                 missing.type = TRUE, convert.underscore = TRUE, 
                 warn.missing.labels = FALSE)

Unweighted Data

Here’s what the data dictionary said about a variable documenting children who needed dental care in the last year.

		Frequency	%
-1	Inapplicable	58	3.66
1	Yes	67	4.22
2	No	1461	92.12

And here’s what the dataset I loaded had to say about this variable.

table(chis$cb27)

## 
##                    NOT ASCERTAINED                         DON'T KNOW 
##                                  0                                  0 
##                            REFUSED ADULT/HOUSEHOLD INFO NOT COLLECTED 
##                                  0                                  0 
##                      PROXY SKIPPED                       INAPPLICABLE 
##                                  0                                 58 
##                                YES                                 NO 
##                                 67                               1461

round(prop.table(table(chis$cb27))*100, 2)

## 
##                    NOT ASCERTAINED                         DON'T KNOW 
##                               0.00                               0.00 
##                            REFUSED ADULT/HOUSEHOLD INFO NOT COLLECTED 
##                               0.00                               0.00 
##                      PROXY SKIPPED                       INAPPLICABLE 
##                               0.00                               3.66 
##                                YES                                 NO 
##                               4.22                              92.12

As you can see from above, the unweighted data were the same as in the data dictionary.

Weighted Data

In the survey package you have to create a survey object that can be used to run the survey methods. Code for creating the survey object was nodified from AJ Damico’s Analyze Survey Data for Free website.

chis_design <- 
    svrepdesign( 
        data = chis , 
        weights = ~ rakedw0 , 
        repweights = "rakedw[1-9]" , 
        type = "other" , 
        scale = 1 , 
        rscales = 1 , 
        mse = TRUE 
    )

I generated a quick table to see if the survey method for calculating the population estimates worked.

round(svytable(~cb27, chis_design), -3)

## cb27
##                    NOT ASCERTAINED                         DON'T KNOW 
##                                  0                                  0 
##                            REFUSED ADULT/HOUSEHOLD INFO NOT COLLECTED 
##                                  0                                  0 
##                      PROXY SKIPPED                       INAPPLICABLE 
##                                  0                             321000 
##                                YES                                 NO 
##                             247000                            5477000

I checked it against AskCHIS, the web interface to the CHIS dataset. Here’s what it returned for this variable:

Needed Dental Care	Percentage	Confidence Interval (95%)	Population
Needed dental care	4.3%	2.5 - 6.2	247,000
Did not need dental care	95.7%	93.8 - 97.5	5,477,000
Total	100.0%	-	5,724,000

The estimated population number of children who needed/didn’t neet care was exactly the same as what I calculated. The ratios were not the same because of all the inapplicable results.

svymean(~cb27, chis_design)

##                                            mean     SE
## cb27NOT ASCERTAINED                    0.000000 0.0000
## cb27DON'T KNOW                         0.000000 0.0000
## cb27REFUSED                            0.000000 0.0000
## cb27ADULT/HOUSEHOLD INFO NOT COLLECTED 0.000000 0.0000
## cb27PROXY SKIPPED                      0.000000 0.0000
## cb27INAPPLICABLE                       0.053042 0.0101
## cb27YES                                0.040885 0.0088
## cb27NO                                 0.906073 0.0141

Getting the weighted frequencies on just the yes or no responses was tricky. It required creating a variable for the Inapplicable patients and then creating a mean based on that.

chis$cb27_inapplicable <- ifelse(chis$cb27 == "INAPPLICABLE", T, F)

chis_design <- 
    svrepdesign( 
        data = chis , 
        weights = ~ rakedw0 , 
        repweights = "rakedw[1-9]" , 
        type = "other" , 
        scale = 1 , 
        rscales = 1 , 
        mse = TRUE 
    )

svyby(~cb27, ~cb27_inapplicable, svymean, design = chis_design)

##       cb27_inapplicable cb27NOT ASCERTAINED cb27DON'T KNOW cb27REFUSED
## FALSE             FALSE                   0              0           0
## TRUE               TRUE                   0              0           0
##       cb27ADULT/HOUSEHOLD INFO NOT COLLECTED cb27PROXY SKIPPED cb27INAPPLICABLE
## FALSE                                      0                 0                0
## TRUE                                       0                 0                1
##          cb27YES    cb27NO se1 se2 se3 se4 se5 se6         se7         se8
## FALSE 0.04317531 0.9568247   0   0   0   0   0   0 0.009349207 0.009349207
## TRUE  0.00000000 0.0000000   0   0   0   0   0   0 0.000000000 0.000000000

The weighted means were the same as what AskCHIS said (yes = 4.3%, no = 95.7%). The confidence intervals were very slightly off from AskCHIS (yes = 2.5-6.2, no = 93.8-97.5) but I figured they were close enough to not matter too much.

confint(svyby(~cb27, ~cb27_inapplicable, svymean, design = chis_design))

##                                                  2.5 %     97.5 %
## FALSE:cb27NOT ASCERTAINED                    0.0000000 0.00000000
## TRUE:cb27NOT ASCERTAINED                     0.0000000 0.00000000
## FALSE:cb27DON'T KNOW                         0.0000000 0.00000000
## TRUE:cb27DON'T KNOW                          0.0000000 0.00000000
## FALSE:cb27REFUSED                            0.0000000 0.00000000
## TRUE:cb27REFUSED                             0.0000000 0.00000000
## FALSE:cb27ADULT/HOUSEHOLD INFO NOT COLLECTED 0.0000000 0.00000000
## TRUE:cb27ADULT/HOUSEHOLD INFO NOT COLLECTED  0.0000000 0.00000000
## FALSE:cb27PROXY SKIPPED                      0.0000000 0.00000000
## TRUE:cb27PROXY SKIPPED                       0.0000000 0.00000000
## FALSE:cb27INAPPLICABLE                       0.0000000 0.00000000
## TRUE:cb27INAPPLICABLE                        1.0000000 1.00000000
## FALSE:cb27YES                                0.0248512 0.06149942
## TRUE:cb27YES                                 0.0000000 0.00000000
## FALSE:cb27NO                                 0.9385006 0.97514880
## TRUE:cb27NO                                  0.0000000 0.00000000

I decided to look at one other variable, reading to the child (variable cg14).

chis$cg14 <- droplevels(chis$cg14)

table(chis$cg14)

## 
## INAPPLICABLE    EVERY DAY     3-6 DAYS     1-2 DAYS        NEVER 
##          934          460          126           50           16

I used a different method to subset this time, dropping all rows where cg14 was inapplicable.

chis_design <- 
    svrepdesign( 
        data = chis , 
        weights = ~ rakedw0 , 
        repweights = "rakedw[1-9]" , 
        type = "other" , 
        scale = 1 , 
        rscales = 1 , 
        mse = TRUE 
    )

chis_reading <- subset(chis_design, chis$cg14 != "INAPPLICABLE")

This resulted in nicer to read output.

round(svytotal(~cg14, chis_reading), -3)

##                    total     SE
## cg14INAPPLICABLE       0      0
## cg14EVERY DAY    1852000 110046
## cg143-6 DAYS      559000  91874
## cg141-2 DAYS      376000  77698
## cg14NEVER         144000  45296

round(svymean(~cg14, chis_reading) * 100, 1)

##                  mean     SE
## cg14INAPPLICABLE  0.0 0.0000
## cg14EVERY DAY    63.2 0.0356
## cg143-6 DAYS     19.1 0.0310
## cg141-2 DAYS     12.8 0.0267
## cg14NEVER         4.9 0.0155

round(confint(svymean(~cg14, chis_reading)) * 100, 1)

##                  2.5 % 97.5 %
## cg14INAPPLICABLE   0.0    0.0
## cg14EVERY DAY     56.2   70.2
## cg143-6 DAYS      13.0   25.1
## cg141-2 DAYS       7.6   18.1
## cg14NEVER          1.9    7.9

I confirmed that the point estimates matched the data from AskCHIS, but again the confidence intervals were off, this time by a bit more. With the rounding, it was more apparent that the confidence intervals calculated by this method were slightly tighter than in AskCHIS.

Days per week reading books with child (0-5 years)	Percent	95% c.i.	Population
Every day	63.2%	( 56.1 - 70.3 )	1,852,000
3 to 6 days of the week	19.1%	( 12.9 - 25.2 )	559,000
1 to 2 days of the week	12.8%	( 7.5 - 18.1 )	376,000
Never	4.9% *	( 1.8 - 8.0 )	144,000

* unstable

I read a little more and it looks like the author of the survey package created another function to calculate confidence intervals for survey objects. These were a lot closer even though the documentation says this “mean” method is the same as calling confint(svymean()).

cimethod = "mean"
svyciprop(~I(cg14 == "EVERY DAY"), design = chis_reading, method = cimethod, level = 0.95)

##                               2.5% 97.5%
## I(cg14 == "EVERY DAY") 0.632 0.561   0.7

svyciprop(~I(cg14 == "3-6 DAYS"), design = chis_reading, method = cimethod, level = 0.95)

##                              2.5% 97.5%
## I(cg14 == "3-6 DAYS") 0.191 0.129  0.25

svyciprop(~I(cg14 == "1-2 DAYS"), design = chis_reading, method = cimethod, level = 0.95)

##                              2.5% 97.5%
## I(cg14 == "1-2 DAYS") 0.128 0.075  0.18

svyciprop(~I(cg14 == "NEVER"), design = chis_reading, method = cimethod, level = 0.95)

##                             2.5% 97.5%
## I(cg14 == "NEVER") 0.0490 0.0181  0.08

Discussion

I was able to load the CHIS 2018 child dataset and confirm that I could use the survey methods to replicate the results from the official CHIS web interface. I learned that calculating proportion confidence intervals may be more accurate using the survey confidence interval function svycipropthan the usual confint method.

I didn’t do too much with the data, but next time I will pick up with doing some plots and regressions.

health