Lately this blog has had a lot of descriptive posts about my life but not much that stretches my R abilities. Later this year my organization is holding its research day, so I decided to start practicing on a dataset that I last used about 7 years ago. The California Health Interview Survey is a survey of households that is weighted to be representative of the non-institutionalized population of California. This dataset has a few wrinkles in it that require a higher level of R programming so I thought I would try to stretch my rusty skills a little.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(survey)
## Loading required package: grid
## Loading required package: Matrix
## Loading required package: survival
##
## Attaching package: 'survey'
## The following object is masked from 'package:graphics':
##
## dotchart
library(foreign)
Fortunately when I last used this dataset in 2013, I had commented my code pretty well so I was able to remember how to load in the Stata file.
chis <- read.dta("/Users/mching/Dropbox/Mike/CHIS18/CHILD.dta",
convert.dates = TRUE, convert.factors = TRUE,
missing.type = TRUE, convert.underscore = TRUE,
warn.missing.labels = FALSE)
Unweighted Data
Here’s what the data dictionary said about a variable documenting children who needed dental care in the last year.
| Frequency | % | ||
|---|---|---|---|
| -1 | Inapplicable | 58 | 3.66 |
| 1 | Yes | 67 | 4.22 |
| 2 | No | 1461 | 92.12 |
And here’s what the dataset I loaded had to say about this variable.
table(chis$cb27)
##
## NOT ASCERTAINED DON'T KNOW
## 0 0
## REFUSED ADULT/HOUSEHOLD INFO NOT COLLECTED
## 0 0
## PROXY SKIPPED INAPPLICABLE
## 0 58
## YES NO
## 67 1461
round(prop.table(table(chis$cb27))*100, 2)
##
## NOT ASCERTAINED DON'T KNOW
## 0.00 0.00
## REFUSED ADULT/HOUSEHOLD INFO NOT COLLECTED
## 0.00 0.00
## PROXY SKIPPED INAPPLICABLE
## 0.00 3.66
## YES NO
## 4.22 92.12
As you can see from above, the unweighted data were the same as in the data dictionary.
Weighted Data
In the survey package you have to create a survey object that can be used to run the survey methods. Code for creating the survey object was nodified from AJ Damico’s Analyze Survey Data for Free website.
chis_design <-
svrepdesign(
data = chis ,
weights = ~ rakedw0 ,
repweights = "rakedw[1-9]" ,
type = "other" ,
scale = 1 ,
rscales = 1 ,
mse = TRUE
)
I generated a quick table to see if the survey method for calculating the population estimates worked.
round(svytable(~cb27, chis_design), -3)
## cb27
## NOT ASCERTAINED DON'T KNOW
## 0 0
## REFUSED ADULT/HOUSEHOLD INFO NOT COLLECTED
## 0 0
## PROXY SKIPPED INAPPLICABLE
## 0 321000
## YES NO
## 247000 5477000
I checked it against AskCHIS, the web interface to the CHIS dataset. Here’s what it returned for this variable:
| Needed Dental Care | Percentage | Confidence Interval (95%) | Population |
|---|---|---|---|
| Needed dental care | 4.3% | 2.5 - 6.2 | 247,000 |
| Did not need dental care | 95.7% | 93.8 - 97.5 | 5,477,000 |
| Total | 100.0% | - | 5,724,000 |
The estimated population number of children who needed/didn’t neet care was exactly the same as what I calculated. The ratios were not the same because of all the inapplicable results.
svymean(~cb27, chis_design)
## mean SE
## cb27NOT ASCERTAINED 0.000000 0.0000
## cb27DON'T KNOW 0.000000 0.0000
## cb27REFUSED 0.000000 0.0000
## cb27ADULT/HOUSEHOLD INFO NOT COLLECTED 0.000000 0.0000
## cb27PROXY SKIPPED 0.000000 0.0000
## cb27INAPPLICABLE 0.053042 0.0101
## cb27YES 0.040885 0.0088
## cb27NO 0.906073 0.0141
Getting the weighted frequencies on just the yes or no responses was tricky. It required creating a variable for the Inapplicable patients and then creating a mean based on that.
chis$cb27_inapplicable <- ifelse(chis$cb27 == "INAPPLICABLE", T, F)
chis_design <-
svrepdesign(
data = chis ,
weights = ~ rakedw0 ,
repweights = "rakedw[1-9]" ,
type = "other" ,
scale = 1 ,
rscales = 1 ,
mse = TRUE
)
svyby(~cb27, ~cb27_inapplicable, svymean, design = chis_design)
## cb27_inapplicable cb27NOT ASCERTAINED cb27DON'T KNOW cb27REFUSED
## FALSE FALSE 0 0 0
## TRUE TRUE 0 0 0
## cb27ADULT/HOUSEHOLD INFO NOT COLLECTED cb27PROXY SKIPPED cb27INAPPLICABLE
## FALSE 0 0 0
## TRUE 0 0 1
## cb27YES cb27NO se1 se2 se3 se4 se5 se6 se7 se8
## FALSE 0.04317531 0.9568247 0 0 0 0 0 0 0.009349207 0.009349207
## TRUE 0.00000000 0.0000000 0 0 0 0 0 0 0.000000000 0.000000000
The weighted means were the same as what AskCHIS said (yes = 4.3%, no = 95.7%). The confidence intervals were very slightly off from AskCHIS (yes = 2.5-6.2, no = 93.8-97.5) but I figured they were close enough to not matter too much.
confint(svyby(~cb27, ~cb27_inapplicable, svymean, design = chis_design))
## 2.5 % 97.5 %
## FALSE:cb27NOT ASCERTAINED 0.0000000 0.00000000
## TRUE:cb27NOT ASCERTAINED 0.0000000 0.00000000
## FALSE:cb27DON'T KNOW 0.0000000 0.00000000
## TRUE:cb27DON'T KNOW 0.0000000 0.00000000
## FALSE:cb27REFUSED 0.0000000 0.00000000
## TRUE:cb27REFUSED 0.0000000 0.00000000
## FALSE:cb27ADULT/HOUSEHOLD INFO NOT COLLECTED 0.0000000 0.00000000
## TRUE:cb27ADULT/HOUSEHOLD INFO NOT COLLECTED 0.0000000 0.00000000
## FALSE:cb27PROXY SKIPPED 0.0000000 0.00000000
## TRUE:cb27PROXY SKIPPED 0.0000000 0.00000000
## FALSE:cb27INAPPLICABLE 0.0000000 0.00000000
## TRUE:cb27INAPPLICABLE 1.0000000 1.00000000
## FALSE:cb27YES 0.0248512 0.06149942
## TRUE:cb27YES 0.0000000 0.00000000
## FALSE:cb27NO 0.9385006 0.97514880
## TRUE:cb27NO 0.0000000 0.00000000
I decided to look at one other variable, reading to the child (variable cg14).
chis$cg14 <- droplevels(chis$cg14)
table(chis$cg14)
##
## INAPPLICABLE EVERY DAY 3-6 DAYS 1-2 DAYS NEVER
## 934 460 126 50 16
I used a different method to subset this time, dropping all rows where cg14 was inapplicable.
chis_design <-
svrepdesign(
data = chis ,
weights = ~ rakedw0 ,
repweights = "rakedw[1-9]" ,
type = "other" ,
scale = 1 ,
rscales = 1 ,
mse = TRUE
)
chis_reading <- subset(chis_design, chis$cg14 != "INAPPLICABLE")
This resulted in nicer to read output.
round(svytotal(~cg14, chis_reading), -3)
## total SE
## cg14INAPPLICABLE 0 0
## cg14EVERY DAY 1852000 110046
## cg143-6 DAYS 559000 91874
## cg141-2 DAYS 376000 77698
## cg14NEVER 144000 45296
round(svymean(~cg14, chis_reading) * 100, 1)
## mean SE
## cg14INAPPLICABLE 0.0 0.0000
## cg14EVERY DAY 63.2 0.0356
## cg143-6 DAYS 19.1 0.0310
## cg141-2 DAYS 12.8 0.0267
## cg14NEVER 4.9 0.0155
round(confint(svymean(~cg14, chis_reading)) * 100, 1)
## 2.5 % 97.5 %
## cg14INAPPLICABLE 0.0 0.0
## cg14EVERY DAY 56.2 70.2
## cg143-6 DAYS 13.0 25.1
## cg141-2 DAYS 7.6 18.1
## cg14NEVER 1.9 7.9
I confirmed that the point estimates matched the data from AskCHIS, but again the confidence intervals were off, this time by a bit more. With the rounding, it was more apparent that the confidence intervals calculated by this method were slightly tighter than in AskCHIS.
| Days per week reading books with child (0-5 years) | Percent | 95% c.i. | Population |
| Every day | 63.2% | ( 56.1 - 70.3 ) | 1,852,000 |
| 3 to 6 days of the week | 19.1% | ( 12.9 - 25.2 ) | 559,000 |
| 1 to 2 days of the week | 12.8% | ( 7.5 - 18.1 ) | 376,000 |
| Never | 4.9% * | ( 1.8 - 8.0 ) | 144,000 |
* unstable
I read a little more and it looks like the author of the survey package created another function to calculate confidence intervals for survey objects. These were a lot closer even though the documentation says this “mean” method is the same as calling confint(svymean()).
cimethod = "mean"
svyciprop(~I(cg14 == "EVERY DAY"), design = chis_reading, method = cimethod, level = 0.95)
## 2.5% 97.5%
## I(cg14 == "EVERY DAY") 0.632 0.561 0.7
svyciprop(~I(cg14 == "3-6 DAYS"), design = chis_reading, method = cimethod, level = 0.95)
## 2.5% 97.5%
## I(cg14 == "3-6 DAYS") 0.191 0.129 0.25
svyciprop(~I(cg14 == "1-2 DAYS"), design = chis_reading, method = cimethod, level = 0.95)
## 2.5% 97.5%
## I(cg14 == "1-2 DAYS") 0.128 0.075 0.18
svyciprop(~I(cg14 == "NEVER"), design = chis_reading, method = cimethod, level = 0.95)
## 2.5% 97.5%
## I(cg14 == "NEVER") 0.0490 0.0181 0.08
Discussion
I was able to load the CHIS 2018 child dataset and confirm that I could use the survey methods to replicate the results from the official CHIS web interface. I learned that calculating proportion confidence intervals may be more accurate using the survey confidence interval function svycipropthan the usual confint method.
I didn’t do too much with the data, but next time I will pick up with doing some plots and regressions.