Lately this blog has had a lot of descriptive posts about my life but not much that stretches my R abilities. Later this year my organization is holding its research day, so I decided to start practicing on a dataset that I last used about 7 years ago. The California Health Interview Survey is a survey of households that is weighted to be representative of the non-institutionalized population of California. This dataset has a few wrinkles in it that require a higher level of R programming so I thought I would try to stretch my rusty skills a little.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(survey)
## Loading required package: grid
## Loading required package: Matrix
## Loading required package: survival
##
## Attaching package: 'survey'
## The following object is masked from 'package:graphics':
##
## dotchart
library(foreign)
Fortunately when I last used this dataset in 2013, I had commented my code pretty well so I was able to remember how to load in the Stata file.
chis <- read.dta("/Users/mching/Dropbox/Mike/CHIS18/CHILD.dta",
convert.dates = TRUE, convert.factors = TRUE,
missing.type = TRUE, convert.underscore = TRUE,
warn.missing.labels = FALSE)
Unweighted Data
Here’s what the data dictionary said about a variable documenting children who needed dental care in the last year.
Frequency | % | ||
---|---|---|---|
-1 | Inapplicable | 58 | 3.66 |
1 | Yes | 67 | 4.22 |
2 | No | 1461 | 92.12 |
And here’s what the dataset I loaded had to say about this variable.
table(chis$cb27)
##
## NOT ASCERTAINED DON'T KNOW
## 0 0
## REFUSED ADULT/HOUSEHOLD INFO NOT COLLECTED
## 0 0
## PROXY SKIPPED INAPPLICABLE
## 0 58
## YES NO
## 67 1461
round(prop.table(table(chis$cb27))*100, 2)
##
## NOT ASCERTAINED DON'T KNOW
## 0.00 0.00
## REFUSED ADULT/HOUSEHOLD INFO NOT COLLECTED
## 0.00 0.00
## PROXY SKIPPED INAPPLICABLE
## 0.00 3.66
## YES NO
## 4.22 92.12
As you can see from above, the unweighted data were the same as in the data dictionary.
Weighted Data
In the survey package you have to create a survey object that can be used to run the survey methods. Code for creating the survey object was nodified from AJ Damico’s Analyze Survey Data for Free website.
chis_design <-
svrepdesign(
data = chis ,
weights = ~ rakedw0 ,
repweights = "rakedw[1-9]" ,
type = "other" ,
scale = 1 ,
rscales = 1 ,
mse = TRUE
)
I generated a quick table to see if the survey method for calculating the population estimates worked.
round(svytable(~cb27, chis_design), -3)
## cb27
## NOT ASCERTAINED DON'T KNOW
## 0 0
## REFUSED ADULT/HOUSEHOLD INFO NOT COLLECTED
## 0 0
## PROXY SKIPPED INAPPLICABLE
## 0 321000
## YES NO
## 247000 5477000
I checked it against AskCHIS, the web interface to the CHIS dataset. Here’s what it returned for this variable:
Needed Dental Care | Percentage | Confidence Interval (95%) | Population |
---|---|---|---|
Needed dental care | 4.3% | 2.5 - 6.2 | 247,000 |
Did not need dental care | 95.7% | 93.8 - 97.5 | 5,477,000 |
Total | 100.0% | - | 5,724,000 |
The estimated population number of children who needed/didn’t neet care was exactly the same as what I calculated. The ratios were not the same because of all the inapplicable results.
svymean(~cb27, chis_design)
## mean SE
## cb27NOT ASCERTAINED 0.000000 0.0000
## cb27DON'T KNOW 0.000000 0.0000
## cb27REFUSED 0.000000 0.0000
## cb27ADULT/HOUSEHOLD INFO NOT COLLECTED 0.000000 0.0000
## cb27PROXY SKIPPED 0.000000 0.0000
## cb27INAPPLICABLE 0.053042 0.0101
## cb27YES 0.040885 0.0088
## cb27NO 0.906073 0.0141
Getting the weighted frequencies on just the yes or no responses was tricky. It required creating a variable for the Inapplicable patients and then creating a mean based on that.
chis$cb27_inapplicable <- ifelse(chis$cb27 == "INAPPLICABLE", T, F)
chis_design <-
svrepdesign(
data = chis ,
weights = ~ rakedw0 ,
repweights = "rakedw[1-9]" ,
type = "other" ,
scale = 1 ,
rscales = 1 ,
mse = TRUE
)
svyby(~cb27, ~cb27_inapplicable, svymean, design = chis_design)
## cb27_inapplicable cb27NOT ASCERTAINED cb27DON'T KNOW cb27REFUSED
## FALSE FALSE 0 0 0
## TRUE TRUE 0 0 0
## cb27ADULT/HOUSEHOLD INFO NOT COLLECTED cb27PROXY SKIPPED cb27INAPPLICABLE
## FALSE 0 0 0
## TRUE 0 0 1
## cb27YES cb27NO se1 se2 se3 se4 se5 se6 se7 se8
## FALSE 0.04317531 0.9568247 0 0 0 0 0 0 0.009349207 0.009349207
## TRUE 0.00000000 0.0000000 0 0 0 0 0 0 0.000000000 0.000000000
The weighted means were the same as what AskCHIS said (yes = 4.3%, no = 95.7%). The confidence intervals were very slightly off from AskCHIS (yes = 2.5-6.2, no = 93.8-97.5) but I figured they were close enough to not matter too much.
confint(svyby(~cb27, ~cb27_inapplicable, svymean, design = chis_design))
## 2.5 % 97.5 %
## FALSE:cb27NOT ASCERTAINED 0.0000000 0.00000000
## TRUE:cb27NOT ASCERTAINED 0.0000000 0.00000000
## FALSE:cb27DON'T KNOW 0.0000000 0.00000000
## TRUE:cb27DON'T KNOW 0.0000000 0.00000000
## FALSE:cb27REFUSED 0.0000000 0.00000000
## TRUE:cb27REFUSED 0.0000000 0.00000000
## FALSE:cb27ADULT/HOUSEHOLD INFO NOT COLLECTED 0.0000000 0.00000000
## TRUE:cb27ADULT/HOUSEHOLD INFO NOT COLLECTED 0.0000000 0.00000000
## FALSE:cb27PROXY SKIPPED 0.0000000 0.00000000
## TRUE:cb27PROXY SKIPPED 0.0000000 0.00000000
## FALSE:cb27INAPPLICABLE 0.0000000 0.00000000
## TRUE:cb27INAPPLICABLE 1.0000000 1.00000000
## FALSE:cb27YES 0.0248512 0.06149942
## TRUE:cb27YES 0.0000000 0.00000000
## FALSE:cb27NO 0.9385006 0.97514880
## TRUE:cb27NO 0.0000000 0.00000000
I decided to look at one other variable, reading to the child (variable cg14).
chis$cg14 <- droplevels(chis$cg14)
table(chis$cg14)
##
## INAPPLICABLE EVERY DAY 3-6 DAYS 1-2 DAYS NEVER
## 934 460 126 50 16
I used a different method to subset this time, dropping all rows where cg14 was inapplicable.
chis_design <-
svrepdesign(
data = chis ,
weights = ~ rakedw0 ,
repweights = "rakedw[1-9]" ,
type = "other" ,
scale = 1 ,
rscales = 1 ,
mse = TRUE
)
chis_reading <- subset(chis_design, chis$cg14 != "INAPPLICABLE")
This resulted in nicer to read output.
round(svytotal(~cg14, chis_reading), -3)
## total SE
## cg14INAPPLICABLE 0 0
## cg14EVERY DAY 1852000 110046
## cg143-6 DAYS 559000 91874
## cg141-2 DAYS 376000 77698
## cg14NEVER 144000 45296
round(svymean(~cg14, chis_reading) * 100, 1)
## mean SE
## cg14INAPPLICABLE 0.0 0.0000
## cg14EVERY DAY 63.2 0.0356
## cg143-6 DAYS 19.1 0.0310
## cg141-2 DAYS 12.8 0.0267
## cg14NEVER 4.9 0.0155
round(confint(svymean(~cg14, chis_reading)) * 100, 1)
## 2.5 % 97.5 %
## cg14INAPPLICABLE 0.0 0.0
## cg14EVERY DAY 56.2 70.2
## cg143-6 DAYS 13.0 25.1
## cg141-2 DAYS 7.6 18.1
## cg14NEVER 1.9 7.9
I confirmed that the point estimates matched the data from AskCHIS, but again the confidence intervals were off, this time by a bit more. With the rounding, it was more apparent that the confidence intervals calculated by this method were slightly tighter than in AskCHIS.
Days per week reading books with child (0-5 years) | Percent | 95% c.i. | Population |
Every day | 63.2% | ( 56.1 - 70.3 ) | 1,852,000 |
3 to 6 days of the week | 19.1% | ( 12.9 - 25.2 ) | 559,000 |
1 to 2 days of the week | 12.8% | ( 7.5 - 18.1 ) | 376,000 |
Never | 4.9% * | ( 1.8 - 8.0 ) | 144,000 |
* unstable
I read a little more and it looks like the author of the survey package created another function to calculate confidence intervals for survey objects. These were a lot closer even though the documentation says this “mean” method is the same as calling confint(svymean())
.
cimethod = "mean"
svyciprop(~I(cg14 == "EVERY DAY"), design = chis_reading, method = cimethod, level = 0.95)
## 2.5% 97.5%
## I(cg14 == "EVERY DAY") 0.632 0.561 0.7
svyciprop(~I(cg14 == "3-6 DAYS"), design = chis_reading, method = cimethod, level = 0.95)
## 2.5% 97.5%
## I(cg14 == "3-6 DAYS") 0.191 0.129 0.25
svyciprop(~I(cg14 == "1-2 DAYS"), design = chis_reading, method = cimethod, level = 0.95)
## 2.5% 97.5%
## I(cg14 == "1-2 DAYS") 0.128 0.075 0.18
svyciprop(~I(cg14 == "NEVER"), design = chis_reading, method = cimethod, level = 0.95)
## 2.5% 97.5%
## I(cg14 == "NEVER") 0.0490 0.0181 0.08
Discussion
I was able to load the CHIS 2018 child dataset and confirm that I could use the survey methods to replicate the results from the official CHIS web interface. I learned that calculating proportion confidence intervals may be more accurate using the survey confidence interval function svyciprop
than the usual confint
method.
I didn’t do too much with the data, but next time I will pick up with doing some plots and regressions.