Generate correlation matrices with complex survey data in R

The survey package is one of R’s best tools for those working in the social sciences. For many researchers, it removes the need for commercial software when analyzing survey data. However, it lacks a function for one statistic that academic researchers often need to report in publications: correlations. The svycor function in jtools helps to fill that gap.

An initial note, however, is necessary. The basic method behind this feature comes from a response by Thomas Lumley, the survey package author, to a question about calculating correlations with the survey package. To my knowledge, he has not seen or endorsed this function. All that is good about this function should be attributed to Dr. Lumley; all that is wrong with it should be attributed to me (Jacob).

With that said, let’s look at an example. First, we need to get a survey.design object. This one is built into the survey package.

library(survey)
library(jtools) # provides svycor
data(api)
dstrat <- svydesign(id = ~1, strata = ~stype, weights = ~pw, data = apistrat, fpc = ~fpc)
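
Before computing anything, it can help to confirm what the design object encodes. The following is a small sketch, not part of jtools, that rebuilds the same design and compares the sum of the sampling weights with the population size recorded in the fpc column. The names n_represented and n_population are made up for this illustration.

```r
library(survey)
data(api)
dstrat <- svydesign(id = ~1, strata = ~stype, weights = ~pw,
                    data = apistrat, fpc = ~fpc)

# Sum of the sampling weights: the number of schools the sample represents
n_represented <- sum(weights(dstrat))

# Population size implied by the fpc column (one constant value per stratum)
n_population <- sum(tapply(apistrat$fpc, apistrat$stype, unique))

c(represented = n_represented, population = n_population)
```

If the two numbers agree, the weights and the finite population correction describe the same population.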

Basic use

The necessary arguments are no different from those of svyvar. Using a formula, specify which variables to include and which design they come from. It doesn’t matter which side of the formula the variables are on.

svycor(~api00 + api99, design = dstrat)
       api00 api99
 api00  1.00  0.98
 api99  0.98  1.00
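
As a quick check of the claim that the side of the formula doesn’t matter, the sketch below runs svycor with a one-sided and a two-sided formula and compares the unrounded matrices. The object names one_sided and two_sided are only for this illustration.

```r
library(survey)
library(jtools)
data(api)
dstrat <- svydesign(id = ~1, strata = ~stype, weights = ~pw,
                    data = apistrat, fpc = ~fpc)

# Same two variables, written on one side or split across both sides
one_sided <- svycor(~api00 + api99, design = dstrat)
two_sided <- svycor(api00 ~ api99, design = dstrat)

# Compare the unrounded correlation matrices
all.equal(one_sided$cors, two_sided$cors)
```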

Use the digits = argument to set how many digits past the decimal point are printed.

svycor(~api00 + api99, design = dstrat, digits = 4)
        api00  api99
 api00 1.0000 0.9759
 api99 0.9759 1.0000

Any other arguments that you would normally pass to svyvar will be used as well, though in some cases they may not affect the output.
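
One svyvar argument that does change the result is na.rm. The sketch below is an illustration under assumptions: the missing values are introduced artificially, and api_missing and dmiss are names invented for this example.

```r
library(survey)
library(jtools)
data(api)

# Copy the data and blank out a few values to mimic item nonresponse
# (these missing values are introduced purely for illustration)
api_missing <- apistrat
api_missing$api99[c(3, 50, 120)] <- NA

dmiss <- svydesign(id = ~1, strata = ~stype, weights = ~pw,
                   data = api_missing, fpc = ~fpc)

# na.rm is handed on to svyvar, which drops the incomplete cases
with_na_rm <- svycor(~api00 + api99, design = dmiss, na.rm = TRUE)
with_na_rm$cors
```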

Statistical significance tests

One thing that survey won’t do for you is give you p values for the null hypothesis that r = 0. While at first blush finding the p value might seem like a simple procedure, complex surveys will almost always violate the important distributional assumptions that go along with simple hypothesis tests of the correlation coefficient. There is not a clear consensus on the appropriate way to conduct hypothesis tests in this context, due in part to the fact that most analyses of complex surveys occur in the context of multiple regression rather than simple bivariate cases.

If sig.stats = TRUE, then svycor will use the wtd.cor function from the weights package to conduct hypothesis tests. The p values are derived from a bootstrap procedure in which the weights define sampling probability. The bootn = argument is given to wtd.cor to define the number of simulations to run. This can significantly increase the running time for large samples and/or large numbers of bootstraps. The mean1 argument tells wtd.cor whether it should treat your sample size as the number of observations in the survey design (the number of rows in the data frame) or the sum of the weights. Usually, the former is desired, so the default value of mean1 is TRUE.

svycor(~api00 + api99, design = dstrat, digits = 4, sig.stats = TRUE, bootn = 2000, mean1 = TRUE)
       api00   api99
 api00 1       0.9759*
 api99 0.9759* 1

When sig.stats = TRUE, the correlation estimates come from the bootstrap procedure; when sig.stats = FALSE, they come from the simpler method based on the survey-weighted covariance matrix.
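
To make the idea of a bootstrap in which the weights define sampling probability more concrete, here is a base R sketch. It illustrates the general technique only and is not the code that wtd.cor runs. The names n_boot and boot_r are made up for this example, and the resampling ignores the strata.

```r
library(survey) # only needed for the apistrat example data
data(api)

set.seed(1)
n_boot <- 2000
n <- nrow(apistrat)

# Resample rows with replacement, with selection probability proportional
# to each row's weight, and recompute the correlation each time
boot_r <- replicate(n_boot, {
  idx <- sample(n, size = n, replace = TRUE, prob = apistrat$pw)
  cor(apistrat$api00[idx], apistrat$api99[idx])
})

mean(boot_r) # bootstrap estimate of the correlation
sd(boot_r)   # bootstrap standard error
```

Because each run draws new resamples, the results change slightly unless the seed is fixed, and a larger n_boot trades running time for stability.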

By saving the output of the function, you can extract non-rounded coefficients, p values, and standard errors.

res <- svycor(~api00 + api99, design = dstrat, digits = 4, sig.stats = TRUE, bootn = 2000, mean1 = TRUE)

res$cors
           api00     api99
 api00 1.0000000 0.9759047
 api99 0.9759047 1.0000000
res$p.values
       api00 api99
 api00     0     0
 api99     0     0
res$std.err
             api00       api99
 api00 0.000000000 0.003467371
 api99 0.003467371 0.000000000

Technical details

The heavy lifting behind the scenes is done by svyvar, which also calculates covariances, though you may not realize it from its printed output.

svyvar(~api00 + api99, design = dstrat)
       variance     SE
 api00    15191 1255.7
 api99    16518 1318.4

But if you save the svyvar object, you can see that there’s more than meets the eye.

var <- svyvar(~api00 + api99, design = dstrat)
var <- as.matrix(var)
var
          api00    api99
 api00 15190.59 15458.83
 api99 15458.83 16518.24
 attr(,"var")
         api00   api00   api99   api99
 api00 1576883 1580654 1580654 1561998
 api00 1580654 1630856 1630856 1657352
 api99 1580654 1630856 1630856 1657352
 api99 1561998 1657352 1657352 1738266
 attr(,"statistic")
 [1] "variance"

Once we know that, it’s just a matter of using R’s cov2cor function and cleaning up the output.

cor <- cov2cor(var)
cor
           api00     api99
 api00 1.0000000 0.9759047
 api99 0.9759047 1.0000000
 attr(,"var")
         api00   api00   api99   api99
 api00 1576883 1580654 1580654 1561998
 api00 1580654 1630856 1630856 1657352
 api99 1580654 1630856 1630856 1657352
 api99 1561998 1657352 1657352 1738266
 attr(,"statistic")
 [1] "variance"

Now to get rid of that covariance matrix…

cor <- cor[1:nrow(cor), 1:nrow(cor)]
cor
           api00     api99
 api00 1.0000000 0.9759047
 api99 0.9759047 1.0000000
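
The cov2cor step can also be checked by hand. The sketch below copies the two variances and the covariance from the printed svyvar matrix above and applies the usual definition of the correlation. The variable names are made up for this illustration.

```r
# Variances and covariance copied from the printed svyvar matrix
var_api00 <- 15190.59
var_api99 <- 16518.24
cov_00_99 <- 15458.83

# Correlation = covariance divided by the product of the standard deviations
r_by_hand <- cov_00_99 / sqrt(var_api00 * var_api99)
round(r_by_hand, 4)
```

The result matches the off-diagonal entry of the cov2cor output to the printed precision.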

svycor has its own print method, so you won’t see so many digits past the decimal point. You can extract the un-rounded matrix, however.

out <- svycor(~api99 + api00, design = dstrat)
out$cors
           api99     api00
 api99 1.0000000 0.9759047
 api00 0.9759047 1.0000000
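
Putting the steps from this section together, a stripped-down version of the calculation might look like the sketch below. The helper svycor_sketch is a hypothetical name and is not part of jtools; it skips the print method, rounding, and significance tests that svycor provides.

```r
library(survey)
data(api)
dstrat <- svydesign(id = ~1, strata = ~stype, weights = ~pw,
                    data = apistrat, fpc = ~fpc)

# Hypothetical helper (not part of jtools): survey-weighted correlation matrix
svycor_sketch <- function(formula, design, ...) {
  v <- as.matrix(svyvar(formula, design, ...)) # weighted variance-covariance matrix
  r <- cov2cor(v)                              # rescale covariances to correlations
  p <- nrow(r)
  r[seq_len(p), seq_len(p), drop = FALSE]      # subsetting drops the extra attributes
}

svycor_sketch(~api00 + api99, design = dstrat)
```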

Suggestions welcome

Whether you see a problem with the method, found a bug, or have a suggestion for an enhancement, I’d like to hear it. Bug reports on GitHub are probably the best way to do that, but I’ll keep an eye out for comments here as well.

Jacob A. Long
Assistant Professor of Mass Communications