In this document, we walk through the LCA warmup again, but we break down the code input and output step-by-step.

library(tidyverse)
library(poLCA)

Load data.

week9 <- read.csv("week10-practice.csv", header=T, skip=3)
head(week9)
##      id    team manager jobTasksSDSA jobBalanceSDSA jobPaySDSA
## 1 W472V  Ethics  X8022Z            7              3          7
## 2 S227N     DSI  V8472W            1              6          3
## 3 N333E Finance  X7645X            7              7          7
## 4 K503P Finance  Y4468Y            2              2          4
## 5 J845U     R&D  X8066X            6              3          7
## 6 O834K Finance  Z7298V            3              3          4
##   jobRecognitionSDSA jobAdvancementSDSA jobCoworkersSDSA jobManagementSDSA
## 1                  7                  7                5                 4
## 2                  2                  2                7                 7
## 3                  7                  7                7                NA
## 4                  3                  4                3                 3
## 5                  7                  7                4                 4
## 6                  4                  4                4                 3
##   salary tenure age genderFM degreeBMP citizenNY
## 1 126460      6  30        2         3         2
## 2  47749      1  38        1         1         1
## 3  80828      6  24        2         3         1
## 4  52852      6  35        1         1         2
## 5  98429     10  46        2         3         2
## 6  45397      1  50        1         1         2
str(week9)
## 'data.frame':    1384 obs. of  16 variables:
##  $ id                : Factor w/ 1384 levels "A103R","A105D",..: 1208 975 708 549 509 792 972 637 235 311 ...
##  $ team              : Factor w/ 5 levels "DSI","Ethics",..: 2 1 3 3 5 3 5 5 5 3 ...
##  $ manager           : Factor w/ 18 levels "I511L","V4497X",..: 10 4 9 13 11 18 16 8 8 15 ...
##  $ jobTasksSDSA      : int  7 1 7 2 6 3 2 7 3 4 ...
##  $ jobBalanceSDSA    : int  3 6 7 2 3 3 3 4 6 5 ...
##  $ jobPaySDSA        : int  7 3 7 4 7 4 3 7 4 5 ...
##  $ jobRecognitionSDSA: int  7 2 7 3 7 4 4 7 3 5 ...
##  $ jobAdvancementSDSA: int  7 2 7 4 7 4 3 7 4 6 ...
##  $ jobCoworkersSDSA  : int  5 7 7 3 4 4 4 5 7 7 ...
##  $ jobManagementSDSA : int  4 7 NA 3 4 3 5 6 7 6 ...
##  $ salary            : int  126460 47749 80828 52852 98429 45397 171648 53330 145440 71567 ...
##  $ tenure            : int  6 1 6 6 10 1 8 7 2 8 ...
##  $ age               : int  30 38 24 35 46 50 34 53 22 42 ...
##  $ genderFM          : int  2 1 2 1 2 1 2 2 2 2 ...
##  $ degreeBMP         : int  3 1 3 1 3 1 3 1 3 2 ...
##  $ citizenNY         : int  2 1 1 2 2 2 2 2 1 2 ...

We want to determine whether there is a grouping in the data that accounts for the relationships among gender, education, and citizenship.

The poLCA package

The poLCA package was introduced in the walkthrough, but I was asked for a bit more clarity on its usage.

The framework behind LCA is taking a group of categorical multivariate dependent variables, such as survey responses, and accounting for their correlation through latent class membership.

poLCA requires a formula specification with two elements: a list of dependent variables, which would be the multiple outcomes/survey items, and a list of independent variables, which are predictors of class membership. The dependent variables require a cbind(), but the independent variables do not.

Important: predictors for class membership are not required. If you just want to determine latent classes for a set of variables without a predictor, then simply use a 1 on the right side of the ~. See comments below.
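To make the two cases concrete, here is a sketch of both forms with placeholder variable names (item1, item2, item3, x1, and x2 are hypothetical):

```r
## unconditional: no covariates, just find the classes
f_uncond <- cbind(item1, item2, item3) ~ 1

## conditional: covariates on the righthand side predict class membership
f_cond <- cbind(item1, item2, item3) ~ x1 + x2
```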

For our hypothesis, we have:

## for ease, write formula into object 
## form:
## dependent variables are on the lefthand side, and need cbind()
## the formula is specified by a tilde, ~ 
## the predictors of class membership are on the righthand side, normal formula specification (iv1 + iv2, etc) 
## if '~ 1', then there is no predictor on class membership 
unconditionalLCA <- cbind(genderFM, degreeBMP, citizenNY) ~ 1

In addition to the formula, there are many options when running poLCA; see the code with comments below.

set.seed(105)
unconditional2class <- poLCA(unconditionalLCA, # the formula for the model, 
                             week9, # the data
                             nclass =2,  # number of classes to fit
                             maxiter = 10000, #max iterations
                             tol = 1e-8, # this is the degree of improvement from one iteration to the next to determine if the estimation converged. 1e-10 is very conservative, can relax to 1e-6. 
                             nrep = 1, #number of times to carry out the iteration procedure. complex models sometimes need multiple reps to increase confidence in the solution 
                             verbose = F # I don't want all the output for this document!
)

unconditional2class
Conditional item response (column) probabilities,
 by outcome variable, for each class (row) 
 
$genderFM
           Pr(1)  Pr(2)
class 1:  0.2009 0.7991
class 2:  0.8026 0.1974

$degreeBMP
           Pr(1)  Pr(2)  Pr(3)
class 1:  0.1412 0.1875 0.6713
class 2:  0.5930 0.4041 0.0029

$citizenNY
           Pr(1)  Pr(2)
class 1:  0.2085 0.7915
class 2:  0.2018 0.7982

Estimated class population shares 
 0.5113 0.4887 
 
Predicted class memberships (by modal posterior prob.) 
 0.4841 0.5159 
 
========================================================= 
Fit for 2 latent classes: 
========================================================= 
number of observations: 1384 
number of estimated parameters: 9 
residual degrees of freedom: 2 
maximum log-likelihood: -3043.681 
 
AIC(2): 6105.362
BIC(2): 6152.457
G^2(2): 0.2288443 (Likelihood ratio/deviance statistic) 
X^2(2): 0.2285847 (Chi-square goodness of fit) 
 
ALERT: iterations finished, MAXIMUM LIKELIHOOD NOT FOUND 
 

The above was calculated very quickly, but the maximum likelihood was not found. This means that the estimation did not converge. I will increase the iterations to see if a solution is reached.
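A quick way to confirm that the run simply hit the iteration cap is to compare the fitted object's numiter component (used again later in this document) against the maxiter setting:

```r
## if numiter equals maxiter, the EM loop stopped because it ran out of
## iterations, not because the tolerance criterion was met
unconditional2class$numiter
```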

set.seed(105)
unconditional2class <- poLCA(unconditionalLCA, # the formula for the model, 
                             week9, # the data
                             nclass =2,  # number of classes to fit
                             maxiter = 15000, #max iterations
                             tol = 1e-10, # this is the degree of improvement from one iteration to the next to determine if the estimation converged. 1e-10 is very conservative, can relax to 1e-6. 
                             nrep = 1, #number of times to carry out the iteration procedure. complex models sometimes need multiple reps to increase confidence in the solution 
                             verbose = F # I don't want all the output for this document!
)

unconditional2class
Conditional item response (column) probabilities,
 by outcome variable, for each class (row) 
 
$genderFM
           Pr(1)  Pr(2)
class 1:  0.2025 0.7975
class 2:  0.7886 0.2114

$degreeBMP
           Pr(1)  Pr(2)  Pr(3)
class 1:  0.1307 0.1824 0.6869
class 2:  0.5943 0.4048 0.0009

$citizenNY
           Pr(1)  Pr(2)
class 1:  0.2086 0.7914
class 2:  0.2018 0.7982

Estimated class population shares 
 0.5011 0.4989 
 
Predicted class memberships (by modal posterior prob.) 
 0.4603 0.5397 
 
========================================================= 
Fit for 2 latent classes: 
========================================================= 
number of observations: 1384 
number of estimated parameters: 9 
residual degrees of freedom: 2 
maximum log-likelihood: -3043.68 
 
AIC(2): 6105.361
BIC(2): 6152.455
G^2(2): 0.2273009 (Likelihood ratio/deviance statistic) 
X^2(2): 0.2270325 (Chi-square goodness of fit) 
 
ALERT: iterations finished, MAXIMUM LIKELIHOOD NOT FOUND 
 

The estimation still did not converge. When a model does not converge, we usually don’t have much, if any, confidence in interpreting the results. However, look at the above two outputs. Notice anything strange about the pattern of results?

One limitation of LCA is that, because the estimation depends on its starting values and can settle into different local maxima, the solution can change even when you analyze the same data twice. For this reason, you should set a seed before you run the analyses, and also realize that changing the iteration limit or the tolerance can lead to different solutions. Sometimes these differences are slight, but other times they can be drastic.
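One safeguard worth trying (not part of the runs above; the settings here are illustrative) is to raise nrep so that poLCA restarts the estimation from several random sets of starting values and keeps the solution with the highest log-likelihood:

```r
set.seed(105)
stable2class <- poLCA(unconditionalLCA,
                      week9,
                      nclass = 2,
                      maxiter = 20000,
                      tol = 1e-6,
                      nrep = 10,     # 10 random restarts; best solution is kept
                      verbose = FALSE)
stable2class$llik    # log-likelihood of the best of the 10 runs
```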

plot(unconditional2class)

For these data, genderFM = 1 indicates female and 2 indicates male; degree is coded 1 = bachelor's, 2 = master's, 3 = PhD; and citizen is coded 1 = no and 2 = yes. Everyone seemed to have a good handle on interpreting these plots.

Notice that for both classes, the pattern of non-citizens is pretty similar. If citizenship doesn’t change across classes, then class membership doesn’t really do much in terms of dividing citizens and non-citizens. This makes sense if we do some EDA and notice that citizenship is essentially uncorrelated with gender and degree:

week9 %>% dplyr::select(genderFM, degreeBMP, citizenNY) %>% cor(.)
              genderFM    degreeBMP    citizenNY
genderFM   1.000000000  0.400856831 -0.002016765
degreeBMP  0.400856831  1.000000000 -0.001968835
citizenNY -0.002016765 -0.001968835  1.000000000

Let’s try to reach convergence by relaxing the tolerance and increasing the iterations one more time:

set.seed(105)
unconditional2class_b <- poLCA(unconditionalLCA, # the formula for the model, 
                             week9, # the data
                             nclass =2,  # number of classes to fit
                             maxiter = 20000, #max iterations
                             tol = 1e-6, # this is the degree of improvement from one iteration to the next to determine if the estimation converged. 1e-10 is very conservative, can relax to 1e-6. 
                             nrep = 1, #number of times to carry out the iteration procedure. complex models sometimes need multiple reps to increase confidence in the solution 
                             verbose = F # I don't want all the output for this document!
)

unconditional2class_b
Conditional item response (column) probabilities,
 by outcome variable, for each class (row) 
 
$genderFM
           Pr(1)  Pr(2)
class 1:  0.1861 0.8139
class 2:  0.8381 0.1619

$degreeBMP
          Pr(1)  Pr(2)  Pr(3)
class 1:  0.164 0.1984 0.6375
class 2:  0.582 0.3988 0.0192

$citizenNY
           Pr(1)  Pr(2)
class 1:  0.2082 0.7918
class 2:  0.2019 0.7981

Estimated class population shares 
 0.5263 0.4737 
 
Predicted class memberships (by modal posterior prob.) 
 0.5751 0.4249 
 
========================================================= 
Fit for 2 latent classes: 
========================================================= 
number of observations: 1384 
number of estimated parameters: 9 
residual degrees of freedom: 2 
maximum log-likelihood: -3043.685 
 
AIC(2): 6105.37
BIC(2): 6152.465
G^2(2): 0.2370587 (Likelihood ratio/deviance statistic) 
X^2(2): 0.2368331 (Chi-square goodness of fit) 
 
unconditional2class_b$numiter
[1] 1885

Relaxing the tolerance led to convergence. That just means we let the estimation stop once the improvement in likelihood from one iteration to the next fell below a larger threshold, rather than demanding an even smaller change. I am ok with 0.000001 improvement in likelihood as a convergence threshold.

Let’s try 3 classes. We aren’t changing anything in the model, so our formula stays the same.

set.seed(559)
unconditional3class <- poLCA(unconditionalLCA, # the formula for the model, 
                             week9, # the data
                             nclass =3,  # number of classes to fit
                             maxiter = 10000, #max iterations
                             tol = 1e-6, # this is the degree of improvement from one iteration to the next to determine if the estimation converged. 1e-10 is very conservative, can relax to 1e-6. 
                             nrep = 1, #number of times to carry out the iteration procedure. complex models sometimes need multiple reps to increase confidence in the solution 
                             verbose = F # I don't want all the output for this document!
)

unconditional3class
Conditional item response (column) probabilities,
 by outcome variable, for each class (row) 
 
$genderFM
           Pr(1)  Pr(2)
class 1:  0.1077 0.8923
class 2:  0.7526 0.2474
class 3:  0.8085 0.1915

$degreeBMP
           Pr(1)  Pr(2)  Pr(3)
class 1:  0.1318 0.1683 0.6999
class 2:  0.4042 0.5937 0.0021
class 3:  0.6534 0.1892 0.1574

$citizenNY
           Pr(1)  Pr(2)
class 1:  0.2077 0.7923
class 2:  0.1842 0.8158
class 3:  0.2214 0.7786

Estimated class population shares 
 0.4252 0.2794 0.2954 
 
Predicted class memberships (by modal posterior prob.) 
 0.5051 0.1777 0.3172 
 
========================================================= 
Fit for 3 latent classes: 
========================================================= 
number of observations: 1384 
number of estimated parameters: 14 
residual degrees of freedom: -3 
maximum log-likelihood: -3043.569 
 
AIC(3): 6115.138
BIC(3): 6188.396
G^2(3): 0.004498952 (Likelihood ratio/deviance statistic) 
X^2(3): 0.00450203 (Chi-square goodness of fit) 
 
ALERT: negative degrees of freedom; respecify model 
 

We see at the bottom of the output an alert: negative degrees of freedom. This is because we don’t have enough information to estimate the usual LCA with 3 classes in these data. We have 3 variables with 2 x 3 x 2 categories, yielding 12 unique response combinations. If we know the proportions for 11 of the combinations, we can deduce the 12th, so we have 11 degrees of freedom. In the three-class model, we estimate 1 + 2 + 1 free item-response probabilities for each of 3 classes, plus 2 free class-share parameters. That’s 14 estimated parameters, which is more than these data can support. I would not interpret this output because the model is not identified - we can’t be certain about any solution that the software settles on.
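The parameter-counting argument above is plain arithmetic, so we can sketch it directly:

```r
## identification check for the 3-class model
cells   <- 2 * 3 * 2                 # unique response patterns: 12
df_data <- cells - 1                 # independent cell proportions: 11

## free item-response probabilities per class: (2-1) + (3-1) + (2-1) = 4
## across 3 classes: 12; plus (3-1) = 2 free class shares
params  <- 3 * ((2 - 1) + (3 - 1) + (2 - 1)) + (3 - 1)

df_data - params                     # residual df: 11 - 14 = -3
```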

For completeness, let’s look at an LCA with age as a predictor:

conditionalLCA <- cbind(genderFM, degreeBMP, citizenNY) ~ age

set.seed(9872)
conditional2class <- poLCA(conditionalLCA, # the formula for the model, 
                             week9, # the data
                             nclass =2,  # number of classes to fit
                             maxiter = 10000, #max iterations
                             tol = 1e-6, # this is the degree of improvement from one iteration to the next to determine if the estimation converged. 1e-10 is very conservative, can relax to 1e-6. 
                             nrep = 1, #number of times to carry out the iteration procedure. complex models sometimes need multiple reps to increase confidence in the solution 
                             verbose = F # I don't want all the output for this document!
)

conditional2class
Conditional item response (column) probabilities,
 by outcome variable, for each class (row) 
 
$genderFM
           Pr(1)  Pr(2)
class 1:  0.8350 0.1650
class 2:  0.1998 0.8002

$degreeBMP
           Pr(1)  Pr(2)  Pr(3)
class 1:  0.5921 0.4037 0.0042
class 2:  0.1622 0.1976 0.6402

$citizenNY
           Pr(1)  Pr(2)
class 1:  0.2017 0.7983
class 2:  0.2082 0.7918

Estimated class population shares 
 0.4647 0.5353 
 
Predicted class memberships (by modal posterior prob.) 
 0.4249 0.5751 
 
========================================================= 
Fit for 2 latent classes: 
========================================================= 
2 / 1 
            Coefficient  Std. error  t value  Pr(>|t|)
(Intercept)     0.13318     0.24007    0.555     0.678
age             0.00022     0.00665    0.033     0.979
========================================================= 
number of observations: 1384 
number of estimated parameters: 10 
residual degrees of freedom: 1 
maximum log-likelihood: -3043.682 
 
AIC(2): 6107.364
BIC(2): 6159.692
X^2(2): 0.2317106 (Chi-square goodness of fit) 
 
conditional2class$numiter
[1] 23
plot(conditional2class)

Age has no detectable effect on class membership (p = 0.979), so we don’t need it moving forward.

Interpretation

We have two classes: roughly, women with bachelor’s or master’s degrees, and men who mostly hold PhDs. Both classes are about 20% non-citizens, so we learn nothing new about citizenship.
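If we wanted to push the interpretation further, the fitted object’s predclass component holds each respondent’s modal class assignment, which we could cross-tabulate against the observed variables (a sketch; the class2 column is a hypothetical addition to the data):

```r
## attach modal class assignments from the converged 2-class model
week9$class2 <- unconditional2class_b$predclass

## class composition by gender and citizenship
table(week9$class2, week9$genderFM)
prop.table(table(week9$class2, week9$citizenNY), margin = 1)
```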

This is just a toy analysis with only a few variables. An analysis of survey response items (e.g., risky behaviors, customer preferences, disease diagnoses, depressive symptoms) would be more interesting and informative. The conclusion I would draw from this analysis is that for these data, we should control for gender and education in any future analyses, because there is variability in the data accounted for on those two dimensions.