EM-Algorithm Estimation of the Two-Stage Binary Outcome Misclassification Model

Jointly estimate the \(\beta\), \(\gamma^{(1)}\), and \(\gamma^{(2)}\) parameters of the true outcome, first-stage observation, and second-stage observation mechanisms, respectively, in a two-stage binary outcome misclassification model.

Usage

COMBO_EM_2stage(
  Ystar1,
  Ystar2,
  x_matrix,
  z1_matrix,
  z2_matrix,
  beta_start,
  gamma1_start,
  gamma2_start,
  tolerance = 1e-07,
  max_em_iterations = 1500,
  em_method = "squarem"
)

Arguments

Ystar1: A numeric vector of indicator variables (1, 2) for the first-stage observed outcome \(Y^{*(1)}\). There should be no NA terms. The reference category is 2.
Ystar2: A numeric vector of indicator variables (1, 2) for the second-stage observed outcome \(Y^{*(2)}\). There should be no NA terms. The reference category is 2.
x_matrix: A numeric matrix of covariates in the true outcome mechanism. x_matrix should not contain an intercept and no values should be NA.
z1_matrix: A numeric matrix of covariates in the first-stage observation mechanism. z1_matrix should not contain an intercept and no values should be NA.
z2_matrix: A numeric matrix of covariates in the second-stage observation mechanism. z2_matrix should not contain an intercept and no values should be NA.
beta_start: A numeric vector or column matrix of starting values for the \(\beta\) parameters in the true outcome mechanism. The number of elements in beta_start should be equal to the number of columns of x_matrix plus 1.
gamma1_start: A numeric vector or matrix of starting values for the \(\gamma^{(1)}\) parameters in the first-stage observation mechanism. In matrix form, the rows of gamma1_start correspond to parameters for the \(Y^{*(1)} = 1\) first-stage observed outcome, so the number of rows equals the number of columns of z1_matrix plus 1, and the columns correspond to the true outcome categories \(Y \in \{1, 2\}\). A numeric vector for gamma1_start is obtained by concatenating the matrix, i.e. gamma1_start <- c(gamma1_matrix).
gamma2_start: A numeric array of starting values for the \(\gamma^{(2)}\) parameters in the second-stage observation mechanism. The first dimension (matrix rows) of gamma2_start corresponds to parameters for the \(Y^{*(2)} = 1\) second-stage observed outcome, so its length equals the number of columns of z2_matrix plus 1. The second dimension (matrix columns) corresponds to the first-stage observed outcome categories \(Y^{*(1)} \in \{1, 2\}\). The third dimension of gamma2_start corresponds to the true outcome categories \(Y \in \{1, 2\}\).
tolerance: A numeric value specifying when to stop estimation, based on the difference between successive log-likelihood values. The default is 1e-7.
max_em_iterations: An integer specifying the maximum number of iterations of the EM algorithm. The default is 1500.
em_method: A character string specifying which EM algorithm will be applied. Options are "em", "squarem", or "pem". The default and recommended option is "squarem".
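
The shape requirements for the starting values can be easier to see in code. The following is a minimal sketch, assuming one covariate per mechanism; the simulated covariates and the use of random standard normal starting values are illustrative assumptions, not a recommendation from the package.

```r
set.seed(1)

# Hypothetical covariate matrices (no intercept column), one covariate each.
n <- 10
x_matrix  <- matrix(rnorm(n), ncol = 1)
z1_matrix <- matrix(rgamma(n, shape = 1), ncol = 1)
z2_matrix <- matrix(rgamma(n, shape = 1), ncol = 1)

# beta_start: one intercept plus one slope per column of x_matrix.
beta_start <- rnorm(ncol(x_matrix) + 1)

# gamma1_start: rows = intercept + z1 covariates, columns = true outcome Y in {1, 2}.
gamma1_matrix <- matrix(rnorm((ncol(z1_matrix) + 1) * 2),
                        nrow = ncol(z1_matrix) + 1, ncol = 2)
gamma1_start <- c(gamma1_matrix)  # column-major concatenation

# gamma2_start: rows = intercept + z2 covariates,
# columns = first-stage observed outcome Y*(1) in {1, 2},
# third dimension = true outcome Y in {1, 2}.
gamma2_start <- array(rnorm((ncol(z2_matrix) + 1) * 2 * 2),
                      dim = c(ncol(z2_matrix) + 1, 2, 2))
```

Because c() concatenates a matrix column by column, the first ncol(z1_matrix) + 1 elements of gamma1_start are the parameters for true outcome category 1.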

Value

COMBO_EM_2stage returns a data frame containing four columns. The first column, Parameter, names the parameter reported in each row. The second column, Estimates, contains the parameter estimates, and the third column, SE, contains their standard error estimates. The final column, Convergence, reports whether or not the algorithm converged for a given parameter estimate.

Estimates are provided for the two-stage binary misclassification model.
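
As a rough illustration of working with the returned data frame, the sketch below rebuilds two rows with the four documented columns (values copied from the example output on this page) and adds Wald-type 95% intervals. The interval calculation is an assumption for illustration; it is not something COMBO_EM_2stage reports, and it relies on the usual large-sample normal approximation.

```r
# Stand-in for the data frame returned by COMBO_EM_2stage();
# the two rows are copied from the example output on this page.
EM_results <- data.frame(
  Parameter   = c("beta_1", "beta_2"),
  Estimates   = c(1.48576087, -2.32657028),
  SE          = c(0.3238284, 0.4682280),
  Convergence = c(TRUE, TRUE)
)

# Check convergence before interpreting anything.
stopifnot(all(EM_results$Convergence))

# Keep only the beta rows; the "^" anchor excludes rows such as "naive_beta_1".
beta_rows <- EM_results[grepl("^beta_", EM_results$Parameter), ]

# Wald-type 95% intervals (illustrative assumption, not package output).
z <- qnorm(0.975)
beta_rows$Lower <- beta_rows$Estimates - z * beta_rows$SE
beta_rows$Upper <- beta_rows$Estimates + z * beta_rows$SE
beta_rows
```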

Examples

# \donttest{
set.seed(123)
n <- 1000
x_mu <- 0
x_sigma <- 1
z1_shape <- 1
z2_shape <- 1

true_beta <- matrix(c(1, -2), ncol = 1)
true_gamma1 <- matrix(c(.5, 1, -.5, -1), nrow = 2, byrow = FALSE)
true_gamma2 <- array(c(1.5, 1, .5, .5, -.5, 0, -1, -1), dim = c(2, 2, 2))

my_data <- COMBO_data_2stage(sample_size = n,
                             x_mu = x_mu, x_sigma = x_sigma,
                             z1_shape = z1_shape, z2_shape = z2_shape,
                             beta = true_beta, gamma1 = true_gamma1, gamma2 = true_gamma2)
table(my_data[["obs_Ystar2"]], my_data[["obs_Ystar1"]], my_data[["true_Y"]])
#> , ,  = 1
#> 
#>    
#>       1   2
#>   1 457 113
#>   2  51  38
#> 
#> , ,  = 2
#> 
#>    
#>       1   2
#>   1  30  40
#>   2  39 232
#> 

beta_start <- rnorm(length(c(true_beta)))
gamma1_start <- rnorm(length(c(true_gamma1)))
gamma2_start <- rnorm(length(c(true_gamma2)))

EM_results <- COMBO_EM_2stage(Ystar1 = my_data[["obs_Ystar1"]],
                              Ystar2 = my_data[["obs_Ystar2"]],
                              x_matrix = my_data[["x"]],
                              z1_matrix = my_data[["z1"]],
                              z2_matrix = my_data[["z2"]],
                              beta_start = beta_start,
                              gamma1_start = gamma1_start,
                              gamma2_start = gamma2_start)

EM_results# }
#>          Parameter   Estimates          SE Convergence
#> 1           beta_1  1.48576087   0.3238284        TRUE
#> 2           beta_2 -2.32657028   0.4682280        TRUE
#> 3        gamma1_11  0.32115176   0.1672771        TRUE
#> 4        gamma1_21  1.18406349   0.2141527        TRUE
#> 5        gamma1_12 -1.16368178   0.3454245        TRUE
#> 6        gamma1_22 -0.76542933   0.2742785        TRUE
#> 7      gamma2_1111  1.09495809   0.2177926        TRUE
#> 8      gamma2_2111  1.41645126   0.3075869        TRUE
#> 9      gamma2_1121  0.16752733   0.6064557        TRUE
#> 10     gamma2_2121  1.11817996   0.3890782        TRUE
#> 11     gamma2_1112 -1.08983781   1.5149193        TRUE
#> 12     gamma2_2112  0.13234707   0.9157872        TRUE
#> 13     gamma2_1122 -0.93782348   0.3907942        TRUE
#> 14     gamma2_2122 -1.45453541   0.4430070        TRUE
#> 15    naive_beta_1  0.06898792   5.3173399        TRUE
#> 16    naive_beta_2  0.07839931 -10.4633987        TRUE
#> 17 naive_gamma2_11  0.17418532   5.8189020        TRUE
#> 18 naive_gamma2_21  0.19302573   4.2266253        TRUE
#> 19 naive_gamma2_12  0.14501424  -3.9715920        TRUE
#> 20 naive_gamma2_22  0.10691412   0.0766478        TRUE
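
To interpret the first-stage estimates, it can help to convert them into observation probabilities. The sketch below is hedged: it assumes a logistic link with reference category 2, so that \(P(Y^{*(1)} = 1 \mid Y = j, z)\) is the inverse logit of the linear predictor, and it assumes the four gamma1 rows above follow the column-major order of c(gamma1_matrix). The covariate value z1_value is hypothetical.

```r
# gamma1 estimates copied from the output above, reshaped column by column:
# rows = (intercept, z1 slope), columns = true outcome Y in {1, 2}.
gamma1_hat <- matrix(
  c(0.32115176, 1.18406349, -1.16368178, -0.76542933),
  nrow = 2,
  dimnames = list(c("intercept", "z1"), c("Y=1", "Y=2"))
)

z1_value <- 1              # hypothetical covariate value
design <- c(1, z1_value)   # intercept term plus covariate

# Assumed model: P(Y*(1) = 1 | Y = j, z1) = plogis(design %*% gamma1_hat[, j]).
p_obs1 <- plogis(drop(design %*% gamma1_hat))

sensitivity <- p_obs1[["Y=1"]]      # P(Y*(1) = 1 | Y = 1)
specificity <- 1 - p_obs1[["Y=2"]]  # P(Y*(1) = 2 | Y = 2)
```

Under these assumptions, both quantities depend on z1_value, so they describe first-stage accuracy only at the chosen covariate value.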