EM-Algorithm Estimation of the Binary Outcome Misclassification Model

Jointly estimate \(\beta\) and \(\gamma\) parameters from the true outcome and observation mechanisms, respectively, in a binary outcome misclassification model.

Usage

COMBO_EM(
  Ystar,
  x_matrix,
  z_matrix,
  beta_start,
  gamma_start,
  tolerance = 1e-07,
  max_em_iterations = 1500,
  em_method = "squarem"
)

Arguments

Ystar: A numeric vector of indicator variables (1, 2) for the observed outcome Y*. There should be no NA terms. The reference category is 2.
x_matrix: A numeric matrix of covariates in the true outcome mechanism. x_matrix should not contain an intercept and no values should be NA.
z_matrix: A numeric matrix of covariates in the observation mechanism. z_matrix should not contain an intercept and no values should be NA.
beta_start: A numeric vector or column matrix of starting values for the \(\beta\) parameters in the true outcome mechanism. The number of elements in beta_start should be equal to the number of columns of x_matrix plus 1.
gamma_start: A numeric vector or matrix of starting values for the \(\gamma\) parameters in the observation mechanism. In matrix form, the gamma_start matrix rows correspond to parameters for the Y* = 1 observed outcome, with the dimensions of z_matrix plus 1, and the gamma parameter matrix columns correspond to the true outcome categories \(Y \in \{1, 2\}\). A numeric vector for gamma_start is obtained by concatenating the gamma matrix, i.e. gamma_start <- c(gamma_matrix).
tolerance: A numeric value specifying when to stop estimation, based on the difference of subsequent log-likelihood estimates. The default is 1e-7.
max_em_iterations: An integer specifying the maximum number of iterations of the EM algorithm. The default is 1500.
em_method: A character string specifying which EM algorithm will be applied. Options are "em", "squarem", or "pem". The default and recommended option is "squarem".

Value

COMBO_EM returns a data frame containing four columns. The first column, Parameter, represents a unique parameter value for each row. The next column contains the parameter Estimates, followed by the standard error estimates, SE. The final column, Convergence, reports whether or not the algorithm converged for a given parameter estimate.

Estimates are provided for the binary misclassification model, as well as two additional cases. The "SAMBA" parameter estimates are from the R Package, SAMBA, which uses the EM algorithm to estimate a binary outcome misclassification model that assumes there is perfect specificity. The "PSens" parameter estimates are estimated using the EM algorithm for the binary outcome misclassification model that assumes there is perfect sensitivitiy. The "Naive" parameter estimates are from a simple logistic regression Y* ~ X.

References

Beesley, L. and Mukherjee, B. (2020). Statistical inference for association studies using electronic health records: Handling both selection bias and outcome misclassification. Biometrics, 78, 214-226.

Examples

# \donttest{
set.seed(123)
n <- 1000
x_mu <- 0
x_sigma <- 1
z_shape <- 1

true_beta <- matrix(c(1, -2), ncol = 1)
true_gamma <- matrix(c(.5, 1, -.5, -1), nrow = 2, byrow = FALSE)

x_matrix = matrix(rnorm(n, x_mu, x_sigma), ncol = 1)
X = matrix(c(rep(1, n), x_matrix[,1]), ncol = 2, byrow = FALSE)
z_matrix = matrix(rgamma(n, z_shape), ncol = 1)
Z = matrix(c(rep(1, n), z_matrix[,1]), ncol = 2, byrow = FALSE)

exp_xb = exp(X %*% true_beta)
pi_result = exp_xb[,1] / (exp_xb[,1] + 1)
pi_matrix = matrix(c(pi_result, 1 - pi_result), ncol = 2, byrow = FALSE)

true_Y <- rep(NA, n)
for(i in 1:n){
    true_Y[i] = which(stats::rmultinom(1, 1, pi_matrix[i,]) == 1)
}

exp_zg = exp(Z %*% true_gamma)
pistar_denominator = matrix(c(1 + exp_zg[,1], 1 + exp_zg[,2]), ncol = 2, byrow = FALSE)
pistar_result = exp_zg / pistar_denominator

pistar_matrix = matrix(c(pistar_result[,1], 1 - pistar_result[,1],
                         pistar_result[,2], 1 - pistar_result[,2]),
                       ncol = 2, byrow = FALSE)

obs_Y <- rep(NA, n)
for(i in 1:n){
    true_j = true_Y[i]
    obs_Y[i] = which(rmultinom(1, 1,
                     pistar_matrix[c(i, n + i),
                                     true_j]) == 1)
 }

Ystar <- obs_Y

starting_values <- rep(1,6)
beta_start <- matrix(starting_values[1:2], ncol = 1)
gamma_start <- matrix(starting_values[3:6], ncol = 2, nrow = 2, byrow = FALSE)

EM_results <- COMBO_EM(Ystar, x_matrix = x_matrix, z_matrix = z_matrix,
                       beta_start = beta_start, gamma_start = gamma_start)
#> Loading required package: doParallel
#> Loading required package: foreach
#> Loading required package: iterators
#> Loading required package: parallel
#> Loading required package: numDeriv
#> Loading required package: quantreg
#> Loading required package: SparseM
#> 
#> Attaching package: 'turboEM'
#> The following objects are masked from 'package:numDeriv':
#> 
#>     grad, hessian

EM_results# }
#>        Parameter    Estimates          SE Convergence
#> 1          beta1   0.94890120  0.27589385        TRUE
#> 2          beta2  -2.35565018  0.16518517        TRUE
#> 3        gamma11   0.53773267  0.17206710        TRUE
#> 4        gamma21   0.94899526  0.21022199        TRUE
#> 5        gamma12  -0.03143942  0.12881774        TRUE
#> 6        gamma22  -1.42142132  0.25802541        TRUE
#> 7    SAMBA_beta1   1.02579269  0.29776853          NA
#> 8    SAMBA_beta2  -1.13762209  0.20119894          NA
#> 9  SAMBA_gamma11   1.22380033  0.30354598          NA
#> 10 SAMBA_gamma21   0.57807251  0.35267049          NA
#> 11   PSens_beta1   0.36958220  7.02366053          NA
#> 12   PSens_beta2  -0.72231505  2.81551719          NA
#> 13 PSens_gamma12  -0.75532783  3.28479253          NA
#> 14 PSens_gamma22 -43.39974465 22.14360437          NA
#> 15   naive_beta1   0.38315779  0.06811635        TRUE
#> 16   naive_beta2  -0.71541393  0.07536077        TRUE