
EM-Algorithm Estimation of the Binary Outcome Misclassification Model
COMBO_EM.Rd
Jointly estimate \(\beta\) and \(\gamma\) parameters from the true outcome and observation mechanisms, respectively, in a binary outcome misclassification model.
Usage
COMBO_EM(
Ystar,
x_matrix,
z_matrix,
beta_start,
gamma_start,
tolerance = 1e-07,
max_em_iterations = 1500,
em_method = "squarem"
)
Arguments
- Ystar
A numeric vector of indicator variables (1, 2) for the observed outcome
Y*
. There should be noNA
terms. The reference category is 2.- x_matrix
A numeric matrix of covariates in the true outcome mechanism.
x_matrix
should not contain an intercept and no values should beNA
.- z_matrix
A numeric matrix of covariates in the observation mechanism.
z_matrix
should not contain an intercept and no values should beNA
.- beta_start
A numeric vector or column matrix of starting values for the \(\beta\) parameters in the true outcome mechanism. The number of elements in
beta_start
should be equal to the number of columns ofx_matrix
plus 1.- gamma_start
A numeric vector or matrix of starting values for the \(\gamma\) parameters in the observation mechanism. In matrix form, the
gamma_start
matrix rows correspond to parameters for theY* = 1
observed outcome, with the dimensions ofz_matrix
plus 1, and the gamma parameter matrix columns correspond to the true outcome categories \(Y \in \{1, 2\}\). A numeric vector forgamma_start
is obtained by concatenating the gamma matrix, i.e.gamma_start <- c(gamma_matrix)
.- tolerance
A numeric value specifying when to stop estimation, based on the difference of subsequent log-likelihood estimates. The default is
1e-7
.- max_em_iterations
An integer specifying the maximum number of iterations of the EM algorithm. The default is
1500
.- em_method
A character string specifying which EM algorithm will be applied. Options are
"em"
,"squarem"
, or"pem"
. The default and recommended option is"squarem"
.
Value
COMBO_EM
returns a data frame containing four columns. The first
column, Parameter
, represents a unique parameter value for each row.
The next column contains the parameter Estimates
, followed by the standard
error estimates, SE
. The final column, Convergence
, reports
whether or not the algorithm converged for a given parameter estimate.
Estimates are provided for the binary misclassification model, as well as two
additional cases. The "SAMBA" parameter estimates are from the R Package,
SAMBA, which uses the EM algorithm to estimate a binary outcome misclassification
model that assumes there is perfect specificity. The "PSens" parameter estimates
are estimated using the EM algorithm for the binary outcome misclassification
model that assumes there is perfect sensitivitiy. The "Naive" parameter
estimates are from a simple logistic regression Y* ~ X
.
References
Beesley, L. and Mukherjee, B. (2020). Statistical inference for association studies using electronic health records: Handling both selection bias and outcome misclassification. Biometrics, 78, 214-226.
Examples
# \donttest{
set.seed(123)
n <- 1000
x_mu <- 0
x_sigma <- 1
z_shape <- 1
true_beta <- matrix(c(1, -2), ncol = 1)
true_gamma <- matrix(c(.5, 1, -.5, -1), nrow = 2, byrow = FALSE)
x_matrix = matrix(rnorm(n, x_mu, x_sigma), ncol = 1)
X = matrix(c(rep(1, n), x_matrix[,1]), ncol = 2, byrow = FALSE)
z_matrix = matrix(rgamma(n, z_shape), ncol = 1)
Z = matrix(c(rep(1, n), z_matrix[,1]), ncol = 2, byrow = FALSE)
exp_xb = exp(X %*% true_beta)
pi_result = exp_xb[,1] / (exp_xb[,1] + 1)
pi_matrix = matrix(c(pi_result, 1 - pi_result), ncol = 2, byrow = FALSE)
true_Y <- rep(NA, n)
for(i in 1:n){
true_Y[i] = which(stats::rmultinom(1, 1, pi_matrix[i,]) == 1)
}
exp_zg = exp(Z %*% true_gamma)
pistar_denominator = matrix(c(1 + exp_zg[,1], 1 + exp_zg[,2]), ncol = 2, byrow = FALSE)
pistar_result = exp_zg / pistar_denominator
pistar_matrix = matrix(c(pistar_result[,1], 1 - pistar_result[,1],
pistar_result[,2], 1 - pistar_result[,2]),
ncol = 2, byrow = FALSE)
obs_Y <- rep(NA, n)
for(i in 1:n){
true_j = true_Y[i]
obs_Y[i] = which(rmultinom(1, 1,
pistar_matrix[c(i, n + i),
true_j]) == 1)
}
Ystar <- obs_Y
starting_values <- rep(1,6)
beta_start <- matrix(starting_values[1:2], ncol = 1)
gamma_start <- matrix(starting_values[3:6], ncol = 2, nrow = 2, byrow = FALSE)
EM_results <- COMBO_EM(Ystar, x_matrix = x_matrix, z_matrix = z_matrix,
beta_start = beta_start, gamma_start = gamma_start)
#> Loading required package: doParallel
#> Loading required package: foreach
#> Loading required package: iterators
#> Loading required package: parallel
#> Loading required package: numDeriv
#> Loading required package: quantreg
#> Loading required package: SparseM
#>
#> Attaching package: 'turboEM'
#> The following objects are masked from 'package:numDeriv':
#>
#> grad, hessian
EM_results# }
#> Parameter Estimates SE Convergence
#> 1 beta1 0.94890120 0.27589385 TRUE
#> 2 beta2 -2.35565018 0.16518517 TRUE
#> 3 gamma11 0.53773267 0.17206710 TRUE
#> 4 gamma21 0.94899526 0.21022199 TRUE
#> 5 gamma12 -0.03143942 0.12881774 TRUE
#> 6 gamma22 -1.42142132 0.25802541 TRUE
#> 7 SAMBA_beta1 1.02579269 0.29776853 NA
#> 8 SAMBA_beta2 -1.13762209 0.20119894 NA
#> 9 SAMBA_gamma11 1.22380033 0.30354598 NA
#> 10 SAMBA_gamma21 0.57807251 0.35267049 NA
#> 11 PSens_beta1 0.36958220 7.02366053 NA
#> 12 PSens_beta2 -0.72231505 2.81551719 NA
#> 13 PSens_gamma12 -0.75532783 3.28479253 NA
#> 14 PSens_gamma22 -43.39974465 22.14360437 NA
#> 15 naive_beta1 0.38315779 0.06811635 TRUE
#> 16 naive_beta2 -0.71541393 0.07536077 TRUE