MCMC Estimation of the Binary Outcome Misclassification Model

Jointly estimate \(\beta\) and \(\gamma\) parameters from the true outcome and observation mechanisms, respectively, in a binary outcome misclassification model.
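
For orientation, the two mechanisms can be written out as below. This is a sketch inferred from the data-generating code in the Examples section (logistic links, category 2 as the reference), not a statement taken from the package source. The subscripts \(i\) (observation) and \(j\) (true outcome category) are notation assumed here, and \(x_i\) and \(z_i\) are taken to include a leading 1 for the intercept.

```latex
\[
P(Y_i = 1 \mid x_i) = \frac{\exp(x_i^\top \beta)}{1 + \exp(x_i^\top \beta)},
\qquad
P(Y_i^* = 1 \mid Y_i = j, z_i) = \frac{\exp(z_i^\top \gamma_j)}{1 + \exp(z_i^\top \gamma_j)},
\quad j \in \{1, 2\}.
\]
```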

Usage

COMBO_MCMC(
  Ystar,
  x_matrix,
  z_matrix,
  prior,
  beta_prior_parameters,
  gamma_prior_parameters,
  number_MCMC_chains = 4,
  MCMC_sample = 2000,
  burn_in = 1000,
  display_progress = TRUE
)

Arguments

Ystar: A numeric vector of indicator variables (1, 2) for the observed outcome Y*. The reference category is 2.
x_matrix: A numeric matrix of covariates in the true outcome mechanism. x_matrix should not contain an intercept.
z_matrix: A numeric matrix of covariates in the observation mechanism. z_matrix should not contain an intercept.
prior: A character string specifying the prior distribution for the \(\beta\) and \(\gamma\) parameters. Options are "t", "uniform", "normal", or "dexp" (double exponential, also known as the Laplace distribution).
beta_prior_parameters: A numeric list of prior distribution parameters for the \(\beta\) terms. For prior distributions "t", "uniform", "normal", or "dexp", the first element of the list should contain a matrix of location, lower bound, mean, or shape parameters, respectively, for \(\beta\) terms. For prior distributions "t", "uniform", "normal", or "dexp", the second element of the list should contain a matrix of shape, upper bound, standard deviation, or scale parameters, respectively, for \(\beta\) terms. For prior distribution "t", the third element of the list should contain a matrix of the degrees of freedom for \(\beta\) terms. The third list element should be empty for all other prior distributions. All matrices in the list should have dimensions n_cat X dim_x, and all elements in the n_cat row should be set to NA.
gamma_prior_parameters: A numeric list of prior distribution parameters for the \(\gamma\) terms. For prior distributions "t", "uniform", "normal", or "dexp", the first element of the list should contain an array of location, lower bound, mean, or shape parameters, respectively, for \(\gamma\) terms. For prior distributions "t", "uniform", "normal", or "dexp", the second element of the list should contain an array of shape, upper bound, standard deviation, or scale parameters, respectively, for \(\gamma\) terms. For prior distribution "t", the third element of the list should contain an array of the degrees of freedom for \(\gamma\) terms. The third list element should be empty for all other prior distributions. All arrays in the list should have dimensions n_cat X n_cat X dim_z, and all elements in the n_cat row should be set to NA.
number_MCMC_chains: An integer specifying the number of MCMC chains to compute. The default is 4.
MCMC_sample: An integer specifying the number of MCMC samples to draw. The default is 2000.
burn_in: An integer specifying the number of MCMC samples to discard for the burn-in period. The default is 1000.
display_progress: A logical value specifying whether messages should be displayed during model compilation. The default is TRUE.
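
The shape requirements for beta_prior_parameters and gamma_prior_parameters can be sketched as follows for a "normal" prior. The helper fill_with_na_row is hypothetical (it is not part of COMBO), the hyperparameter values 0, 10, and 3 are arbitrary illustrations, and the list names are only labels for readability, since the argument descriptions above identify elements by position.

```r
# Assumed dimensions: 2 outcome categories, intercept + 1 covariate in each mechanism
n_cat <- 2
dim_x <- 2
dim_z <- 2

# Hypothetical helper (not part of COMBO): build a matrix or 3-d array filled
# with `value`, then set the reference-category (n_cat) row to NA.
fill_with_na_row <- function(value, dims) {
  out <- array(value, dim = dims)
  if (length(dims) == 2) {
    out[dims[1], ] <- NA
  } else {
    out[dims[1], , ] <- NA
  }
  out
}

# "normal" prior: first element holds means, second holds standard deviations
beta_prior_parameters <- list(
  mean = fill_with_na_row(0, c(n_cat, dim_x)),
  sd   = fill_with_na_row(10, c(n_cat, dim_x))
)
gamma_prior_parameters <- list(
  mean = fill_with_na_row(0, c(n_cat, n_cat, dim_z)),
  sd   = fill_with_na_row(10, c(n_cat, n_cat, dim_z))
)

# "t" prior: a third element carries the degrees of freedom
beta_t_prior_parameters <- list(
  location = fill_with_na_row(0, c(n_cat, dim_x)),
  shape    = fill_with_na_row(1, c(n_cat, dim_x)),
  df       = fill_with_na_row(3, c(n_cat, dim_x))
)
```

These lists would then be passed together with the matching prior string, for example prior = "normal".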

Value

COMBO_MCMC returns a list of the posterior samples and posterior means for both the binary outcome misclassification model and a naive logistic regression of the observed outcome, Y*, on the covariates in x_matrix. The list contains the following components:

posterior_sample_df: A data frame containing three columns. The first column indicates the chain from which a sample is taken, from 1 to number_MCMC_chains. The second column specifies the parameter associated with a given row. The \(\beta\) terms are indexed by outcome category and then covariate, with dimensions n_cat X dim_x (for example, beta[1,2]). The \(\gamma\) terms have dimensions n_cat X n_cat X dim_z, where the first index specifies the observed outcome category, the second index specifies the true outcome category, and the third index specifies the covariate. The final column provides the MCMC sample.
posterior_means_df: A data frame containing three columns. The first column specifies the parameter associated with a given row. Parameters are indexed as in the posterior_sample_df. The second column provides the posterior mean computed across all chains and all samples. The final column provides the posterior median computed across all chains and all samples.
naive_posterior_sample_df: A data frame containing three columns. The first column indicates the chain from which a sample is taken, from 1 to number_MCMC_chains. The second column specifies the parameter associated with a given row. Naive \(\beta\) terms are indexed by outcome category and then covariate, with dimensions n_cat X dim_x. The final column provides the MCMC sample.
naive_posterior_means_df: A data frame containing three columns. The first column specifies the naive parameter associated with a given row. Parameters are indexed as in the naive_posterior_sample_df. The second column provides the posterior mean computed across all chains and all samples. The final column provides the posterior median computed across all chains and all samples.
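
As an illustration of how the long-format sample data frames relate to the summary data frames, the sketch below pools a mock sample table across chains and also summarizes it per chain. The mock data and the column names chain, parameter_name, and sample are assumptions made for this example. Only parameter_name is confirmed by the printed posterior_means_df in the Examples section, so check names(MCMC_results$posterior_sample_df) before adapting it.

```r
set.seed(1)

# Mock stand-in for posterior_sample_df (column names are assumptions)
mock_samples <- data.frame(
  chain = rep(1:2, each = 200),
  parameter_name = rep(rep(c("beta[1,1]", "beta[1,2]"), each = 100), times = 2),
  sample = c(rnorm(100, 1, 0.1), rnorm(100, -2, 0.1),
             rnorm(100, 1, 0.1), rnorm(100, -2, 0.1))
)

# Posterior mean and median per parameter, pooled across all chains
pooled <- aggregate(sample ~ parameter_name, data = mock_samples,
                    FUN = function(s) c(mean = mean(s), median = median(s)))

# Per-chain means: large disagreement between chains can signal poor mixing
by_chain <- aggregate(sample ~ chain + parameter_name, data = mock_samples,
                      FUN = mean)

pooled
by_chain
```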

Examples

# \donttest{
set.seed(123)
n <- 1000
x_mu <- 0
x_sigma <- 1
z_shape <- 1

true_beta <- matrix(c(1, -2), ncol = 1)
true_gamma <- matrix(c(.5, 1, -.5, -1), nrow = 2, byrow = FALSE)

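# Covariates without an intercept (passed to COMBO_MCMC) and design matrices
# with an intercept column (used only to simulate the data)
x_matrix <- matrix(rnorm(n, x_mu, x_sigma), ncol = 1)
X <- matrix(c(rep(1, n), x_matrix[,1]), ncol = 2, byrow = FALSE)
z_matrix <- matrix(rgamma(n, z_shape), ncol = 1)
Z <- matrix(c(rep(1, n), z_matrix[,1]), ncol = 2, byrow = FALSE)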

# True outcome probabilities: column 1 is P(Y = 1), column 2 is P(Y = 2)
exp_xb <- exp(X %*% true_beta)
pi_result <- exp_xb[,1] / (exp_xb[,1] + 1)
pi_matrix <- matrix(c(pi_result, 1 - pi_result), ncol = 2, byrow = FALSE)

# Draw each true outcome category (1 or 2)
true_Y <- rep(NA, n)
for(i in 1:n){
    true_Y[i] <- which(stats::rmultinom(1, 1, pi_matrix[i,]) == 1)
}

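# Observation probabilities P(Y* = 1 | Y = j); column j is the true category
exp_zg <- exp(Z %*% true_gamma)
pistar_denominator <- matrix(c(1 + exp_zg[,1], 1 + exp_zg[,2]), ncol = 2, byrow = FALSE)
pistar_result <- exp_zg / pistar_denominator

# Stack into a 2n x 2 matrix: rows 1..n hold P(Y* = 1 | Y = j),
# rows (n + 1)..2n hold P(Y* = 2 | Y = j)
pistar_matrix <- matrix(c(pistar_result[,1], 1 - pistar_result[,1],
                          pistar_result[,2], 1 - pistar_result[,2]),
                        ncol = 2, byrow = FALSE)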

# Draw each observed outcome given the true outcome category
obs_Y <- rep(NA, n)
for(i in 1:n){
    true_j <- true_Y[i]
    obs_Y[i] <- which(stats::rmultinom(1, 1,
                      pistar_matrix[c(i, n + i), true_j]) == 1)
}

Ystar <- obs_Y

unif_lower_beta <- matrix(c(-5, -5, NA, NA), nrow = 2, byrow = TRUE)
unif_upper_beta <- matrix(c(5, 5, NA, NA), nrow = 2, byrow = TRUE)

unif_lower_gamma <- array(data = c(-5, NA, -5, NA, -5, NA, -5, NA),
                          dim = c(2,2,2))
unif_upper_gamma <- array(data = c(5, NA, 5, NA, 5, NA, 5, NA),
                          dim = c(2,2,2))

beta_prior_parameters <- list(lower = unif_lower_beta, upper = unif_upper_beta)
gamma_prior_parameters <- list(lower = unif_lower_gamma, upper = unif_upper_gamma)

MCMC_results <- COMBO_MCMC(Ystar, x_matrix = x_matrix, z_matrix = z_matrix,
                           prior = "uniform",
                           beta_prior_parameters = beta_prior_parameters,
                           gamma_prior_parameters = gamma_prior_parameters,
                           number_MCMC_chains = 2,
                           MCMC_sample = 200, burn_in = 100)
#> Compiling model graph
#>    Resolving undeclared variables
#>    Allocating nodes
#> Graph information:
#>    Observed stochastic nodes: 1000
#>    Unobserved stochastic nodes: 6
#>    Total graph size: 35030
#> 
#> Initializing model
#> 
#> Compiling model graph
#>    Resolving undeclared variables
#>    Allocating nodes
#> Graph information:
#>    Observed stochastic nodes: 1000
#>    Unobserved stochastic nodes: 2
#>    Total graph size: 12013
#> 
#> Initializing model
#> 
MCMC_results$posterior_means_df# }
#> # A tibble: 6 × 3
#>   parameter_name posterior_mean posterior_median
#>   <fct>                   <dbl>            <dbl>
#> 1 beta[1,1]              1.05             1.07  
#> 2 beta[1,2]             -2.40            -2.39  
#> 3 gamma[1,1,1]           0.538            0.552 
#> 4 gamma[1,2,1]          -0.0422          -0.0907
#> 5 gamma[1,1,2]           0.982            0.904 
#> 6 gamma[1,2,2]          -1.61            -1.43
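
The \(\gamma\) estimates are on the log-odds scale. A minimal sketch of converting them to observation probabilities is given below, using the posterior means copied from the output above. It assumes the logistic form used in the simulation code and that the third \(\gamma\) index runs over the intercept (1) and the z covariate (2), which is consistent with the true values in true_gamma. The covariate value z_value is a hypothetical choice for illustration.

```r
# Posterior means copied from the printed posterior_means_df
gamma_est <- c("gamma[1,1,1]" = 0.538, "gamma[1,2,1]" = -0.0422,
               "gamma[1,1,2]" = 0.982, "gamma[1,2,2]" = -1.61)

z_value <- 1  # hypothetical covariate value

# P(Y* = 1 | Y = 1, z): a true category-1 outcome is observed as category 1
p_obs1_true1 <- plogis(gamma_est[["gamma[1,1,1]"]] +
                       gamma_est[["gamma[1,1,2]"]] * z_value)

# P(Y* = 1 | Y = 2, z): a true category-2 outcome is observed as category 1
p_obs1_true2 <- plogis(gamma_est[["gamma[1,2,1]"]] +
                       gamma_est[["gamma[1,2,2]"]] * z_value)

c(p_obs1_true1 = p_obs1_true1, p_obs1_true2 = p_obs1_true2)
```

Evaluating these at the posterior means gives point summaries only. Applying the same transformation to each row of the posterior samples would give a full posterior for each probability.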