MCMC Estimation of the Two-Stage Binary Outcome Misclassification Model

Jointly estimate \(\beta\), \(\gamma^{(1)}\), and \(\gamma^{(2)}\) parameters from the true outcome first-stage observation, and second-stage observation mechanisms, respectively, in a two-stage binary outcome misclassification model.

Usage

COMBO_MCMC_2stage(
  Ystar1,
  Ystar2,
  x_matrix,
  z1_matrix,
  z2_matrix,
  prior,
  beta_prior_parameters,
  gamma1_prior_parameters,
  gamma2_prior_parameters,
  naive_gamma2_prior_parameters,
  number_MCMC_chains = 4,
  MCMC_sample = 2000,
  burn_in = 1000,
  display_progress = TRUE
)

Arguments

Ystar1: A numeric vector of indicator variables (1, 2) for the observed outcome \(Y^{*(1)}\). The reference category is 2.
Ystar2: A numeric vector of indicator variables (1, 2) for the second-stage observed outcome \(Y^{*(2)}\). There should be no NA terms. The reference category is 2.
x_matrix: A numeric matrix of covariates in the true outcome mechanism. x_matrix should not contain an intercept.
z1_matrix: A numeric matrix of covariates in the observation mechanism. z1_matrix should not contain an intercept.
z2_matrix: A numeric matrix of covariates in the second-stage observation mechanism. z2_matrix should not contain an intercept and no values should be NA.
prior: A character string specifying the prior distribution for the \(\beta\), \(\gamma^{(1)}\), and \(\gamma^{(2)}\) parameters. Options are "t", "uniform", "normal", or "dexp" (double Exponential, or Weibull).
beta_prior_parameters: A numeric list of prior distribution parameters for the \(\beta\) terms. For prior distributions "t", "uniform", "normal", or "dexp", the first element of the list should contain a matrix of location, lower bound, mean, or shape parameters, respectively, for \(\beta\) terms. For prior distributions "t", "uniform", "normal", or "dexp", the second element of the list should contain a matrix of shape, upper bound, standard deviation, or scale parameters, respectively, for \(\beta\) terms. For prior distribution "t", the third element of the list should contain a matrix of the degrees of freedom for \(\beta\) terms. The third list element should be empty for all other prior distributions. All matrices in the list should have dimensions n_cat X dim_x, and all elements in the n_cat row should be set to NA.
gamma1_prior_parameters: A numeric list of prior distribution parameters for the \(\gamma^{(1)}\) terms. For prior distributions "t", "uniform", "normal", or "dexp", the first element of the list should contain an array of location, lower bound, mean, or shape parameters, respectively, for \(\gamma^{(1)}\) terms. For prior distributions "t", "uniform", "normal", or "dexp", the second element of the list should contain an array of shape, upper bound, standard deviation, or scale parameters, respectively, for \(\gamma^{(1)}\) terms. For prior distribution "t", the third element of the list should contain an array of the degrees of freedom for \(\gamma^{(1)}\) terms. The third list element should be empty for all other prior distributions. All arrays in the list should have dimensions n_cat X n_cat X dim_z1, and all elements in the n_cat row should be set to NA.
gamma2_prior_parameters: A numeric list of prior distribution parameters for the \(\gamma^{(2)}\) terms. For prior distributions "t", "uniform", "normal", or "dexp", the first element of the list should contain an array of location, lower bound, mean, or shape parameters, respectively, for \(\gamma^{(2)}\) terms. For prior distributions "t", "uniform", "normal", or "dexp", the second element of the list should contain an array of shape, upper bound, standard deviation, or scale parameters, respectively, for \(\gamma^{(2)}\) terms. For prior distribution "t", the third element of the list should contain an array of the degrees of freedom for \(\gamma^{(2)}\) terms. The third list element should be empty for all other prior distributions. All arrays in the list should have dimensions n_cat X n_cat X n_cat X dim_z2, and all elements in the n_cat row should be set to NA.
naive_gamma2_prior_parameters: A numeric list of prior distribution parameters for the naive model \(\gamma^{(2)}\) terms. For prior distributions "t", "uniform", "normal", or "dexp", the first element of the list should contain an array of location, lower bound, mean, or shape parameters, respectively, for naive \(\gamma^{(2)}\) terms. For prior distributions "t", "uniform", "normal", or "dexp", the second element of the list should contain an array of shape, upper bound, standard deviation, or scale parameters, respectively, for naive \(\gamma^{(2)}\) terms. For prior distribution "t", the third element of the list should contain an array of the degrees of freedom for naive \(\gamma^{(2)}\) terms. The third list element should be empty for all other prior distributions. All arrays in the list should have dimensions n_cat X n_cat X dim_z2, and all elements in the n_cat row should be set to NA. Note that prior distributions for the naive \(\beta\) terms are inherted from the beta_prior_parameters argument.
number_MCMC_chains: An integer specifying the number of MCMC chains to compute. The default is 4.
MCMC_sample: An integer specifying the number of MCMC samples to draw. The default is 2000.
burn_in: An integer specifying the number of MCMC samples to discard for the burn-in period. The default is 1000.
display_progress: A logical value specifying whether messages should be displayed during model compilation. The default is TRUE.

Value

COMBO_MCMC_2stage returns a list of the posterior samples and posterior means for both the binary outcome misclassification model and a naive logistic regression of the observed outcome, Y*, predicted by the matrix x. The list contains the following components:

posterior_sample_df: A data frame containing three columns. The first column indicates the chain from which a sample is taken, from 1 to number_MCMC_chains. The second column specifies the parameter associated with a given row. \(\beta\) terms have dimensions dim_x X n_cat. The \(\gamma^{(1)}\) terms have dimensions n_cat X n_cat X dim_z1, where the first index specifies the first-stage observed outcome category and the second index specifies the true outcome category. The \(\gamma^{(2)}\) terms have dimensions n_cat X n_cat X n_cat X dim_z2, where the first index specifies the second-stage observed outcome category, the second index specifies the first-stage observed outcome category, and the third index specifies the true outcome category. The final column provides the MCMC sample.
posterior_means_df: A data frame containing three columns. The first column specifies the parameter associated with a given row. Parameters are indexed as in the posterior_sample_df. The second column provides the posterior mean computed across all chains and all samples. The final column provides the posterior median computed across all chains and all samples.
naive_posterior_sample_df: A data frame containing three columns. The first column indicates the chain from which a sample is taken, from 1 to number_MCMC_chains. The second column specifies the parameter associated with a given row. Naive \(\beta\) terms have dimensions dim_x X n_cat. The final column provides the MCMC sample.
naive_posterior_means_df: A data frame containing three columns. The first column specifies the naive parameter associated with a given row. Parameters are indexed as in the naive_posterior_sample_df. The second column provides the posterior mean computed across all chains and all samples. The final column provides the posterior median computed across all chains and all samples.

Examples

# \donttest{

# Helper functions
sum_every_n <- function(x, n){
vector_groups = split(x,
                      ceiling(seq_along(x) / n))
sum_x = Reduce(`+`, vector_groups)

return(sum_x)
}

sum_every_n1 <- function(x, n){
vector_groups = split(x,
                      ceiling(seq_along(x) / n))
sum_x = Reduce(`+`, vector_groups) + 1

return(sum_x)
}

# Example

set.seed(123)
n <- 1000
x_mu <- 0
x_sigma <- 1
z1_shape <- 1
z2_shape <- 1

true_beta <- matrix(c(1, -2), ncol = 1)
true_gamma1 <- matrix(c(.5, 1, -.5, -1), nrow = 2, byrow = FALSE)
true_gamma2 <- array(c(1.5, 1, .5, .5, -.5, 0, -1, -1), dim = c(2, 2, 2))

x_matrix = matrix(rnorm(n, x_mu, x_sigma), ncol = 1)
X = matrix(c(rep(1, n), x_matrix[,1]), ncol = 2, byrow = FALSE)
z1_matrix = matrix(rgamma(n, z1_shape), ncol = 1)
Z1 = matrix(c(rep(1, n), z1_matrix[,1]), ncol = 2, byrow = FALSE)
z2_matrix = matrix(rgamma(n, z2_shape), ncol = 1)
Z2 = matrix(c(rep(1, n), z2_matrix[,1]), ncol = 2, byrow = FALSE)

exp_xb = exp(X %*% true_beta)
pi_result = exp_xb[,1] / (exp_xb[,1] + 1)
pi_matrix = matrix(c(pi_result, 1 - pi_result), ncol = 2, byrow = FALSE)

true_Y <- rep(NA, n)
for(i in 1:n){
    true_Y[i] = which(stats::rmultinom(1, 1, pi_matrix[i,]) == 1)
}

exp_z1g1 = exp(Z1 %*% true_gamma1)
pistar1_denominator = matrix(c(1 + exp_z1g1[,1], 1 + exp_z1g1[,2]),
                             ncol = 2, byrow = FALSE)
pistar1_result = exp_z1g1 / pistar1_denominator

pistar1_matrix = matrix(c(pistar1_result[,1], 1 - pistar1_result[,1],
                          pistar1_result[,2], 1 - pistar1_result[,2]),
                        ncol = 2, byrow = FALSE)


obs_Y1 <- rep(NA, n)
for(i in 1:n){
    true_j = true_Y[i]
    obs_Y1[i] = which(rmultinom(1, 1,
                      pistar1_matrix[c(i, n + i),
                                     true_j]) == 1)
 }

Ystar1 <- obs_Y1

exp_z2g2_1 = exp(Z2 %*% true_gamma2[,,1])
exp_z2g2_2 = exp(Z2 %*% true_gamma2[,,2])

pi_denominator1 = apply(exp_z2g2_1, FUN = sum_every_n1, n, MARGIN = 2)
pi_result1 = exp_z2g2_1 / rbind(pi_denominator1)

pi_denominator2 = apply(exp_z2g2_2, FUN = sum_every_n1, n, MARGIN = 2)
pi_result2 = exp_z2g2_2 / rbind(pi_denominator2)

pistar2_matrix1 = rbind(pi_result1,
                        1 - apply(pi_result1,
                                  FUN = sum_every_n, n = n,
                                  MARGIN = 2))

pistar2_matrix2 = rbind(pi_result2,
                        1 - apply(pi_result2,
                                  FUN = sum_every_n, n = n,
                                  MARGIN = 2))

pistar2_array = array(c(pistar2_matrix1, pistar2_matrix2),
                   dim = c(dim(pistar2_matrix1), 2))

obs_Y2 <- rep(NA, n)
for(i in 1:n){
    true_j = true_Y[i]
    obs_k = Ystar1[i]
    obs_Y2[i] = which(rmultinom(1, 1,
                                  pistar2_array[c(i,n+ i),
                                                obs_k, true_j]) == 1)
}

Ystar2 <- obs_Y2

unif_lower_beta <- matrix(c(-5, -5, NA, NA), nrow = 2, byrow = TRUE)
unif_upper_beta <- matrix(c(5, 5, NA, NA), nrow = 2, byrow = TRUE)

unif_lower_gamma1 <- array(data = c(-5, NA, -5, NA, -5, NA, -5, NA),
                          dim = c(2,2,2))
unif_upper_gamma1 <- array(data = c(5, NA, 5, NA, 5, NA, 5, NA),
                          dim = c(2,2,2))

unif_upper_gamma2 <- array(rep(c(5, NA), 8), dim = c(2,2,2,2))
unif_lower_gamma2 <- array(rep(c(-5, NA), 8), dim = c(2,2,2,2))

unif_lower_naive_gamma2 <- array(data = c(-5, NA, -5, NA, -5, NA, -5, NA),
                                dim = c(2,2,2))
unif_upper_naive_gamma2 <- array(data = c(5, NA, 5, NA, 5, NA, 5, NA),
                                dim = c(2,2,2))

beta_prior_parameters <- list(lower = unif_lower_beta, upper = unif_upper_beta)
gamma1_prior_parameters <- list(lower = unif_lower_gamma1, upper = unif_upper_gamma1)
gamma2_prior_parameters <- list(lower = unif_lower_gamma2, upper = unif_upper_gamma2)
naive_gamma2_prior_parameters <- list(lower = unif_lower_naive_gamma2,
                                      upper = unif_upper_naive_gamma2)

MCMC_results <- COMBO_MCMC_2stage(Ystar1, Ystar2,
                                  x_matrix = x_matrix, z1_matrix = z1_matrix,
                                  z2_matrix = z2_matrix,
                                  prior = "uniform",
                                  beta_prior_parameters = beta_prior_parameters,
                                  gamma1_prior_parameters = gamma1_prior_parameters,
                                  gamma2_prior_parameters = gamma2_prior_parameters,
                                  naive_gamma2_prior_parameters = naive_gamma2_prior_parameters,
                                  number_MCMC_chains = 2,
                                  MCMC_sample = 200, burn_in = 100)
#> Compiling model graph
#>    Resolving undeclared variables
#>    Allocating nodes
#> Graph information:
#>    Observed stochastic nodes: 2000
#>    Unobserved stochastic nodes: 14
#>    Total graph size: 81063
#> 
#> Initializing model
#> 
#> Compiling model graph
#>    Resolving undeclared variables
#>    Allocating nodes
#> Graph information:
#>    Observed stochastic nodes: 2000
#>    Unobserved stochastic nodes: 6
#>    Total graph size: 36030
#> 
#> Initializing model
#> 
MCMC_results$posterior_means_df# }
#> # A tibble: 14 × 3
#>    parameter_name  posterior_mean posterior_median
#>    <fct>                    <dbl>            <dbl>
#>  1 beta[1,1]                1.59             1.60 
#>  2 beta[1,2]               -2.18            -2.13 
#>  3 gamma1[1,1,1]            0.340            0.329
#>  4 gamma1[1,2,1]           -1.17            -1.15 
#>  5 gamma1[1,1,2]            1.20             1.21 
#>  6 gamma1[1,2,2]           -2.15            -2.05 
#>  7 gamma2[1,1,1,1]          0.853            0.766
#>  8 gamma2[1,2,1,1]          0.578            0.368
#>  9 gamma2[1,1,2,1]         -0.342           -0.383
#> 10 gamma2[1,2,2,1]         -0.933           -0.826
#> 11 gamma2[1,1,1,2]          1.18             1.17 
#> 12 gamma2[1,2,1,2]          2.65             2.59 
#> 13 gamma2[1,1,2,2]         -0.656           -0.896
#> 14 gamma2[1,2,2,2]         -2.48            -2.41