Masking certain entries from imputation model
2026-01-20
Source: vignettes/do_not_impute_matrix.rmd
In some cases we want to tell the CISSVAE model not to impute certain missing entries. For example, we may not want to impute patient biomarker values for collection dates after the date of death.
We can use the imputable_matrix argument in the
run_cissvae() function to mask these values. The binary
imputable_matrix should have the same dimensions as the
input dataframe, with each cell containing 1 (impute if missing) or 0
(do not impute, even if missing).
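To make the layout concrete, here is a minimal sketch on toy data. The objects toy and toy_mask are hypothetical and are not part of rCISSVAE; the sketch only illustrates how a 0/1 matrix with matching dimensions separates missing cells that may be imputed from those that must stay missing.

```r
# Toy data with missing values (hypothetical; not the package data)
toy <- data.frame(
  id = c("A", "B", "C"),
  x1 = c(0.5, NA, 1.2),
  x2 = c(NA, NA, 0.3)
)

# Start from "impute everything": same dimensions and column names, all 1s
toy_mask <- as.data.frame(
  matrix(1L, nrow = nrow(toy), ncol = ncol(toy),
         dimnames = list(NULL, colnames(toy)))
)

# Mark one cell as "do not impute, even if missing"
toy_mask[2, "x2"] <- 0L

# Missing cells that are eligible for imputation
to_impute <- is.na(toy) & toy_mask == 1

# Missing cells that should remain NA
skipped <- is.na(toy) & toy_mask == 0
```

Here to_impute flags the missing x2 value for A and the missing x1 value for B, while skipped flags only the x2 value for B.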
Let’s use a simple survival dataset as an example. The mock survival dataset and its imputable_matrix (dni) can be loaded using the data function.
library(rCISSVAE)
library(dplyr)      # provides the %>% pipe
library(gtsummary)  # provides tbl_summary() and as_kable()
data(dni, mock_surv)
mock_surv %>%
  tbl_summary(include = -c("patient_id")) %>%
  as_kable()

| Characteristic | N = 200 |
|---|---|
| death_event | 30 (15%) |
| death_year | |
| 0.5 | 3 (10%) |
| 1 | 4 (13%) |
| 2 | 1 (3.3%) |
| 2.5 | 4 (13%) |
| 3 | 6 (20%) |
| 3.5 | 4 (13%) |
| 4 | 2 (6.7%) |
| 4.5 | 6 (20%) |
| Unknown | 170 |
| biomarker1_m00 | 0.08 (-0.27, 0.40) |
| Unknown | 68 |
| biomarker1_m06 | 0.20 (-0.14, 0.54) |
| Unknown | 67 |
| biomarker1_m12 | 0.31 (-0.02, 0.63) |
| Unknown | 67 |
| biomarker1_m18 | 0.41 (0.05, 0.83) |
| Unknown | 65 |
| biomarker1_m24 | 0.64 (0.22, 0.88) |
| Unknown | 70 |
| biomarker1_m30 | 0.71 (0.29, 1.09) |
| Unknown | 66 |
| biomarker1_m36 | 0.83 (0.45, 1.23) |
| Unknown | 76 |
| biomarker1_m42 | 0.95 (0.41, 1.36) |
| Unknown | 71 |
| biomarker1_m48 | 1.14 (0.61, 1.61) |
| Unknown | 79 |
| biomarker1_m54 | 1.22 (0.71, 1.56) |
| Unknown | 90 |
| biomarker1_m60 | 1.33 (0.73, 1.80) |
| Unknown | 89 |
| biomarker2_m00 | 0.01 (-0.29, 0.39) |
| Unknown | 64 |
| biomarker2_m06 | 0.06 (-0.35, 0.34) |
| Unknown | 65 |
| biomarker2_m12 | -0.03 (-0.28, 0.32) |
| Unknown | 55 |
| biomarker2_m18 | -0.06 (-0.35, 0.31) |
| Unknown | 64 |
| biomarker2_m24 | -0.10 (-0.44, 0.23) |
| Unknown | 73 |
| biomarker2_m30 | -0.22 (-0.60, 0.22) |
| Unknown | 60 |
| biomarker2_m36 | -0.23 (-0.66, 0.18) |
| Unknown | 61 |
| biomarker2_m42 | -0.33 (-0.71, 0.09) |
| Unknown | 72 |
| biomarker2_m48 | -0.42 (-0.91, -0.03) |
| Unknown | 73 |
| biomarker2_m54 | -0.42 (-0.95, 0.01) |
| Unknown | 81 |
| biomarker2_m60 | -0.44 (-1.04, 0.00) |
| Unknown | 81 |
| biomarker3_m00 | 0.08 (-0.22, 0.45) |
| Unknown | 65 |
| biomarker3_m06 | 0.08 (-0.26, 0.39) |
| Unknown | 58 |
| biomarker3_m12 | 0.09 (-0.30, 0.39) |
| Unknown | 66 |
| biomarker3_m18 | 0.06 (-0.28, 0.51) |
| Unknown | 65 |
| biomarker3_m24 | 0.09 (-0.32, 0.45) |
| Unknown | 64 |
| biomarker3_m30 | 0.10 (-0.30, 0.48) |
| Unknown | 64 |
| biomarker3_m36 | -0.02 (-0.34, 0.41) |
| Unknown | 76 |
| biomarker3_m42 | 0.09 (-0.21, 0.45) |
| Unknown | 79 |
| biomarker3_m48 | 0.15 (-0.48, 0.53) |
| Unknown | 95 |
| biomarker3_m54 | 0.13 (-0.40, 0.54) |
| Unknown | 91 |
| biomarker3_m60 | 0.06 (-0.49, 0.58) |
| Unknown | 93 |
Understanding the ‘Imputable’ matrix
Row of mock_surv for patient P007:
| patient_id | death_event | death_year | biomarker1_m00 | biomarker1_m06 | biomarker1_m12 | biomarker1_m18 | biomarker1_m24 | biomarker1_m30 | biomarker1_m36 | biomarker1_m42 | biomarker1_m48 | biomarker1_m54 | biomarker1_m60 | biomarker2_m00 | biomarker2_m06 | biomarker2_m12 | biomarker2_m18 | biomarker2_m24 | biomarker2_m30 | biomarker2_m36 | biomarker2_m42 | biomarker2_m48 | biomarker2_m54 | biomarker2_m60 | biomarker3_m00 | biomarker3_m06 | biomarker3_m12 | biomarker3_m18 | biomarker3_m24 | biomarker3_m30 | biomarker3_m36 | biomarker3_m42 | biomarker3_m48 | biomarker3_m54 | biomarker3_m60 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| P007 | 1 | 3 | -0.1646875 | NA | 0.3190216 | 0.5829264 | 0.7385894 | 0.9623325 | 1.07251 | NA | NA | NA | NA | 0.0623253 | 0.0624611 | 0.0732599 | 0.2839856 | 0.0027528 | 0.0585718 | NA | NA | NA | NA | NA | 0.0546087 | NA | NA | 0.1709268 | 0.3963251 | NA | NA | NA | NA | NA | NA |

Corresponding row of the imputable matrix (dni):
| patient_id | death_event | death_year | biomarker1_m00 | biomarker1_m06 | biomarker1_m12 | biomarker1_m18 | biomarker1_m24 | biomarker1_m30 | biomarker1_m36 | biomarker1_m42 | biomarker1_m48 | biomarker1_m54 | biomarker1_m60 | biomarker2_m00 | biomarker2_m06 | biomarker2_m12 | biomarker2_m18 | biomarker2_m24 | biomarker2_m30 | biomarker2_m36 | biomarker2_m42 | biomarker2_m48 | biomarker2_m54 | biomarker2_m60 | biomarker3_m00 | biomarker3_m06 | biomarker3_m12 | biomarker3_m18 | biomarker3_m24 | biomarker3_m30 | biomarker3_m36 | biomarker3_m42 | biomarker3_m48 | biomarker3_m54 | biomarker3_m60 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 |
If we look at the entry for P007, who has a death event at year 3, we can see that the biomarker entries after month 36 (i.e., year 3) are marked as non-imputable.
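The rule visible in the P007 row can be sketched in code. The source does not say how dni was actually built, so the following is only one plausible construction, assuming that a collection month strictly greater than 12 times death_year counts as "after death" and that the month is encoded in the _mNN column suffix. The objects toy_surv and toy_dni are hypothetical.

```r
# Hypothetical toy data in the same layout as mock_surv (values are made up)
toy_surv <- data.frame(
  patient_id = c("T1", "T2"),
  death_event = c(1, 0),
  death_year = c(1, NA),
  biomarker1_m00 = c(0.1, NA),
  biomarker1_m12 = c(NA, 0.4),
  biomarker1_m24 = c(NA, NA),
  stringsAsFactors = FALSE
)

# Start with every cell imputable
toy_dni <- matrix(
  1L, nrow = nrow(toy_surv), ncol = ncol(toy_surv),
  dimnames = list(NULL, colnames(toy_surv))
)

# Biomarker columns carry the collection month in their "_mNN" suffix
bio_cols <- grep("_m[0-9]+$", colnames(toy_surv), value = TRUE)
months <- as.integer(sub(".*_m([0-9]+)$", "\\1", bio_cols))

# Month of death for patients with an event; NA for everyone else
death_month <- ifelse(toy_surv$death_event == 1, toy_surv$death_year * 12, NA)

# Set cells to 0 where the collection month falls after the month of death
for (j in seq_along(bio_cols)) {
  after_death <- !is.na(death_month) & months[j] > death_month
  toy_dni[after_death, bio_cols[j]] <- 0L
}
```

In this toy example, patient T1 dies at year 1, so only biomarker1_m24 is set to 0 for that row. The month-12 column stays imputable, mirroring how month 36 stays imputable for P007.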
When running the CISSVAE model with this dataset, it is important
that the dataset and the dni matrix have the same shape and column names.
This allows us to take advantage of the index_col and
columns_ignore arguments. We pass the dni matrix through
the imputable_matrix argument.
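Because a mismatch in shape or column order is easy to miss, a quick check before the call can help. The helper below, check_mask_alignment, is a hypothetical function written for this sketch and is not part of rCISSVAE. It is shown on toy data and simply encodes the requirement stated above.

```r
# Hypothetical helper (not part of rCISSVAE): fail early if the mask and
# the data are misaligned. `skip` names columns excluded from the 0/1 check.
check_mask_alignment <- function(data, mask, skip = character(0)) {
  if (!identical(dim(data), dim(mask))) {
    stop("data and mask have different dimensions")
  }
  if (!identical(colnames(data), colnames(mask))) {
    stop("data and mask have different column names or column order")
  }
  value_cols <- setdiff(colnames(mask), skip)
  vals <- unlist(mask[, value_cols, drop = FALSE], use.names = FALSE)
  if (!all(vals %in% c(0, 1))) {
    stop("mask cells must be 0 or 1")
  }
  invisible(TRUE)
}

# Toy data and masks (made up for illustration)
toy <- data.frame(id = c("A", "B"), x1 = c(NA, 1.5), x2 = c(2.0, NA))
good_mask <- data.frame(id = c(1, 1), x1 = c(1, 1), x2 = c(1, 0))
bad_mask <- good_mask[, c("id", "x2", "x1")]  # same columns, wrong order

check_mask_alignment(toy, good_mask)  # passes silently

# The reordered mask is rejected; capture the error message for inspection
msg <- tryCatch(
  check_mask_alignment(toy, bad_mask),
  error = function(e) conditionMessage(e)
)
```

With the vignette's objects, the analogous call would presumably be check_mask_alignment(mock_surv, dni), possibly with skip = "patient_id" if that column of dni does not hold 0/1 values.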
res <- run_cissvae(
  mock_surv,
  index_col = "patient_id",
  columns_ignore = c("death_event", "death_year"),
  imputable_matrix = dni,
  val_proportion = 0.3,
  return_clusters = FALSE,
  return_history = FALSE,
  epochs = 100,
  leiden_resolution = 0.01,
  k_neighbors = 5,
  return_silhouettes = FALSE
)
res$imputed_dataset[1:7, ] %>% kableExtra::kable()

When we unpack the results and look at the imputed dataset, we can see that P007 still has NAs for biomarkers past 36 months.
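Rather than inspecting rows by eye, the same property can be checked programmatically. The sketch below uses toy data and a stand-in imputer based on column means, which is not the CISSVAE model; it exists only so the check has an "imputed" dataset to inspect. All object names are hypothetical.

```r
# Toy data and mask (made up for illustration)
toy <- data.frame(x1 = c(1, NA, 3), x2 = c(NA, 2, NA))
mask <- data.frame(x1 = c(1, 1, 1), x2 = c(1, 1, 0))

# Stand-in imputer: fill eligible missing cells with the column mean.
# This is NOT CISSVAE; it only mimics "impute where the mask allows it".
fake_imputed <- toy
for (col in colnames(toy)) {
  fill <- is.na(toy[[col]]) & mask[[col]] == 1
  fake_imputed[[col]][fill] <- mean(toy[[col]], na.rm = TRUE)
}

# Cells that were missing and marked "do not impute"
protected <- is.na(as.matrix(toy)) & as.matrix(mask) == 0

# Check 1: every protected cell is still NA after imputation
still_missing <- all(is.na(as.matrix(fake_imputed)[protected]))

# Check 2: every other cell now has a value
filled <- !anyNA(as.matrix(fake_imputed)[!protected])
```

With the vignette's objects, the analogous comparison would involve mock_surv, dni, and res$imputed_dataset, assuming the returned dataset keeps the same row order and columns as the input, which the source does not confirm.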