Masking certain entries from the imputation model
2025-08-14
Source: vignettes/do_not_impute_matrix.rmd
There are cases in which we want to tell CISSVAE not to impute certain missing entries. For example, we would not want to impute a patient's biomarker values for collection dates after the date of death.
We can use the do_not_impute_matrix argument in the run_cissvae() function to mask these values. The binary do_not_impute_matrix should have the same dimensions as the input dataframe, with each cell containing 0 (impute if missing) or 1 (do not impute, even if missing).
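To make the layout concrete, here is a minimal sketch of a 0/1 mask built by hand for a tiny made-up data frame. The names toy and toy_dni and all values are illustrative assumptions, not objects shipped with the package.

```r
# Toy data: 3 patients, 2 visits of one biomarker (all values are made up)
toy <- data.frame(
  patient_id = c("A", "B", "C"),
  marker_m00 = c(0.1, NA, 0.3),
  marker_m06 = c(NA, 0.2, NA)
)

# Start from an all-zero mask with the same shape and column names as the data
toy_dni <- as.data.frame(matrix(0, nrow = nrow(toy), ncol = ncol(toy)))
names(toy_dni) <- names(toy)

# Flag a single cell: patient "C" at month 6 should stay missing
toy_dni[toy$patient_id == "C", "marker_m06"] <- 1

print(toy_dni)
```

Every other missing cell keeps a 0 in the mask, so it remains eligible for imputation.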
Let’s use a simple survival dataset as an example. The mock survival dataset (mock_surv) and its do-not-impute matrix (dni) can be loaded using the data() function.
library(rCISSVAE)
library(tidyverse)
library(ggplot2)
library(gtsummary)
data(dni, mock_surv)
mock_surv %>%
  tbl_summary(include = -c("patient_id")) %>%
  as_kable()
Characteristic | N = 200 |
---|---|
death_event | 30 (15%) |
death_year | |
0.5 | 3 (10%) |
1 | 4 (13%) |
2 | 1 (3.3%) |
2.5 | 4 (13%) |
3 | 6 (20%) |
3.5 | 4 (13%) |
4 | 2 (6.7%) |
4.5 | 6 (20%) |
Unknown | 170 |
biomarker1_m00 | 0.08 (-0.27, 0.40) |
Unknown | 68 |
biomarker1_m06 | 0.20 (-0.14, 0.54) |
Unknown | 67 |
biomarker1_m12 | 0.31 (-0.02, 0.63) |
Unknown | 67 |
biomarker1_m18 | 0.41 (0.05, 0.83) |
Unknown | 65 |
biomarker1_m24 | 0.64 (0.22, 0.88) |
Unknown | 70 |
biomarker1_m30 | 0.71 (0.29, 1.09) |
Unknown | 66 |
biomarker1_m36 | 0.83 (0.45, 1.23) |
Unknown | 76 |
biomarker1_m42 | 0.95 (0.41, 1.36) |
Unknown | 71 |
biomarker1_m48 | 1.14 (0.61, 1.61) |
Unknown | 79 |
biomarker1_m54 | 1.22 (0.71, 1.56) |
Unknown | 90 |
biomarker1_m60 | 1.33 (0.73, 1.80) |
Unknown | 89 |
biomarker2_m00 | 0.01 (-0.29, 0.39) |
Unknown | 64 |
biomarker2_m06 | 0.06 (-0.35, 0.34) |
Unknown | 65 |
biomarker2_m12 | -0.03 (-0.28, 0.32) |
Unknown | 55 |
biomarker2_m18 | -0.06 (-0.35, 0.31) |
Unknown | 64 |
biomarker2_m24 | -0.10 (-0.44, 0.23) |
Unknown | 73 |
biomarker2_m30 | -0.22 (-0.60, 0.22) |
Unknown | 60 |
biomarker2_m36 | -0.23 (-0.66, 0.18) |
Unknown | 61 |
biomarker2_m42 | -0.33 (-0.71, 0.09) |
Unknown | 72 |
biomarker2_m48 | -0.42 (-0.91, -0.03) |
Unknown | 73 |
biomarker2_m54 | -0.42 (-0.95, 0.01) |
Unknown | 81 |
biomarker2_m60 | -0.44 (-1.04, 0.00) |
Unknown | 81 |
biomarker3_m00 | 0.08 (-0.22, 0.45) |
Unknown | 65 |
biomarker3_m06 | 0.08 (-0.26, 0.39) |
Unknown | 58 |
biomarker3_m12 | 0.09 (-0.30, 0.39) |
Unknown | 66 |
biomarker3_m18 | 0.06 (-0.28, 0.51) |
Unknown | 65 |
biomarker3_m24 | 0.09 (-0.32, 0.45) |
Unknown | 64 |
biomarker3_m30 | 0.10 (-0.30, 0.48) |
Unknown | 64 |
biomarker3_m36 | -0.02 (-0.34, 0.41) |
Unknown | 76 |
biomarker3_m42 | 0.09 (-0.21, 0.45) |
Unknown | 79 |
biomarker3_m48 | 0.15 (-0.48, 0.53) |
Unknown | 95 |
biomarker3_m54 | 0.13 (-0.40, 0.54) |
Unknown | 91 |
biomarker3_m60 | 0.06 (-0.49, 0.58) |
Unknown | 93 |
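If we look at the entry for P007, which has a death event at year 3, the biomarker entries after month 36 (i.e., year 3) are marked as non-imputable in the dni matrix. Note that the printed tibbles below are truncated to the first six columns, so the flagged cells fall among the "30 more variables".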
print(mock_surv[7,])
## # A tibble: 1 × 36
## patient_id death_event death_year biomarker1_m00 biomarker1_m06 biomarker1_m12
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 P007 1 3 -0.165 NA 0.319
## # ℹ 30 more variables: biomarker1_m18 <dbl>, biomarker1_m24 <dbl>,
## # biomarker1_m30 <dbl>, biomarker1_m36 <dbl>, biomarker1_m42 <dbl>,
## # biomarker1_m48 <dbl>, biomarker1_m54 <dbl>, biomarker1_m60 <dbl>,
## # biomarker2_m00 <dbl>, biomarker2_m06 <dbl>, biomarker2_m12 <dbl>,
## # biomarker2_m18 <dbl>, biomarker2_m24 <dbl>, biomarker2_m30 <dbl>,
## # biomarker2_m36 <dbl>, biomarker2_m42 <dbl>, biomarker2_m48 <dbl>,
## # biomarker2_m54 <dbl>, biomarker2_m60 <dbl>, biomarker3_m00 <dbl>, …
print(dni[7,])
## # A tibble: 1 × 36
## patient_id death_event death_year biomarker1_m00 biomarker1_m06 biomarker1_m12
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0 0 0 0 0 0
## # ℹ 30 more variables: biomarker1_m18 <dbl>, biomarker1_m24 <dbl>,
## # biomarker1_m30 <dbl>, biomarker1_m36 <dbl>, biomarker1_m42 <dbl>,
## # biomarker1_m48 <dbl>, biomarker1_m54 <dbl>, biomarker1_m60 <dbl>,
## # biomarker2_m00 <dbl>, biomarker2_m06 <dbl>, biomarker2_m12 <dbl>,
## # biomarker2_m18 <dbl>, biomarker2_m24 <dbl>, biomarker2_m30 <dbl>,
## # biomarker2_m36 <dbl>, biomarker2_m42 <dbl>, biomarker2_m48 <dbl>,
## # biomarker2_m54 <dbl>, biomarker2_m60 <dbl>, biomarker3_m00 <dbl>, …
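Because the tibble print-out hides most columns, it can help to list the flagged columns for a row directly. The sketch below does this on made-up data and also shows one way such a mask could be derived from the death time. The vignette does not state how the packaged dni was generated, so the rule used here (flag visits strictly after death_year * 12 months, with the visit month parsed from the "_mNN" suffix) is an assumption, and toy, toy_dni, and visit_month are illustrative names.

```r
# Toy data (made-up values): death_year is NA for patients without a death event
toy <- data.frame(
  patient_id  = c("A", "B"),
  death_event = c(0, 1),
  death_year  = c(NA, 0.5),
  marker_m00  = c(0.1, 0.2),
  marker_m06  = c(NA, 0.4),
  marker_m12  = c(0.3, NA)
)

# Recover the visit month from the "_mNN" suffix; other columns become NA
visit_month <- suppressWarnings(
  as.numeric(sub(".*_m([0-9]+)$", "\\1", names(toy)))
)

# Assumed rule: flag a cell when the visit month is strictly after death
after_death <- outer(toy$death_year * 12, visit_month, FUN = "<")
after_death[is.na(after_death)] <- FALSE

toy_dni <- as.data.frame(after_death * 1)
names(toy_dni) <- names(toy)

# List the flagged columns for one row instead of relying on printed output
flagged <- names(toy_dni)[unlist(toy_dni[2, ]) == 1]
print(flagged)
```

The same last step, applied to row 7 of dni, would list the columns flagged for P007.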
When running the CISSVAE model with this dataset, it is important that the dataset and the dni matrix have the same shape and column names. This allows us to take advantage of the index_col and columns_ignore arguments. We pass the dni matrix through the do_not_impute_matrix argument.
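A quick pre-flight check can catch a mismatched mask before the model is run. The helper below is a hypothetical sketch, not part of rCISSVAE; it only encodes the requirements stated above (same shape, same column names, entries of 0 or 1) and is demonstrated on made-up data.

```r
# Hypothetical helper (not part of rCISSVAE): validate a mask against its data
check_dni <- function(data, dni) {
  if (!identical(dim(data), dim(dni))) {
    stop("data and mask must have the same dimensions")
  }
  if (!identical(names(data), names(dni))) {
    stop("data and mask must have the same column names, in the same order")
  }
  if (!all(as.matrix(dni) %in% c(0, 1))) {
    stop("mask must contain only 0 and 1")
  }
  invisible(TRUE)
}

# Made-up data and a matching mask
toy <- data.frame(patient_id = c("A", "B"), marker_m00 = c(0.1, NA))
toy_dni <- data.frame(patient_id = c(0, 0), marker_m00 = c(0, 1))

ok <- check_dni(toy, toy_dni)

# A mask with a missing column is rejected
bad <- try(check_dni(toy, toy_dni["patient_id"]), silent = TRUE)
```

Calling the same helper as check_dni(mock_surv, dni) before run_cissvae() would apply these checks to the example dataset.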
res <- run_cissvae(
  mock_surv,
  index_col = "patient_id",
  columns_ignore = c("death_event", "death_year"),
  do_not_impute_matrix = dni,
  val_proportion = 0.3,
  return_clusters = FALSE,
  return_history = FALSE,
  epochs = 100,
  leiden_resolution = 0.01,
  k_neighbors = 5,
  return_silhouettes = FALSE
)
## Cluster dataset:
## ClusterDataset(n_samples=200, n_features=35, n_clusters=53)
## • Original missing: 2543 / 6600 (38.53%)
## • Validation held-out: 2077 (51.20% of non-missing)
## • .data shape: (200, 35)
## • .masks shape: (200, 35)
## • .val_data shape: (200, 35)
When we unpack the results and look at the imputed dataset, we can see that P007 (row label 6 in the output below) still has missing values, printed as NaN, for biomarkers after month 36.
print(res$imputed_dataset[1:7,])
## patient_id death_event death_year biomarker1_m00 biomarker1_m06
## 0 P001 0 2.816667 0.08017647 -0.07296034
## 1 P002 0 2.816667 0.08017647 0.20198478
## 2 P003 0 2.816667 0.08017647 -0.34786278
## 3 P004 0 2.816667 0.08017647 0.17267427
## 4 P005 0 2.816667 0.82078469 1.48447073
## 5 P006 0 2.816667 0.07983060 0.24580511
## 6 P007 1 3.000000 -0.16468754 0.17267427
## biomarker1_m12 biomarker1_m18 biomarker1_m24 biomarker1_m30 biomarker1_m36
## 0 0.3055585 0.2332004 0.5508518 0.6703717 0.8551518
## 1 0.3008975 0.3917370 0.4851941 0.6703717 0.7560281
## 2 -0.2809101 -0.3463161 -0.1221673 0.1957781 0.1373877
## 3 0.3008975 0.8257963 0.5508518 0.6703717 1.1511669
## 4 1.5075537 1.4269625 1.7042559 1.7166684 1.8634248
## 5 0.3008975 0.2171040 0.2917295 0.4103813 0.7858086
## 6 0.3190216 0.5829263 0.7385894 0.9623325 1.0725095
## biomarker1_m42 biomarker1_m48 biomarker1_m54 biomarker1_m60 biomarker2_m00
## 0 0.8027524 0.9459516 1.324799 1.2813294 -0.28330877
## 1 0.8027524 1.3371121 1.332744 1.2813294 -0.26420641
## 2 0.1664135 0.1874206 1.332744 0.6016288 0.02322435
## 3 0.8027524 1.2363808 1.376007 1.4586281 0.80775446
## 4 1.9726140 2.1577544 2.650625 2.5509279 1.01649344
## 5 0.4664922 0.6859184 1.073819 0.8490065 -0.09670670
## 6 NaN NaN NaN NaN 0.06232529
## biomarker2_m06 biomarker2_m12 biomarker2_m18 biomarker2_m24 biomarker2_m30
## 0 0.04395184 -0.524100661 -0.04306518 -0.426104784 -0.55038774
## 1 -0.21099393 -0.060090043 -0.12981604 -0.138382033 -0.22393437
## 2 0.04395184 -0.338102847 -0.91641790 -0.669931769 -0.15691766
## 3 0.04395184 0.330412626 0.50364304 0.547709584 0.56977868
## 4 0.04395184 0.913325489 0.51596826 -0.138382033 0.57457983
## 5 -0.08690932 -0.008838233 -0.23246098 -0.138382033 -0.15691766
## 6 0.06246109 0.073259890 0.28398558 0.002752766 0.05857181
## biomarker2_m36 biomarker2_m42 biomarker2_m48 biomarker2_m54 biomarker2_m60
## 0 -0.48500043 -0.38082999 -0.8269995 -1.1647223 -0.5343538
## 1 -0.08213539 -0.28418347 -0.3139518 -0.5520591 -0.5343538
## 2 -1.16175210 -0.28418347 -1.3168920 -1.2316631 -1.4466560
## 3 0.43870801 -0.28418347 0.7032660 -0.5520591 0.3902867
## 4 0.12197453 -0.05538718 -0.2674445 -0.4097872 -0.4382522
## 5 -0.44593602 -0.31419751 -0.6617219 -0.5807806 -0.7427507
## 6 -0.36641607 NaN NaN NaN NaN
## biomarker3_m00 biomarker3_m06 biomarker3_m12 biomarker3_m18 biomarker3_m24
## 0 -0.33970779 -0.321329862 0.02692955 0.04827372 -0.35563889
## 1 -0.19654533 0.007062256 -0.30070391 -0.04024491 0.07211136
## 2 0.10233848 0.054567128 0.02692955 -0.02368132 0.08414590
## 3 0.45304477 0.054567128 0.52907348 0.31234542 0.16816741
## 4 1.09130454 0.054567128 1.03140509 0.83404821 0.16816741
## 5 0.10233848 -0.201686859 -0.12894012 -0.20327407 -0.20174129
## 6 0.05460874 0.054567128 0.02692955 0.17092685 0.39632505
## biomarker3_m30 biomarker3_m36 biomarker3_m42 biomarker3_m48 biomarker3_m54
## 0 -0.13013658 -0.16296671 -0.1260692775 -0.053942729 -0.15213469
## 1 0.06548417 -0.03495115 -0.0235729553 0.007716056 -0.02886459
## 2 0.14724579 0.03879652 0.0770729780 0.007716056 0.60639662
## 3 0.09883979 -0.03495115 -0.0008705333 0.007716056 -0.21237339
## 4 0.70956743 -0.03495115 0.8746145964 0.552908182 0.96187115
## 5 0.14724579 -0.02289455 -0.2137401998 -0.544870973 0.09003889
## 6 0.14724579 -0.03495115 NaN NaN NaN
## biomarker3_m60
## 0 0.01531991
## 1 0.10208759
## 2 0.57671410
## 3 -0.33763975
## 4 0.10208759
## 5 0.10208759
## 6 NaN
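Rather than scanning the printed rows, the same property can be checked programmatically. The sketch below uses small made-up stand-ins (original, mask, imputed) instead of a real model result, so it only illustrates the comparison and says nothing about what run_cissvae() returns beyond what is shown above.

```r
# Made-up stand-ins for the input data, the mask, and an imputed result
original <- data.frame(marker_m00 = c(0.1, NA), marker_m06 = c(NA, NA))
mask     <- data.frame(marker_m00 = c(0, 0),    marker_m06 = c(0, 1))
imputed  <- data.frame(marker_m00 = c(0.1, 0.25), marker_m06 = c(0.3, NaN))

# Cells that were missing and flagged as do-not-impute
protected <- as.matrix(mask) == 1 & is.na(as.matrix(original))

# Cells that were missing and eligible for imputation
fillable <- as.matrix(mask) == 0 & is.na(as.matrix(original))

# Protected cells should still be missing (is.na() is TRUE for NaN as well)
protected_ok <- all(is.na(as.matrix(imputed)[protected]))

# Eligible cells should have been filled in
fillable_ok <- !anyNA(as.matrix(imputed)[fillable])

print(c(protected_ok = protected_ok, fillable_ok = fillable_ok))
```

With the real objects, the same comparison would use mock_surv, dni, and res$imputed_dataset in place of the stand-ins, assuming their columns are aligned.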