
In some cases we want CISSVAE to leave certain missing entries unimputed. For example, we would not want to impute a patient's biomarker values for collection dates after the date of death.

We can use the do_not_impute_matrix argument of the run_cissvae() function to mask these values. This binary matrix should have the same dimensions as the input data frame, with each cell containing 0 (impute if missing) or 1 (do not impute, even if missing).
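As an illustration of how such a matrix could be built, here is a minimal base R sketch. The toy data frame, its column names, and the rule "flag every collection month later than 12 × death_year" are assumptions made for this example; they are not taken from the package, and the bundled dni matrix may have been built differently.

```r
# Toy data (hypothetical values): two patients, one biomarker at months 0-36
toy <- data.frame(
  patient_id    = c("A", "B"),
  death_event   = c(1, 0),
  death_year    = c(1, NA),
  biomarker_m00 = c(0.1, NA),
  biomarker_m12 = c(NA, 0.3),
  biomarker_m24 = c(NA, NA),
  biomarker_m36 = c(NA, 0.5)
)

# Start from an all-zero mask with the same shape and column names as the data
toy_dni <- as.data.frame(
  matrix(0, nrow = nrow(toy), ncol = ncol(toy),
         dimnames = list(NULL, names(toy)))
)

# Parse the collection month from column names such as "biomarker_m24"
bio_cols <- grep("_m[0-9]+$", names(toy), value = TRUE)
months   <- as.numeric(sub(".*_m", "", bio_cols))

# Flag cells whose collection month falls strictly after the recorded death
for (j in seq_along(bio_cols)) {
  after_death <- !is.na(toy$death_year) & months[j] > toy$death_year * 12
  toy_dni[[bio_cols[j]]] <- as.numeric(after_death)
}

print(toy_dni)
```

Patients without a recorded death keep a row of zeros, so all of their missing values remain eligible for imputation.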

Let’s use a simple survival dataset as an example. The mock survival dataset (mock_surv) and its do-not-impute matrix (dni) can be loaded with the data() function.

library(rCISSVAE)
library(tidyverse)
library(ggplot2)
library(gtsummary)

data(dni, mock_surv)

mock_surv %>% 
   tbl_summary(include = -c("patient_id")) %>%
   as_kable()
Characteristic        N = 200
death_event           30 (15%)
death_year
    0.5               3 (10%)
    1                 4 (13%)
    2                 1 (3.3%)
    2.5               4 (13%)
    3                 6 (20%)
    3.5               4 (13%)
    4                 2 (6.7%)
    4.5               6 (20%)
    Unknown           170
biomarker1_m00        0.08 (-0.27, 0.40)
    Unknown           68
biomarker1_m06        0.20 (-0.14, 0.54)
    Unknown           67
biomarker1_m12        0.31 (-0.02, 0.63)
    Unknown           67
biomarker1_m18        0.41 (0.05, 0.83)
    Unknown           65
biomarker1_m24        0.64 (0.22, 0.88)
    Unknown           70
biomarker1_m30        0.71 (0.29, 1.09)
    Unknown           66
biomarker1_m36        0.83 (0.45, 1.23)
    Unknown           76
biomarker1_m42        0.95 (0.41, 1.36)
    Unknown           71
biomarker1_m48        1.14 (0.61, 1.61)
    Unknown           79
biomarker1_m54        1.22 (0.71, 1.56)
    Unknown           90
biomarker1_m60        1.33 (0.73, 1.80)
    Unknown           89
biomarker2_m00        0.01 (-0.29, 0.39)
    Unknown           64
biomarker2_m06        0.06 (-0.35, 0.34)
    Unknown           65
biomarker2_m12        -0.03 (-0.28, 0.32)
    Unknown           55
biomarker2_m18        -0.06 (-0.35, 0.31)
    Unknown           64
biomarker2_m24        -0.10 (-0.44, 0.23)
    Unknown           73
biomarker2_m30        -0.22 (-0.60, 0.22)
    Unknown           60
biomarker2_m36        -0.23 (-0.66, 0.18)
    Unknown           61
biomarker2_m42        -0.33 (-0.71, 0.09)
    Unknown           72
biomarker2_m48        -0.42 (-0.91, -0.03)
    Unknown           73
biomarker2_m54        -0.42 (-0.95, 0.01)
    Unknown           81
biomarker2_m60        -0.44 (-1.04, 0.00)
    Unknown           81
biomarker3_m00        0.08 (-0.22, 0.45)
    Unknown           65
biomarker3_m06        0.08 (-0.26, 0.39)
    Unknown           58
biomarker3_m12        0.09 (-0.30, 0.39)
    Unknown           66
biomarker3_m18        0.06 (-0.28, 0.51)
    Unknown           65
biomarker3_m24        0.09 (-0.32, 0.45)
    Unknown           64
biomarker3_m30        0.10 (-0.30, 0.48)
    Unknown           64
biomarker3_m36        -0.02 (-0.34, 0.41)
    Unknown           76
biomarker3_m42        0.09 (-0.21, 0.45)
    Unknown           79
biomarker3_m48        0.15 (-0.48, 0.53)
    Unknown           95
biomarker3_m54        0.13 (-0.40, 0.54)
    Unknown           91
biomarker3_m60        0.06 (-0.49, 0.58)
    Unknown           93

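Consider the entry for P007, who has a death event at year 3. In the dni matrix, this patient's biomarker entries after month 36 (year 3) are marked as non-imputable. Note that the tibble printouts below are truncated to the first six columns, all of which are 0 in dni; the flagged cells sit in the later columns (month 42 onward) that the printout omits.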

print(mock_surv[7,])
## # A tibble: 1 × 36
##   patient_id death_event death_year biomarker1_m00 biomarker1_m06 biomarker1_m12
##   <chr>            <dbl>      <dbl>          <dbl>          <dbl>          <dbl>
## 1 P007                 1          3         -0.165             NA          0.319
## # ℹ 30 more variables: biomarker1_m18 <dbl>, biomarker1_m24 <dbl>,
## #   biomarker1_m30 <dbl>, biomarker1_m36 <dbl>, biomarker1_m42 <dbl>,
## #   biomarker1_m48 <dbl>, biomarker1_m54 <dbl>, biomarker1_m60 <dbl>,
## #   biomarker2_m00 <dbl>, biomarker2_m06 <dbl>, biomarker2_m12 <dbl>,
## #   biomarker2_m18 <dbl>, biomarker2_m24 <dbl>, biomarker2_m30 <dbl>,
## #   biomarker2_m36 <dbl>, biomarker2_m42 <dbl>, biomarker2_m48 <dbl>,
## #   biomarker2_m54 <dbl>, biomarker2_m60 <dbl>, biomarker3_m00 <dbl>, …
print(dni[7,])
## # A tibble: 1 × 36
##   patient_id death_event death_year biomarker1_m00 biomarker1_m06 biomarker1_m12
##        <dbl>       <dbl>      <dbl>          <dbl>          <dbl>          <dbl>
## 1          0           0          0              0              0              0
## # ℹ 30 more variables: biomarker1_m18 <dbl>, biomarker1_m24 <dbl>,
## #   biomarker1_m30 <dbl>, biomarker1_m36 <dbl>, biomarker1_m42 <dbl>,
## #   biomarker1_m48 <dbl>, biomarker1_m54 <dbl>, biomarker1_m60 <dbl>,
## #   biomarker2_m00 <dbl>, biomarker2_m06 <dbl>, biomarker2_m12 <dbl>,
## #   biomarker2_m18 <dbl>, biomarker2_m24 <dbl>, biomarker2_m30 <dbl>,
## #   biomarker2_m36 <dbl>, biomarker2_m42 <dbl>, biomarker2_m48 <dbl>,
## #   biomarker2_m54 <dbl>, biomarker2_m60 <dbl>, biomarker3_m00 <dbl>, …
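Because the truncated printout hides most columns, it can help to list the flagged columns for a row directly. The following is a minimal base R sketch on a small stand-in mask; the object mask_row and its values are hypothetical, chosen only to mirror the column naming used above.

```r
# Hypothetical one-row mask shaped like a row of a do-not-impute matrix
mask_row <- data.frame(
  patient_id     = 0,
  biomarker1_m30 = 0,
  biomarker1_m36 = 0,
  biomarker1_m42 = 1,
  biomarker1_m48 = 1
)

# Names of the columns flagged as "do not impute" in this row
flagged <- names(mask_row)[unlist(mask_row[1, ]) == 1]
print(flagged)
```

Applying the same indexing to dni[7, ] should list the flagged columns for P007, assuming dni holds only 0/1 values as described earlier.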

When running the CISSVAE model with this dataset, the dataset and the dni matrix must have the same shape and column names. Matching names let us use the index_col and columns_ignore arguments, which refer to columns by name. We pass the dni matrix through the do_not_impute_matrix argument.

res <- run_cissvae(
  mock_surv,
  index_col = "patient_id",
  columns_ignore = c("death_event","death_year"),
  do_not_impute_matrix = dni,
  val_proportion = 0.3,
  return_clusters = FALSE,
  return_history = FALSE,
  epochs = 100,
  leiden_resolution = 0.01,
  k_neighbors = 5,
  return_silhouettes = FALSE
)
## Cluster dataset:
##  ClusterDataset(n_samples=200, n_features=35, n_clusters=53)
##   • Original missing: 2543 / 6600 (38.53%)
##   • Validation held-out: 2077 (51.20% of non-missing)
##   • .data shape:     (200, 35)
##   • .masks shape:    (200, 35)
##   • .val_data shape: (200, 35)
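The requirement that the dataset and the mask share shape and column names can be checked before calling run_cissvae(). Below is a minimal base R sketch; check_dni() is a hypothetical helper written for this example and is not part of rCISSVAE, and the toy objects are made up.

```r
# Hypothetical helper: stop with an error if the mask does not match the data
check_dni <- function(data, dni) {
  stopifnot(
    identical(dim(data), dim(dni)),       # same number of rows and columns
    identical(names(data), names(dni)),   # same column names, same order
    all(as.matrix(dni) %in% c(0, 1))      # mask holds only 0 or 1 (no NA)
  )
  invisible(TRUE)
}

# Toy data and matching mask (made-up values)
toy <- data.frame(
  patient_id    = c("A", "B"),
  biomarker_m00 = c(0.1, NA),
  biomarker_m12 = c(NA, NA)
)
toy_dni <- data.frame(
  patient_id    = c(0, 0),
  biomarker_m00 = c(0, 0),
  biomarker_m12 = c(1, 0)
)

check_dni(toy, toy_dni)  # passes silently
```

With the bundled objects, the equivalent call would be check_dni(mock_surv, dni).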

When we unpack the results and look at the imputed dataset, we can see that P007 (row label 6 below, since the returned row labels start at 0) still has missing values, shown as NaN, for biomarkers after month 36.

print(res$imputed_dataset[1:7,])
##   patient_id death_event death_year biomarker1_m00 biomarker1_m06
## 0       P001           0   2.816667     0.08017647    -0.07296034
## 1       P002           0   2.816667     0.08017647     0.20198478
## 2       P003           0   2.816667     0.08017647    -0.34786278
## 3       P004           0   2.816667     0.08017647     0.17267427
## 4       P005           0   2.816667     0.82078469     1.48447073
## 5       P006           0   2.816667     0.07983060     0.24580511
## 6       P007           1   3.000000    -0.16468754     0.17267427
##   biomarker1_m12 biomarker1_m18 biomarker1_m24 biomarker1_m30 biomarker1_m36
## 0      0.3055585      0.2332004      0.5508518      0.6703717      0.8551518
## 1      0.3008975      0.3917370      0.4851941      0.6703717      0.7560281
## 2     -0.2809101     -0.3463161     -0.1221673      0.1957781      0.1373877
## 3      0.3008975      0.8257963      0.5508518      0.6703717      1.1511669
## 4      1.5075537      1.4269625      1.7042559      1.7166684      1.8634248
## 5      0.3008975      0.2171040      0.2917295      0.4103813      0.7858086
## 6      0.3190216      0.5829263      0.7385894      0.9623325      1.0725095
##   biomarker1_m42 biomarker1_m48 biomarker1_m54 biomarker1_m60 biomarker2_m00
## 0      0.8027524      0.9459516       1.324799      1.2813294    -0.28330877
## 1      0.8027524      1.3371121       1.332744      1.2813294    -0.26420641
## 2      0.1664135      0.1874206       1.332744      0.6016288     0.02322435
## 3      0.8027524      1.2363808       1.376007      1.4586281     0.80775446
## 4      1.9726140      2.1577544       2.650625      2.5509279     1.01649344
## 5      0.4664922      0.6859184       1.073819      0.8490065    -0.09670670
## 6            NaN            NaN            NaN            NaN     0.06232529
##   biomarker2_m06 biomarker2_m12 biomarker2_m18 biomarker2_m24 biomarker2_m30
## 0     0.04395184   -0.524100661    -0.04306518   -0.426104784    -0.55038774
## 1    -0.21099393   -0.060090043    -0.12981604   -0.138382033    -0.22393437
## 2     0.04395184   -0.338102847    -0.91641790   -0.669931769    -0.15691766
## 3     0.04395184    0.330412626     0.50364304    0.547709584     0.56977868
## 4     0.04395184    0.913325489     0.51596826   -0.138382033     0.57457983
## 5    -0.08690932   -0.008838233    -0.23246098   -0.138382033    -0.15691766
## 6     0.06246109    0.073259890     0.28398558    0.002752766     0.05857181
##   biomarker2_m36 biomarker2_m42 biomarker2_m48 biomarker2_m54 biomarker2_m60
## 0    -0.48500043    -0.38082999     -0.8269995     -1.1647223     -0.5343538
## 1    -0.08213539    -0.28418347     -0.3139518     -0.5520591     -0.5343538
## 2    -1.16175210    -0.28418347     -1.3168920     -1.2316631     -1.4466560
## 3     0.43870801    -0.28418347      0.7032660     -0.5520591      0.3902867
## 4     0.12197453    -0.05538718     -0.2674445     -0.4097872     -0.4382522
## 5    -0.44593602    -0.31419751     -0.6617219     -0.5807806     -0.7427507
## 6    -0.36641607            NaN            NaN            NaN            NaN
##   biomarker3_m00 biomarker3_m06 biomarker3_m12 biomarker3_m18 biomarker3_m24
## 0    -0.33970779   -0.321329862     0.02692955     0.04827372    -0.35563889
## 1    -0.19654533    0.007062256    -0.30070391    -0.04024491     0.07211136
## 2     0.10233848    0.054567128     0.02692955    -0.02368132     0.08414590
## 3     0.45304477    0.054567128     0.52907348     0.31234542     0.16816741
## 4     1.09130454    0.054567128     1.03140509     0.83404821     0.16816741
## 5     0.10233848   -0.201686859    -0.12894012    -0.20327407    -0.20174129
## 6     0.05460874    0.054567128     0.02692955     0.17092685     0.39632505
##   biomarker3_m30 biomarker3_m36 biomarker3_m42 biomarker3_m48 biomarker3_m54
## 0    -0.13013658    -0.16296671  -0.1260692775   -0.053942729    -0.15213469
## 1     0.06548417    -0.03495115  -0.0235729553    0.007716056    -0.02886459
## 2     0.14724579     0.03879652   0.0770729780    0.007716056     0.60639662
## 3     0.09883979    -0.03495115  -0.0008705333    0.007716056    -0.21237339
## 4     0.70956743    -0.03495115   0.8746145964    0.552908182     0.96187115
## 5     0.14724579    -0.02289455  -0.2137401998   -0.544870973     0.09003889
## 6     0.14724579    -0.03495115            NaN            NaN            NaN
##   biomarker3_m60
## 0     0.01531991
## 1     0.10208759
## 2     0.57671410
## 3    -0.33763975
## 4     0.10208759
## 5     0.10208759
## 6            NaN
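Rather than scanning the printout by eye, the same check can be done programmatically. The sketch below uses base R on a toy imputed table and mask; both objects are made up for illustration, and the assumption that unflagged cells are fully filled holds for this toy example only.

```r
# Toy imputed output and mask (made-up values)
imputed <- data.frame(
  patient_id    = c("A", "B"),
  biomarker_m00 = c(0.10, 0.22),
  biomarker_m12 = c(NaN, 0.30)
)
mask <- data.frame(
  patient_id    = c(0, 0),
  biomarker_m00 = c(0, 0),
  biomarker_m12 = c(1, 0)
)

# Compare only the value columns, leaving the identifier out
value_cols <- setdiff(names(imputed), "patient_id")
vals  <- as.matrix(imputed[value_cols])
flags <- as.matrix(mask[value_cols]) == 1

# Flagged cells should still be missing; is.na() is TRUE for both NA and NaN
masked_still_missing <- all(is.na(vals[flags]))

# In this toy example every unflagged cell has a value
others_filled <- !anyNA(vals[!flags])

print(c(masked_still_missing = masked_still_missing,
        others_filled = others_filled))
```

The same comparison could be applied to res$imputed_dataset and dni, provided their rows and columns are in the same order.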