This function wraps the Python run_cissvae function from the ciss_vae package,
providing a complete pipeline for missing data imputation using a Cluster-Informed
Shared and Specific Variational Autoencoder (CISS-VAE). The function handles data
preprocessing, model training, and returns imputed data along with optional
model artifacts.
The CISS-VAE architecture uses cluster information to learn both shared and cluster-specific representations, enabling more accurate imputation by leveraging patterns within and across different data subgroups.
Usage
run_cissvae(
data,
index_col = NULL,
val_proportion = 0.1,
replacement_value = 0,
columns_ignore = NULL,
imputable_matrix = NULL,
binary_feature_mask = NULL,
print_dataset = TRUE,
clusters = NULL,
n_clusters = NULL,
seed = 42,
missingness_proportion_matrix = NULL,
scale_features = FALSE,
k_neighbors = 15L,
leiden_resolution = 0.5,
leiden_objective = "CPM",
hidden_dims = c(150, 120, 60),
latent_dim = 15,
layer_order_enc = c("unshared", "unshared", "unshared"),
layer_order_dec = c("shared", "shared", "shared"),
latent_shared = FALSE,
output_shared = FALSE,
batch_size = 4000,
epochs = 500,
initial_lr = 0.01,
decay_factor = 0.999,
weight_decay = 0.001,
beta = 0.001,
device = NULL,
max_loops = 100,
patience = 2,
epochs_per_loop = NULL,
initial_lr_refit = NULL,
decay_factor_refit = NULL,
beta_refit = NULL,
verbose = FALSE,
return_model = TRUE,
return_clusters = FALSE,
return_silhouettes = FALSE,
return_history = FALSE,
return_dataset = FALSE,
return_validation_dataset = FALSE,
debug = FALSE
)
Arguments
- data
A data.frame or matrix (samples × features) containing the data to impute. May contain NA values, which will be imputed.
- index_col
Character. Name of the column in data to treat as the sample identifier. This column is removed before training and re-attached to the results. Default NULL.
- val_proportion
Numeric. Fraction of non-missing entries to hold out for validation during training. Must be between 0 and 1. Default 0.1.
- replacement_value
Numeric. Fill value for masked entries during training. Default 0.0.
- columns_ignore
Character or integer vector. Columns to exclude from the validation set, specified by name or index. Default NULL.
- imputable_matrix
Logical matrix indicating which entries are allowed to be imputed. Default NULL.
- binary_feature_mask
Logical vector marking which columns are binary. Default NULL.
- print_dataset
Logical. If TRUE, prints dataset summary information during processing. Default TRUE.
- clusters
Optional vector or single-column data.frame of precomputed cluster labels for samples. If NULL, clustering is performed automatically. Default NULL.
- n_clusters
Integer. Number of clusters for KMeans clustering when clusters is NULL. If n_clusters is also NULL, Leiden clustering is used instead. Default NULL.
- seed
Integer. Random seed for reproducible results. Default 42.
- missingness_proportion_matrix
Optional precomputed missingness proportion matrix for biomarker-based clustering. If provided, clustering is based on these proportions. Default NULL.
- scale_features
Logical. Whether to scale features when clustering on the missingness proportion matrix. Default FALSE.
- k_neighbors
Integer. Number of nearest neighbors for Leiden clustering. Default 15.
- leiden_resolution
Float. Resolution parameter for Leiden clustering. Defaults to 0.5.
- leiden_objective
Character. Objective function for Leiden clustering. One of "CPM", "RB", or "Modularity". Default "CPM".
- hidden_dims
Integer vector. Sizes of the hidden layers in the encoder and decoder. The length determines the number of hidden layers. Default c(150, 120, 60).
- latent_dim
Integer. Dimension of the latent space representation. Default 15.
- layer_order_enc
Character vector. Sharing pattern for encoder layers. Each element should be "shared" or "unshared". Length must match length(hidden_dims). Default c("unshared", "unshared", "unshared").
- layer_order_dec
Character vector. Sharing pattern for decoder layers. Each element should be "shared" or "unshared". Length must match length(hidden_dims). Default c("shared", "shared", "shared").
- latent_shared
Logical. Whether latent space weights are shared across clusters. Default FALSE.
- output_shared
Logical. Whether output layer weights are shared across clusters. Default FALSE.
FALSE.- batch_size
Integer. Mini-batch size for training. Larger values may improve training stability but require more memory. Default 4000.
- epochs
Integer. Number of epochs for the initial training phase. Default 500.
- initial_lr
Numeric. Initial learning rate for the optimizer. Default 0.01.
- decay_factor
Numeric. Exponential decay factor for learning rate scheduling. Must be between 0 and 1. Default 0.999.
- weight_decay
Numeric. Weight decay (L2 penalty) used in the Adam optimizer. Default 0.001.
- beta
Numeric. Weight for the KL divergence term in the VAE loss function. Controls regularization strength. Default 0.001.
- device
Character. Device specification for computation ("cpu" or "cuda"). If NULL, automatically selects the best available device. Default NULL.
- max_loops
Integer. Maximum number of impute-refit loops to perform. Default 100.
- patience
Integer. Early stopping patience for refit loops. Training stops if validation loss does not improve for this many consecutive loops. Default 2.
- epochs_per_loop
Integer. Number of epochs per refit loop. If NULL, uses the same value as epochs. Default NULL.
- initial_lr_refit
Numeric. Learning rate for refit loops. If NULL, uses the same value as initial_lr. Default NULL.
- decay_factor_refit
Numeric. Decay factor for refit loops. If NULL, uses the same value as decay_factor. Default NULL.
- beta_refit
Numeric. KL weight for refit loops. If NULL, uses the same value as beta. Default NULL.
- verbose
Logical. If TRUE, prints detailed progress information during training. Default FALSE.
- return_model
Logical. If TRUE, returns the trained Python VAE model object. Default TRUE.
- return_clusters
Logical. If TRUE, returns the vector of cluster assignments. Default FALSE.
- return_silhouettes
Logical. If TRUE, returns silhouette scores for cluster quality assessment. Default FALSE.
- return_history
Logical. If TRUE, returns training history as a data.frame containing loss values and metrics over epochs. Default FALSE.
- return_dataset
Logical. If TRUE, returns the ClusterDataset object used during training (contains validation data, masks, etc.). Default FALSE.
- return_validation_dataset
Logical. If TRUE, returns the validation dataset. Default FALSE.
- debug
Logical. If TRUE, additional metadata is returned for debugging. Default FALSE.
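The length and value constraints on layer_order_enc and layer_order_dec can be checked before a long training run. The following is a minimal sketch in base R; check_layer_orders() is a hypothetical helper written for illustration and is not part of rCISSVAE, which may perform its own validation.

```r
# Hypothetical helper (not part of rCISSVAE): checks that the layer-sharing
# patterns are consistent with hidden_dims, as the argument docs require.
check_layer_orders <- function(hidden_dims, layer_order_enc, layer_order_dec) {
  allowed <- c("shared", "unshared")
  stopifnot(
    length(layer_order_enc) == length(hidden_dims),
    length(layer_order_dec) == length(hidden_dims),
    all(layer_order_enc %in% allowed),
    all(layer_order_dec %in% allowed)
  )
  invisible(TRUE)
}

# Passes: three hidden layers, three entries in each pattern.
ok <- check_layer_orders(
  hidden_dims     = c(150, 120, 60),
  layer_order_enc = c("unshared", "shared", "unshared"),
  layer_order_dec = c("shared", "unshared", "shared")
)
```

Calling the helper with patterns of the wrong length, or with a value other than "shared" or "unshared", stops with an error before any model is built.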
Value
A list containing imputed data and optional additional outputs:
- imputed_dataset
data.frame of imputed data with the same dimensions as the input. Missing values are filled with model predictions. If index_col was provided, it is re-attached as the first column.
- model
(if return_model=TRUE) Python CISSVAE model object. Can be used for further analysis or predictions.
- cluster_dataset
(if return_dataset=TRUE) Python ClusterDataset object containing validation data, masks, normalization parameters, and cluster labels. Can be used with performance_by_cluster() and other analysis functions.
- clusters
(if return_clusters=TRUE) Vector of cluster assignments.
- silhouettes
(if return_silhouettes=TRUE) Numeric silhouette score measuring cluster separation quality.
- training_history
(if return_history=TRUE) data.frame containing training history with columns for epoch, losses, and validation metrics.
- val_data
(if return_validation_dataset=TRUE) data.frame containing the values held out for validation.
- val_imputed
(if return_validation_dataset=TRUE) data.frame containing the imputed values for the held-out validation entries.
Details
The CISS-VAE method works in two main phases:
Initial Training: The model is trained on the original data with validation holdout to learn initial representations and imputation patterns.
Impute-Refit Loops: The model iteratively imputes missing values and retrains on the updated dataset until validation loss stops improving for patience consecutive loops or max_loops is reached.
The architecture uses both shared and cluster-specific layers to capture:
Shared patterns: Common relationships across all clusters
Specific patterns: Unique relationships within each cluster
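The stopping rule for the impute-refit phase, as described by the max_loops and patience arguments, can be illustrated with a small base R sketch. This is an assumption-laden illustration, not the package's implementation: count_refit_loops() is a hypothetical helper, and the validation losses are made-up numbers standing in for the loss measured after each refit loop.

```r
# Hypothetical helper (not part of rCISSVAE): applies a patience-based
# early-stopping rule to a sequence of per-loop validation losses.
count_refit_loops <- function(val_losses, max_loops = 100, patience = 2) {
  best_loss <- Inf
  loops_without_improvement <- 0
  loops_run <- 0
  # Never run more than max_loops loops.
  for (loss in head(val_losses, max_loops)) {
    loops_run <- loops_run + 1
    if (loss < best_loss) {
      # Improvement: record it and reset the patience counter.
      best_loss <- loss
      loops_without_improvement <- 0
    } else {
      loops_without_improvement <- loops_without_improvement + 1
      if (loops_without_improvement >= patience) break
    }
  }
  list(loops_run = loops_run, best_loss = best_loss)
}

# Made-up losses: improvement stalls after the third loop.
res <- count_refit_loops(c(0.90, 0.70, 0.65, 0.66, 0.67, 0.60), patience = 2)
res$loops_run  # 5: stops after two consecutive loops without improvement
res$best_loss  # 0.65
```

With patience = 2 the sketch stops at the fifth loop and never sees the later, lower loss of 0.60, which is the trade-off a small patience value makes in exchange for shorter training.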
Requirements
This function requires the Python ciss_vae package to be installed and
accessible via reticulate.
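One way to confirm this requirement before calling run_cissvae() is to ask reticulate whether the module can be imported. The sketch below uses reticulate::py_module_available(); the wrapper ciss_vae_available() is a hypothetical helper introduced here for illustration, and it treats any error raised while initializing Python as "not available".

```r
# Hypothetical helper (not part of rCISSVAE): TRUE only if reticulate is
# installed and reports that the Python module ciss_vae can be imported.
ciss_vae_available <- function() {
  if (!requireNamespace("reticulate", quietly = TRUE)) {
    return(FALSE)
  }
  # py_module_available() may initialize Python; any error counts as FALSE.
  isTRUE(tryCatch(
    reticulate::py_module_available("ciss_vae"),
    error = function(e) FALSE
  ))
}

available <- ciss_vae_available()
if (!available) {
  message("Python module 'ciss_vae' not found; run_cissvae() will fail until it is installed.")
}
```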
Performance tips
If Leiden clustering yields too many clusters, consider increasing k_neighbors or reducing leiden_resolution.
Use GPU computation when available for faster training on large datasets. Use check_devices() to see what devices are available.
Adjust batch_size based on available memory (larger is faster but uses more memory).
Set verbose = TRUE to monitor training progress.
See also
create_missingness_prop_matrix for creating missingness proportion matrices
performance_by_cluster for analyzing model performance using the returned dataset
Examples
# \donttest{
## Requires a working Python environment via reticulate
## Examples are wrapped in try() to avoid failures on CRAN check systems
library(rCISSVAE)
data(df_missing)
data(clusters)
try({
  dat <- run_cissvae(
    data = df_missing,
    index_col = "index",
    val_proportion = 0.1, ## pass a vector for different proportions by cluster
    columns_ignore = c("Age", "Salary", "ZipCode10001", "ZipCode20002", "ZipCode30003"),
    clusters = clusters$clusters, ## we have precomputed cluster labels so we pass them here
    epochs = 5,
    return_silhouettes = FALSE,
    return_history = TRUE, # Get detailed training history
    verbose = FALSE,
    return_model = TRUE, ## Allows for plotting model schematic
    device = "cpu", # Explicit device selection
    layer_order_enc = c("unshared", "shared", "unshared"),
    layer_order_dec = c("shared", "unshared", "shared")
  )
})
#> Error in py_module_import(module, convert = convert) :
#> ModuleNotFoundError: No module named 'ciss_vae'
#> Run `reticulate::py_last_error()` for details.
# }
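Once a call succeeds, the returned list can be inspected using the element names given in the Value section. The sketch below is illustrative only: summarise_cissvae_result() is a hypothetical helper, and mock_result is a hand-built stand-in with the documented element names, not real output from run_cissvae().

```r
# Hypothetical helper (not part of rCISSVAE): summarises the list that
# run_cissvae() is documented to return.
summarise_cissvae_result <- function(res) {
  list(
    returned       = names(res),                      # which outputs were requested
    n_rows         = nrow(res$imputed_dataset),       # samples in the imputed data
    n_remaining_na = sum(is.na(res$imputed_dataset))  # should be 0 if fully imputed
  )
}

# Mock stand-in for real output; values are placeholders, not results.
mock_result <- list(
  imputed_dataset  = data.frame(index = 1:3, x = c(0.1, 0.2, 0.3)),
  training_history = data.frame(epoch = 1:2)
)

s <- summarise_cissvae_result(mock_result)
s$returned        # "imputed_dataset" "training_history"
s$n_remaining_na  # 0
```

With a real result, the same helper could be called as summarise_cissvae_result(dat) after the example above has run successfully.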