This function wraps the Python run_cissvae function from the ciss_vae module, providing a complete pipeline for missing data imputation using a Cluster-Informed Shared and Specific Variational Autoencoder (CISS-VAE). The function handles data preprocessing and model training, and returns imputed data along with optional model artifacts.
The CISS-VAE architecture uses cluster information to learn both shared and cluster-specific representations, enabling more accurate imputation by leveraging patterns within and across different data subgroups.
Usage
run_cissvae(
  data,
  index_col = NULL,
  val_proportion = 0.1,
  replacement_value = 0,
  columns_ignore = NULL,
  print_dataset = TRUE,
  clusters = NULL,
  n_clusters = NULL,
  seed = 42,
  missingness_proportion_matrix = NULL,
  scale_features = FALSE,
  k_neighbors = 15L,
  leiden_resolution = 0.5,
  leiden_objective = "CPM",
  hidden_dims = c(150, 120, 60),
  latent_dim = 15,
  layer_order_enc = c("unshared", "unshared", "unshared"),
  layer_order_dec = c("shared", "shared", "shared"),
  latent_shared = FALSE,
  output_shared = FALSE,
  batch_size = 4000,
  epochs = 500,
  initial_lr = 0.01,
  decay_factor = 0.999,
  beta = 0.001,
  device = NULL,
  max_loops = 100,
  patience = 2,
  epochs_per_loop = NULL,
  initial_lr_refit = NULL,
  decay_factor_refit = NULL,
  beta_refit = NULL,
  verbose = FALSE,
  return_model = TRUE,
  return_clusters = FALSE,
  return_silhouettes = FALSE,
  return_history = FALSE,
  return_dataset = FALSE,
  do_not_impute_matrix = NULL,
  debug = FALSE
)
Arguments
- data
A data.frame or matrix (samples × features) containing the data to impute. May contain NA values, which will be imputed.
- index_col
Character. Name of the column in data to treat as the sample identifier. This column is removed before training and re-attached to the results. Default NULL.
- val_proportion
Numeric. Fraction of non-missing entries to hold out for validation during training. Must be between 0 and 1. Default 0.1.
- replacement_value
Numeric. Fill value for masked entries during training. Default 0.0.
- columns_ignore
Character or integer vector. Columns to exclude from the validation set, specified by name or index. Default NULL.
- print_dataset
Logical. If TRUE, prints dataset summary information during processing. Default TRUE.
- clusters
Optional vector or single-column data.frame of precomputed cluster labels for samples. If NULL, clustering is performed automatically. Default NULL.
- n_clusters
Integer. Number of clusters for KMeans clustering when clusters is NULL. If NULL, HDBSCAN will be used for clustering. Default NULL.
- seed
Integer. Random seed for reproducible results. Default 42.
- missingness_proportion_matrix
Optional pre-computed missingness proportion matrix for biomarker-based clustering. If provided, clustering will be based on these proportions. Default NULL.
- scale_features
Logical. Whether to scale features when using missingness proportion matrix clustering. Default FALSE.
- hidden_dims
Integer vector. Sizes of hidden layers in encoder/decoder. Length determines the number of hidden layers. Default c(150, 120, 60).
- latent_dim
Integer. Dimension of latent space representation. Default 15.
- layer_order_enc
Character vector. Sharing pattern for encoder layers. Each element should be "shared" or "unshared". Length must match length(hidden_dims). Default c("unshared", "unshared", "unshared").
- layer_order_dec
Character vector. Sharing pattern for decoder layers. Each element should be "shared" or "unshared". Length must match length(hidden_dims). Default c("shared", "shared", "shared").
- latent_shared
Logical. Whether latent space weights are shared across clusters. Default FALSE.
- output_shared
Logical. Whether output layer weights are shared across clusters. Default FALSE.
- batch_size
Integer. Mini-batch size for training. Larger values may improve training stability but require more memory. Default 4000.
- epochs
Integer. Number of epochs for the initial training phase. Default 500.
- initial_lr
Numeric. Initial learning rate for the optimizer. Default 0.01.
- decay_factor
Numeric. Exponential decay factor for learning rate scheduling. Must be between 0 and 1. Default 0.999.
- beta
Numeric. Weight for the KL divergence term in the VAE loss function. Controls regularization strength. Default 0.001.
- device
Character. Device specification for computation ("cpu" or "cuda"). If NULL, automatically selects the best available device. Default NULL.
- max_loops
Integer. Maximum number of impute-refit loops to perform. Default 100.
- patience
Integer. Early stopping patience for refit loops. Training stops if validation loss does not improve for this many consecutive loops. Default 2.
- epochs_per_loop
Integer. Number of epochs per refit loop. If NULL, uses the same value as epochs. Default NULL.
- initial_lr_refit
Numeric. Learning rate for refit loops. If NULL, uses the same value as initial_lr. Default NULL.
- decay_factor_refit
Numeric. Decay factor for refit loops. If NULL, uses the same value as decay_factor. Default NULL.
- beta_refit
Numeric. KL weight for refit loops. If NULL, uses the same value as beta. Default NULL.
- verbose
Logical. If TRUE, prints detailed progress information during training. Default FALSE.
- return_model
Logical. If TRUE, returns the trained Python VAE model object. Default TRUE.
- return_silhouettes
Logical. If TRUE, returns silhouette scores for cluster quality assessment. Default FALSE.
- return_history
Logical. If TRUE, returns training history as a data.frame containing loss values and metrics over epochs. Default FALSE.
- return_dataset
Logical. If TRUE, returns the ClusterDataset object used during training (contains validation data, masks, etc.). Default FALSE.
- cluster_selection_epsilon
Numeric. Epsilon parameter for HDBSCAN clustering when automatic clustering is used. Default 0.25.
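The interaction between val_proportion and replacement_value can be illustrated with a small base-R sketch. It assumes held-out entries are drawn uniformly at random from the observed cells, which may differ from how the Python package selects them; mask_for_validation is a hypothetical helper written for this illustration and is not part of the package.

```r
# Hypothetical helper (not part of the package): hold out a fraction of the
# observed (non-NA) entries and fill every masked cell with a placeholder.
# Assumption: held-out cells are sampled uniformly from all observed cells.
mask_for_validation <- function(x, val_proportion = 0.1,
                                replacement_value = 0, seed = 42) {
  stopifnot(is.matrix(x), is.numeric(x),
            val_proportion > 0, val_proportion < 1)
  set.seed(seed)                               # note: resets the global RNG
  observed <- which(!is.na(x))                 # linear indices of observed cells
  n_val <- floor(length(observed) * val_proportion)
  val_idx <- observed[sample.int(length(observed), n_val)]

  val_mask <- matrix(FALSE, nrow(x), ncol(x))
  val_mask[val_idx] <- TRUE

  input <- x
  input[val_idx] <- NA                         # hide held-out entries
  input[is.na(input)] <- replacement_value     # fill missing and held-out cells

  list(input = input, val_mask = val_mask,
       val_idx = val_idx, val_truth = x[val_idx])
}

# Toy data: 20 samples x 10 features with 30 missing cells
set.seed(1)
x <- matrix(rnorm(200), nrow = 20)
x[sample(length(x), 30)] <- NA
split <- mask_for_validation(x, val_proportion = 0.1)
sum(split$val_mask)   # number of held-out validation entries
```

The held-out values in val_truth are what a validation loss would be computed against, while the model only ever sees input.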
Value
A list containing imputed data and optional additional outputs:
- imputed
data.frame of imputed data with the same dimensions as the input. Missing values are filled with model predictions. If index_col was provided, it is re-attached as the first column.
- model
(if return_model=TRUE) Python CISSVAE model object. Can be used for further analysis or predictions.
- dataset
(if return_dataset=TRUE) Python ClusterDataset object containing validation data, masks, normalization parameters, and cluster labels. Can be used with performance_by_cluster() and other analysis functions.
- silhouettes
(if return_silhouettes=TRUE) Numeric silhouette score measuring cluster separation quality.
- history
(if return_history=TRUE) data.frame containing training history with columns for epoch, losses, and validation metrics.
Details
The CISS-VAE method works in two main phases:
- Initial Training: The model is trained on the original data with validation holdout to learn initial representations and imputation patterns.
- Impute-Refit Loops: The model iteratively imputes missing values and retrains on the updated dataset until convergence or until the maximum number of loops is reached.
The architecture uses both shared and cluster-specific layers to capture:
- Shared patterns: Common relationships across all clusters
- Specific patterns: Unique relationships within each cluster
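The two-phase control flow described above (initial fit, then impute-refit loops with patience-based early stopping) can be sketched in base R. This is a toy under stated assumptions: a rank-2 SVD reconstruction stands in for the VAE, and fit_predict and impute_refit are hypothetical names introduced only for this illustration. It shows the loop logic, not the package's actual training code.

```r
# Toy stand-in for the model: a rank-k SVD reconstruction replaces the VAE.
fit_predict <- function(filled, rank = 2) {
  s <- svd(filled, nu = rank, nv = rank)
  s$u %*% diag(s$d[seq_len(rank)], rank, rank) %*% t(s$v)
}

impute_refit <- function(x, val_idx, max_loops = 100, patience = 2, rank = 2) {
  truth <- x[val_idx]
  work <- x
  work[val_idx] <- NA                       # hold out validation entries
  missing <- is.na(work)

  # Phase 1: initial fit, with missing cells filled by column means
  filled <- work
  filled[missing] <- colMeans(work, na.rm = TRUE)[col(work)[missing]]
  recon <- fit_predict(filled, rank)
  best <- mean((recon[val_idx] - truth)^2)
  best_recon <- recon
  history <- best
  stale <- 0

  # Phase 2: impute-refit loops, stopping early on stalled validation error
  for (loop in seq_len(max_loops)) {
    filled[missing] <- recon[missing]       # impute with current predictions
    recon <- fit_predict(filled, rank)      # refit on the updated data
    val_mse <- mean((recon[val_idx] - truth)^2)
    history <- c(history, val_mse)
    if (val_mse < best) {
      best <- val_mse
      best_recon <- recon
      stale <- 0
    } else {
      stale <- stale + 1
      if (stale >= patience) break          # no improvement for `patience` loops
    }
  }

  imputed <- x
  na_orig <- is.na(x)
  imputed[na_orig] <- best_recon[na_orig]   # only originally missing cells change
  list(imputed = imputed, val_mse = history, loops = length(history) - 1)
}

# Toy data: approximately rank-2 matrix with 60 missing cells
set.seed(42)
n <- 60; p <- 8
x <- matrix(rnorm(n * 2), n, 2) %*% matrix(rnorm(2 * p), 2, p) +
  matrix(rnorm(n * p, sd = 0.1), n, p)
x[sample(length(x), 60)] <- NA
observed <- which(!is.na(x))
val_idx <- sample(observed, floor(0.1 * length(observed)))
fit <- impute_refit(x, val_idx, max_loops = 100, patience = 2)
fit$loops   # number of refit loops actually run
```

Keeping the best reconstruction rather than the last one means a late loop that degrades validation error does not affect the returned imputation.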
Requirements
This function requires the Python ciss_vae package to be installed and accessible via reticulate. The package handles automatic device selection (CPU/GPU) based on availability.
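A quick pre-flight check can confirm that the Python module is importable before starting a long run. The sketch below relies on reticulate::py_module_available(); the wrapper python_module_ready is a hypothetical helper, and the check only reports availability without attempting any installation.

```r
# Hypothetical helper: TRUE only if reticulate is installed and the Python
# module can be imported from the environment reticulate is bound to.
python_module_ready <- function(module = "ciss_vae") {
  if (!requireNamespace("reticulate", quietly = TRUE)) return(FALSE)
  isTRUE(tryCatch(reticulate::py_module_available(module),
                  error = function(e) FALSE))
}

ready <- python_module_ready("ciss_vae")
if (!ready) {
  message("Python module 'ciss_vae' not found; install it into the ",
          "environment that reticulate uses before calling run_cissvae().")
}
```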
Performance Tips
- Use GPU computation when available for faster training on large datasets.
- Adjust batch_size based on available memory (larger = faster but more memory).
- Start with default hyperparameters and adjust based on validation performance.
- Use verbose=TRUE to monitor training progress on large datasets.
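When adjusting initial_lr and decay_factor, it can help to look at the schedule they imply. The sketch below assumes the learning rate is multiplied by decay_factor once per epoch; the exact scheduler used by the Python package is not stated here, so treat this as an approximation. lr_schedule is a hypothetical helper.

```r
# Hypothetical helper: learning rate per epoch under exponential decay.
# Assumption: the rate is multiplied by decay_factor once per epoch.
lr_schedule <- function(initial_lr = 0.01, decay_factor = 0.999, epochs = 500) {
  stopifnot(initial_lr > 0, decay_factor > 0, decay_factor <= 1, epochs >= 1)
  initial_lr * decay_factor^(seq_len(epochs) - 1)
}

lr <- lr_schedule()          # documented defaults
lr[c(1, 250, 500)]           # rate at the start, middle, and end of training
lr[500] / lr[1]              # fraction of the initial rate left at the end
```

Under this assumption the default settings leave roughly 60% of the initial rate after 500 epochs, so a smaller decay_factor is needed if a faster drop is wanted.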
See also
create_missingness_prop_matrix for creating missingness proportion matrices
performance_by_cluster for analyzing model performance using the returned dataset
Examples
if (FALSE) { # \dontrun{
# Basic usage with automatic clustering
result <- run_cissvae(
  data = my_data_with_missing,
  index_col = "sample_id"
)
imputed_data <- result$imputed

# Advanced usage with dataset for performance analysis
result <- run_cissvae(
  data = my_data,
  clusters = my_cluster_labels,
  hidden_dims = c(200, 150, 100),
  latent_dim = 20,
  epochs = 1000,
  return_history = TRUE,
  return_silhouettes = TRUE,
  return_dataset = TRUE,
  verbose = TRUE
)

# Access different outputs
imputed_data <- result$imputed
training_history <- result$history
cluster_quality <- result$silhouettes

# Use dataset for performance analysis
perf <- performance_by_cluster(
  original_data = my_data,
  model = result$model,
  dataset = result$dataset,
  clusters = my_cluster_labels
)

# Using pre-computed missingness matrix for clustering
prop_matrix <- create_missingness_prop_matrix(
  data = my_data,
  index_col = "sample_id"
)
result <- run_cissvae(
  data = my_data,
  index_col = "sample_id",
  missingness_proportion_matrix = prop_matrix,
  scale_features = TRUE,
  return_dataset = TRUE
)

# Custom layer sharing patterns
result <- run_cissvae(
  data = my_data,
  hidden_dims = c(100, 80, 60),
  layer_order_enc = c("unshared", "shared", "shared"),
  layer_order_dec = c("shared", "shared", "unshared"),
  latent_shared = TRUE
)
} # }
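Because layer_order_enc and layer_order_dec must each match length(hidden_dims) and contain only "shared" or "unshared", a small check before calling run_cissvae() can catch mismatches early. check_layer_orders below is a hypothetical helper written for this illustration; whether the package performs an equivalent check internally is not documented here.

```r
# Hypothetical helper: returns a character vector of problems (empty if none).
check_layer_orders <- function(hidden_dims, layer_order_enc, layer_order_dec) {
  orders <- list(layer_order_enc = layer_order_enc,
                 layer_order_dec = layer_order_dec)
  problems <- character(0)
  for (nm in names(orders)) {
    value <- orders[[nm]]
    # Each sharing pattern needs one entry per hidden layer
    if (length(value) != length(hidden_dims)) {
      problems <- c(problems, sprintf("%s has length %d, expected %d",
                                      nm, length(value), length(hidden_dims)))
    }
    # Only the two documented labels are accepted
    bad <- setdiff(value, c("shared", "unshared"))
    if (length(bad) > 0) {
      problems <- c(problems, sprintf("%s contains invalid value(s): %s",
                                      nm, paste(bad, collapse = ", ")))
    }
  }
  problems
}

# Valid configuration: no problems reported
ok <- check_layer_orders(
  hidden_dims = c(100, 80, 60),
  layer_order_enc = c("unshared", "shared", "shared"),
  layer_order_dec = c("shared", "shared", "unshared")
)

# Invalid configuration: wrong length and a misspelled label
bad <- check_layer_orders(
  hidden_dims = c(100, 80, 60),
  layer_order_enc = c("unshared", "shared"),
  layer_order_dec = c("shared", "shared", "unshard")
)
print(bad)
```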