Run the CISS-VAE pipeline for missing data imputation

This function wraps the Python run_cissvae function from the ciss_vae module, providing a complete pipeline for missing data imputation using a Cluster-Informed Shared and Specific Variational Autoencoder (CISS-VAE). It handles data preprocessing and model training, then returns the imputed data along with optional model artifacts.

The CISS-VAE architecture uses cluster information to learn both shared and cluster-specific representations, enabling more accurate imputation by leveraging patterns within and across different data subgroups.

Usage

run_cissvae(
  data,
  index_col = NULL,
  val_proportion = 0.1,
  replacement_value = 0,
  columns_ignore = NULL,
  print_dataset = TRUE,
  clusters = NULL,
  n_clusters = NULL,
  seed = 42,
  missingness_proportion_matrix = NULL,
  scale_features = FALSE,
  k_neighbors = 15L,
  leiden_resolution = 0.5,
  leiden_objective = "CPM",
  hidden_dims = c(150, 120, 60),
  latent_dim = 15,
  layer_order_enc = c("unshared", "unshared", "unshared"),
  layer_order_dec = c("shared", "shared", "shared"),
  latent_shared = FALSE,
  output_shared = FALSE,
  batch_size = 4000,
  epochs = 500,
  initial_lr = 0.01,
  decay_factor = 0.999,
  beta = 0.001,
  device = NULL,
  max_loops = 100,
  patience = 2,
  epochs_per_loop = NULL,
  initial_lr_refit = NULL,
  decay_factor_refit = NULL,
  beta_refit = NULL,
  verbose = FALSE,
  return_model = TRUE,
  return_clusters = FALSE,
  return_silhouettes = FALSE,
  return_history = FALSE,
  return_dataset = FALSE,
  do_not_impute_matrix = NULL,
  debug = FALSE
)

Arguments

data: A data.frame or matrix (samples × features) containing the data to impute. May contain NA values, which will be imputed.
index_col: Character. Name of column in data to treat as sample identifier. This column will be removed before training and re-attached to results. Default NULL.
val_proportion: Numeric. Fraction of non-missing entries to hold out for validation during training. Must be between 0 and 1. Default 0.1.
replacement_value: Numeric. Fill value for masked entries during training. Default 0.
columns_ignore: Character or integer vector. Columns to exclude from validation set. Can specify by name or index. Default NULL.
print_dataset: Logical. If TRUE, prints dataset summary information during processing. Default TRUE.
clusters: Optional vector or single-column data.frame of precomputed cluster labels for samples. If NULL, clustering will be performed automatically. Default NULL.
n_clusters: Integer. Number of clusters for KMeans clustering when clusters is NULL. If NULL, HDBSCAN is used for clustering instead. Default NULL.
seed: Integer. Random seed for reproducible results. Default 42.
missingness_proportion_matrix: Optional pre-computed missingness proportion matrix for biomarker-based clustering. If provided, clustering will be based on these proportions. Default NULL.
scale_features: Logical. Whether to scale features when using missingness proportion matrix clustering. Default FALSE.
hidden_dims: Integer vector. Sizes of hidden layers in encoder/decoder. Length determines number of hidden layers. Default c(150, 120, 60).
latent_dim: Integer. Dimension of latent space representation. Default 15.
layer_order_enc: Character vector. Sharing pattern for encoder layers. Each element should be "shared" or "unshared". Length must match length(hidden_dims). Default c("unshared", "unshared", "unshared").
layer_order_dec: Character vector. Sharing pattern for decoder layers. Each element should be "shared" or "unshared". Length must match length(hidden_dims). Default c("shared", "shared", "shared").
latent_shared: Logical. Whether latent space weights are shared across clusters. Default FALSE.
output_shared: Logical. Whether output layer weights are shared across clusters. Default FALSE.
batch_size: Integer. Mini-batch size for training. Larger values may improve training stability but require more memory. Default 4000.
epochs: Integer. Number of epochs for initial training phase. Default 500.
initial_lr: Numeric. Initial learning rate for optimizer. Default 0.01.
decay_factor: Numeric. Exponential decay factor for learning rate scheduling. Must be between 0 and 1. Default 0.999.
beta: Numeric. Weight for KL divergence term in VAE loss function. Controls regularization strength. Default 0.001.
device: Character. Device specification for computation ("cpu" or "cuda"). If NULL, automatically selects best available device. Default NULL.
max_loops: Integer. Maximum number of impute-refit loops to perform. Default 100.
patience: Integer. Early stopping patience for refit loops. Training stops if validation loss doesn't improve for this many consecutive loops. Default 2.
epochs_per_loop: Integer. Number of epochs per refit loop. If NULL, uses same value as epochs. Default NULL.
initial_lr_refit: Numeric. Learning rate for refit loops. If NULL, uses same value as initial_lr. Default NULL.
decay_factor_refit: Numeric. Decay factor for refit loops. If NULL, uses same value as decay_factor. Default NULL.
beta_refit: Numeric. KL weight for refit loops. If NULL, uses same value as beta. Default NULL.
verbose: Logical. If TRUE, prints detailed progress information during training. Default FALSE.
return_model: Logical. If TRUE, returns the trained Python VAE model object. Default TRUE.
return_silhouettes: Logical. If TRUE, returns silhouette scores for cluster quality assessment. Default FALSE.
return_history: Logical. If TRUE, returns training history as a data.frame containing loss values and metrics over epochs. Default FALSE.
return_dataset: Logical. If TRUE, returns the ClusterDataset object used during training (contains validation data, masks, etc.). Default FALSE.
cluster_selection_epsilon: Numeric. Epsilon parameter for HDBSCAN clustering when automatic clustering is used. Default 0.25.
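
Several of the constraints listed above can be checked in R before any Python code runs. The helper below is a minimal sketch and is not part of the package: check_cissvae_args is a hypothetical name, and it encodes only the rules stated in the argument descriptions (val_proportion and decay_factor between 0 and 1, and layer-order vectors that match hidden_dims in length and contain only "shared" or "unshared"). Whether the endpoints 0 and 1 are accepted is not stated, so inclusive bounds are assumed.

```r
# Hypothetical helper (not part of the package): checks the constraints
# documented for run_cissvae() arguments before calling it.
check_cissvae_args <- function(hidden_dims = c(150, 120, 60),
                               layer_order_enc = c("unshared", "unshared", "unshared"),
                               layer_order_dec = c("shared", "shared", "shared"),
                               val_proportion = 0.1,
                               decay_factor = 0.999) {
  problems <- character(0)
  allowed <- c("shared", "unshared")

  # Each layer-order vector needs one entry per hidden layer.
  if (length(layer_order_enc) != length(hidden_dims)) {
    problems <- c(problems, "layer_order_enc must have length(hidden_dims) elements")
  }
  if (length(layer_order_dec) != length(hidden_dims)) {
    problems <- c(problems, "layer_order_dec must have length(hidden_dims) elements")
  }

  # Only "shared" and "unshared" are documented values.
  if (!all(c(layer_order_enc, layer_order_dec) %in% allowed)) {
    problems <- c(problems, 'layer orders may only contain "shared" or "unshared"')
  }

  # Assumption: the documented 0-1 ranges are treated as inclusive here.
  if (!is.numeric(val_proportion) || val_proportion < 0 || val_proportion > 1) {
    problems <- c(problems, "val_proportion must be between 0 and 1")
  }
  if (!is.numeric(decay_factor) || decay_factor < 0 || decay_factor > 1) {
    problems <- c(problems, "decay_factor must be between 0 and 1")
  }

  problems  # character(0) means every check passed
}

check_cissvae_args()                          # defaults pass
check_cissvae_args(hidden_dims = c(100, 80))  # flags both layer-order lengths
```

Returning the list of problems rather than stopping at the first one lets all mismatches be fixed in a single pass.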

Value

A list containing imputed data and optional additional outputs:

imputed: data.frame of imputed data with same dimensions as input. Missing values are filled with model predictions. If index_col was provided, it is re-attached as the first column.
model: (if return_model=TRUE) Python CISSVAE model object. Can be used for further analysis or predictions.
dataset: (if return_dataset=TRUE) Python ClusterDataset object containing validation data, masks, normalization parameters, and cluster labels. Can be used with performance_by_cluster() and other analysis functions.
silhouettes: (if return_silhouettes=TRUE) Numeric silhouette score measuring cluster separation quality.
history: (if return_history=TRUE) data.frame containing training history with columns for epoch, losses, and validation metrics.
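
Because most elements are only present when the matching return_* flag is set, it can help to check which ones came back before using them. The snippet below is a sketch that uses a mock list in place of a real run_cissvae() result; the values are made up, and only the element names are taken from the list above.

```r
# Mock list standing in for a run_cissvae() result (values are made up),
# as if called with return_model = FALSE and return_history = TRUE.
result <- list(
  imputed = data.frame(sample_id = c("a", "b"), x1 = c(0.2, 1.4)),
  history = data.frame(epoch = 1:3)
)

# Element names documented as optional outputs.
optional_outputs <- c("model", "dataset", "silhouettes", "history")

present <- intersect(optional_outputs, names(result))
missing_outputs <- setdiff(optional_outputs, names(result))

present          # "history"
missing_outputs  # "model" "dataset" "silhouettes"
```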

Details

The CISS-VAE method works in two main phases:

Initial Training: The model is trained on the original data with validation holdout to learn initial representations and imputation patterns.
Impute-Refit Loops: The model iteratively imputes missing values and retrains on the updated dataset until convergence or until max_loops is reached.
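
The stopping rule for the second phase can be sketched in plain R. This illustrates patience-based early stopping as described for the max_loops and patience arguments; it is not the package's implementation. The names run_refit_loops and refit_once are hypothetical, and the validation losses are made-up numbers.

```r
# Hypothetical sketch of patience-based stopping over impute-refit loops.
# refit_once(loop) stands in for "impute, retrain, return validation loss".
run_refit_loops <- function(refit_once, max_loops = 100, patience = 2) {
  best_loss <- Inf
  loops_without_improvement <- 0
  losses <- numeric(0)

  for (loop in seq_len(max_loops)) {
    loss <- refit_once(loop)
    losses <- c(losses, loss)

    if (loss < best_loss) {
      best_loss <- loss
      loops_without_improvement <- 0
    } else {
      loops_without_improvement <- loops_without_improvement + 1
    }

    # Stop once the loss has failed to improve for `patience` loops in a row.
    if (loops_without_improvement >= patience) break
  }

  list(loops_run = length(losses), best_loss = best_loss, losses = losses)
}

# Made-up validation losses: improvement stalls after the third loop.
toy_losses <- c(0.90, 0.70, 0.65, 0.66, 0.67, 0.60)
out <- run_refit_loops(function(loop) toy_losses[loop], max_loops = 6)
out$loops_run  # 5: loops 4 and 5 did not beat 0.65
out$best_loss  # 0.65
```

With patience = 2 the sixth loop is never run, even though its loss would have been lower; a larger patience trades extra training time for a chance to recover from such plateaus.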

The architecture uses both shared and cluster-specific layers to capture:

Shared patterns: Common relationships across all clusters
Specific patterns: Unique relationships within each cluster
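
To make the distinction concrete, the toy forward pass below applies one linear layer in two ways: with a single weight matrix for every sample ("shared") and with a separate weight matrix per cluster ("unshared"). It is a conceptual sketch in base R with random weights and hypothetical names (apply_shared, apply_unshared). The actual model is built in Python and may differ in details such as biases and activations.

```r
set.seed(42)
x <- matrix(rnorm(6 * 4), nrow = 6)  # 6 samples, 4 features
clusters <- c(0, 0, 1, 1, 2, 2)      # hypothetical cluster labels

# Shared layer: one 4 x 3 weight matrix used for all samples.
w_shared <- matrix(rnorm(4 * 3), nrow = 4)
apply_shared <- function(x, w) x %*% w

# Unshared layer: one 4 x 3 weight matrix per cluster.
w_by_cluster <- lapply(unique(clusters), function(k) matrix(rnorm(4 * 3), nrow = 4))
names(w_by_cluster) <- as.character(unique(clusters))

apply_unshared <- function(x, clusters, w_list) {
  out <- matrix(NA_real_, nrow = nrow(x), ncol = ncol(w_list[[1]]))
  for (k in names(w_list)) {
    rows <- which(as.character(clusters) == k)
    # Samples in cluster k only pass through cluster k's weights.
    out[rows, ] <- x[rows, , drop = FALSE] %*% w_list[[k]]
  }
  out
}

h_shared <- apply_shared(x, w_shared)
h_unshared <- apply_unshared(x, clusters, w_by_cluster)
dim(h_shared)    # 6 3
dim(h_unshared)  # 6 3
```

Both outputs have the same shape, so shared and unshared layers can be stacked in any order, which is what the layer_order_enc and layer_order_dec arguments express.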

Requirements

This function requires the Python ciss_vae package to be installed and accessible via reticulate. The package handles automatic device selection (CPU/GPU) based on availability.
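
One way to confirm the requirement is met before a long run is to ask reticulate whether the module can be imported. The helper name ciss_vae_available is hypothetical; reticulate::py_module_available() is an existing reticulate function, and the module name ciss_vae is taken from the description above.

```r
# Hypothetical helper: reports whether the Python module can be imported.
ciss_vae_available <- function() {
  if (!requireNamespace("reticulate", quietly = TRUE)) {
    return(FALSE)
  }
  # py_module_available() returns TRUE/FALSE rather than raising an error.
  reticulate::py_module_available("ciss_vae")
}

# Only probe Python in interactive sessions to avoid side effects in scripts.
if (interactive() && !ciss_vae_available()) {
  message("Python module 'ciss_vae' not found; install it into the ",
          "Python environment that reticulate uses.")
}
```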

Performance Tips

Use GPU computation when available for faster training on large datasets
Adjust batch_size based on available memory (larger batches train faster but need more memory)
Start with default hyperparameters and adjust based on validation performance
Use verbose=TRUE to monitor training progress on large datasets
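
When adjusting hyperparameters, it can help to see how initial_lr and decay_factor interact over a run. The sketch below assumes the learning rate is multiplied by decay_factor once per epoch, which is one common reading of "exponential decay" but is not confirmed by this page; lr_schedule is a hypothetical helper.

```r
# Assumption: the learning rate is multiplied by decay_factor once per epoch.
lr_schedule <- function(initial_lr = 0.01, decay_factor = 0.999, epochs = 500) {
  initial_lr * decay_factor^(0:epochs)
}

lr_default <- lr_schedule()                  # documented defaults
lr_fast <- lr_schedule(decay_factor = 0.99)  # a faster decay for comparison

# Learning rate at the start and after the final epoch under each setting.
round(c(start = lr_default[1], end = lr_default[501]), 5)
round(c(start = lr_fast[1], end = lr_fast[501]), 7)
```

Under this assumption, small changes to decay_factor compound over hundreds of epochs, so it is worth checking the final learning rate before increasing epochs.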

Examples

if (FALSE) { # \dontrun{
# Basic usage with automatic clustering
result <- run_cissvae(
  data = my_data_with_missing,
  index_col = "sample_id"
)
imputed_data <- result$imputed

# Advanced usage with dataset for performance analysis
result <- run_cissvae(
  data = my_data,
  clusters = my_cluster_labels,
  hidden_dims = c(200, 150, 100),
  latent_dim = 20,
  epochs = 1000,
  return_history = TRUE,
  return_silhouettes = TRUE,
  return_dataset = TRUE,
  verbose = TRUE
)

# Access different outputs
imputed_data <- result$imputed
training_history <- result$history
cluster_quality <- result$silhouettes

# Use dataset for performance analysis
perf <- performance_by_cluster(
  original_data = my_data,
  model = result$model,
  dataset = result$dataset,
  clusters = my_cluster_labels
)

# Using pre-computed missingness matrix for clustering
prop_matrix <- create_missingness_prop_matrix(
  data = my_data,
  index_col = "sample_id"
)
result <- run_cissvae(
  data = my_data,
  index_col = "sample_id",
  missingness_proportion_matrix = prop_matrix,
  scale_features = TRUE,
  return_dataset = TRUE
)

# Custom layer sharing patterns
result <- run_cissvae(
  data = my_data,
  hidden_dims = c(100, 80, 60),
  layer_order_enc = c("unshared", "shared", "shared"),
  layer_order_dec = c("shared", "shared", "unshared"),
  latent_shared = TRUE
)
} # }