This function wraps the Python run_cissvae function from the ciss_vae module, providing a complete pipeline for missing data imputation using a Cluster-Informed Shared and Specific Variational Autoencoder (CISS-VAE). It handles data preprocessing and model training, and returns the imputed data along with optional model artifacts.

The CISS-VAE architecture uses cluster information to learn both shared and cluster-specific representations, enabling more accurate imputation by leveraging patterns within and across different data subgroups.

Usage

run_cissvae(
  data,
  index_col = NULL,
  val_proportion = 0.1,
  replacement_value = 0,
  columns_ignore = NULL,
  print_dataset = TRUE,
  clusters = NULL,
  n_clusters = NULL,
  seed = 42,
  missingness_proportion_matrix = NULL,
  scale_features = FALSE,
  k_neighbors = 15L,
  leiden_resolution = 0.5,
  leiden_objective = "CPM",
  hidden_dims = c(150, 120, 60),
  latent_dim = 15,
  layer_order_enc = c("unshared", "unshared", "unshared"),
  layer_order_dec = c("shared", "shared", "shared"),
  latent_shared = FALSE,
  output_shared = FALSE,
  batch_size = 4000,
  epochs = 500,
  initial_lr = 0.01,
  decay_factor = 0.999,
  beta = 0.001,
  device = NULL,
  max_loops = 100,
  patience = 2,
  epochs_per_loop = NULL,
  initial_lr_refit = NULL,
  decay_factor_refit = NULL,
  beta_refit = NULL,
  verbose = FALSE,
  return_model = TRUE,
  return_clusters = FALSE,
  return_silhouettes = FALSE,
  return_history = FALSE,
  return_dataset = FALSE,
  do_not_impute_matrix = NULL,
  debug = FALSE
)

Arguments

data

A data.frame or matrix (samples × features) containing the data to impute. May contain NA values which will be imputed.

index_col

Character. Name of column in data to treat as sample identifier. This column will be removed before training and re-attached to results. Default NULL.

val_proportion

Numeric. Fraction of non-missing entries to hold out for validation during training. Must be between 0 and 1. Default 0.1.
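As a rough illustration of what holding out a fraction of non-missing entries means, the base R sketch below draws a validation mask over the observed cells of a matrix. The helper make_val_mask is hypothetical and not part of the package; the actual sampling scheme used by ciss_vae (for example, whether it samples per cluster) may differ.

```r
# Hypothetical helper (not part of the package): pick a validation mask
# over the observed (non-NA) cells of a numeric matrix.
make_val_mask <- function(x, val_proportion = 0.1, seed = 42) {
  x <- as.matrix(x)
  set.seed(seed)
  observed <- which(!is.na(x))                      # linear indices of observed cells
  n_val <- floor(length(observed) * val_proportion) # number of cells to hold out
  held_out <- observed[sample.int(length(observed), n_val)]
  mask <- matrix(FALSE, nrow(x), ncol(x), dimnames = dimnames(x))
  mask[held_out] <- TRUE
  mask
}

x <- matrix(c(1, NA, 3, 4, 5, NA, 7, 8, 9, 10, 11, 12), nrow = 4)
mask <- make_val_mask(x, val_proportion = 0.2)

# Masked cells are hidden from training and filled with replacement_value
# (assumed here to be 0, matching the documented default).
x_train <- x
x_train[mask | is.na(x)] <- 0
```

Only cells that were observed in the input can be held out, so the validation error is always computed against known values.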

replacement_value

Numeric. Fill value for masked entries during training. Default 0.0.

columns_ignore

Character or integer vector. Columns to exclude from validation set. Can specify by name or index. Default NULL.

print_dataset

Logical. If TRUE, prints dataset summary information during processing. Default TRUE.

clusters

Optional vector or single-column data.frame of precomputed cluster labels for samples. If NULL, clustering will be performed automatically. Default NULL.

n_clusters

Integer. Number of clusters for KMeans clustering when clusters is NULL. If NULL, HDBSCAN is used for clustering instead. Default NULL.

seed

Integer. Random seed for reproducible results. Default 42.

missingness_proportion_matrix

Optional pre-computed missingness proportion matrix for biomarker-based clustering. If provided, clustering will be based on these proportions. Default NULL.

scale_features

Logical. Whether to scale features when using missingness proportion matrix clustering. Default FALSE.

hidden_dims

Integer vector. Sizes of hidden layers in encoder/decoder. Length determines number of hidden layers. Default c(150, 120, 60).

latent_dim

Integer. Dimension of latent space representation. Default 15.

layer_order_enc

Character vector. Sharing pattern for encoder layers. Each element should be "shared" or "unshared". Length must match length(hidden_dims). Default c("unshared", "unshared", "unshared").

layer_order_dec

Character vector. Sharing pattern for decoder layers. Each element should be "shared" or "unshared". Length must match length(hidden_dims). Default c("shared", "shared", "shared").
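Because both sharing patterns must line up with hidden_dims, it can help to validate them before starting a long training run. The check below is a minimal sketch; check_layer_orders is a hypothetical helper, and the package may perform its own validation.

```r
# Hypothetical pre-flight check (not part of the package): both sharing
# patterns need one entry per hidden layer, each "shared" or "unshared".
check_layer_orders <- function(hidden_dims, layer_order_enc, layer_order_dec) {
  allowed <- c("shared", "unshared")
  stopifnot(
    length(layer_order_enc) == length(hidden_dims),
    length(layer_order_dec) == length(hidden_dims),
    all(layer_order_enc %in% allowed),
    all(layer_order_dec %in% allowed)
  )
  invisible(TRUE)
}

# Passes silently: three hidden layers, three entries per pattern.
check_layer_orders(
  hidden_dims = c(100, 80, 60),
  layer_order_enc = c("unshared", "shared", "shared"),
  layer_order_dec = c("shared", "shared", "unshared")
)
```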

latent_shared

Logical. Whether latent space weights are shared across clusters. Default FALSE.

output_shared

Logical. Whether output layer weights are shared across clusters. Default FALSE.

batch_size

Integer. Mini-batch size for training. Larger values may improve training stability but require more memory. Default 4000.

epochs

Integer. Number of epochs for initial training phase. Default 500.

initial_lr

Numeric. Initial learning rate for optimizer. Default 0.01.

decay_factor

Numeric. Exponential decay factor for learning rate scheduling. Must be between 0 and 1. Default 0.999.
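To get a feel for how initial_lr and decay_factor interact, the sketch below computes an exponential schedule in base R. It assumes the decay is applied once per epoch, which the documentation above does not state explicitly; lr_schedule is a hypothetical helper.

```r
# Hypothetical helper: learning rate at each epoch under exponential decay,
# assuming one decay step per epoch.
lr_schedule <- function(initial_lr = 0.01, decay_factor = 0.999, epochs = 500) {
  initial_lr * decay_factor^(seq_len(epochs) - 1)
}

lrs <- lr_schedule()
lrs[1]            # the first epoch uses initial_lr unchanged
lrs[length(lrs)]  # the final epoch's rate, equal to 0.01 * 0.999^499
```

With a decay factor this close to 1 the rate shrinks slowly, so smaller values of decay_factor are only worth trying if training oscillates late in the run.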

beta

Numeric. Weight for KL divergence term in VAE loss function. Controls regularization strength. Default 0.001.
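The role of beta can be sketched as a weighted sum of a reconstruction term and a KL term. The KL expression below is the standard closed form for a diagonal Gaussian against a standard normal prior; the reconstruction term (squared error over observed cells) is an assumption made for illustration, and the loss actually used by ciss_vae may be defined differently.

```r
# KL divergence between N(mu, diag(exp(logvar))) and N(0, I).
kl_divergence <- function(mu, logvar) {
  -0.5 * sum(1 + logvar - mu^2 - exp(logvar))
}

# Illustrative beta-weighted loss. The squared-error reconstruction term
# restricted to observed cells is an assumption, not the package's definition.
vae_loss <- function(x, x_hat, mu, logvar, observed, beta = 0.001) {
  recon <- sum((x[observed] - x_hat[observed])^2)
  recon + beta * kl_divergence(mu, logvar)
}

loss <- vae_loss(
  x = c(1, 2, 3), x_hat = c(1, 2, 5),
  mu = c(1, 0), logvar = c(0, 0),
  observed = c(TRUE, TRUE, FALSE)
)
```

A larger beta pulls the latent distribution toward the prior at the cost of reconstruction accuracy; the small default keeps the emphasis on reconstruction.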

device

Character. Device specification for computation ("cpu" or "cuda"). If NULL, automatically selects best available device. Default NULL.

max_loops

Integer. Maximum number of impute-refit loops to perform. Default 100.

patience

Integer. Early stopping patience for refit loops. Training stops if validation loss doesn't improve for this many consecutive loops. Default 2.
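The patience rule can be illustrated on a plain vector of per-loop validation losses. This is a sketch of the stopping logic as described above, not the package's implementation; loops_until_stop is a hypothetical helper.

```r
# Hypothetical helper: index of the loop at which training would stop,
# given one validation loss per refit loop.
loops_until_stop <- function(val_losses, patience = 2) {
  best <- Inf
  stale <- 0
  for (i in seq_along(val_losses)) {
    if (val_losses[i] < best) {
      best <- val_losses[i]   # new best loss resets the counter
      stale <- 0
    } else {
      stale <- stale + 1      # another loop without improvement
    }
    if (stale >= patience) return(i)
  }
  length(val_losses)
}

# Loops 3 and 4 fail to beat the best loss of 0.7, so training stops at loop 4
# and the better loss at loop 5 is never reached.
loops_until_stop(c(0.9, 0.7, 0.72, 0.71, 0.5))  # 4
```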

epochs_per_loop

Integer. Number of epochs per refit loop. If NULL, uses same value as epochs. Default NULL.

initial_lr_refit

Numeric. Learning rate for refit loops. If NULL, uses same value as initial_lr. Default NULL.

decay_factor_refit

Numeric. Decay factor for refit loops. If NULL, uses same value as decay_factor. Default NULL.

beta_refit

Numeric. KL weight for refit loops. If NULL, uses same value as beta. Default NULL.

verbose

Logical. If TRUE, prints detailed progress information during training. Default FALSE.

return_model

Logical. If TRUE, returns the trained Python VAE model object. Default TRUE.

return_silhouettes

Logical. If TRUE, returns silhouette scores for cluster quality assessment. Default FALSE.

return_history

Logical. If TRUE, returns training history as a data.frame containing loss values and metrics over epochs. Default FALSE.

return_dataset

Logical. If TRUE, returns the ClusterDataset object used during training (contains validation data, masks, etc.). Default FALSE.

cluster_selection_epsilon

Numeric. Epsilon parameter for HDBSCAN clustering when automatic clustering is used. Default 0.25.

Value

A list containing imputed data and optional additional outputs:

imputed

data.frame of imputed data with same dimensions as input. Missing values are filled with model predictions. If index_col was provided, it is re-attached as the first column.

model

(if return_model=TRUE) Python CISSVAE model object. Can be used for further analysis or predictions.

dataset

(if return_dataset=TRUE) Python ClusterDataset object containing validation data, masks, normalization parameters, and cluster labels. Can be used with performance_by_cluster() and other analysis functions.

silhouettes

(if return_silhouettes=TRUE) Numeric silhouette score measuring cluster separation quality.

history

(if return_history=TRUE) data.frame containing training history with columns for epoch, losses, and validation metrics.

Details

The CISS-VAE method works in two main phases:

  1. Initial Training: The model is trained on the original data with validation holdout to learn initial representations and imputation patterns.

  2. Impute-Refit Loops: The model iteratively imputes missing values and retrains on the updated dataset until the validation loss stops improving (see patience) or max_loops is reached.
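The control flow of the two phases can be sketched in base R with a toy model. In the example below a rank-1 SVD reconstruction stands in for the VAE, the initial column-mean fill stands in for the initial training phase, and reconstruction error on observed cells stands in for the validation loss. All of these are simplifications chosen so the example runs without Python; impute_refit is a hypothetical function.

```r
# Toy impute-refit loop. A rank-1 SVD reconstruction stands in for the VAE;
# this is NOT how CISS-VAE fits its model, only the loop structure is similar.
impute_refit <- function(x, max_loops = 100, tol = 1e-8) {
  x <- as.matrix(x)
  missing <- is.na(x)
  filled <- x
  # Initial fill with column means (stand-in for the initial training phase).
  filled[missing] <- colMeans(x, na.rm = TRUE)[col(x)[missing]]
  prev_err <- Inf
  for (loop in seq_len(max_loops)) {
    s <- svd(filled, nu = 1, nv = 1)               # "refit" on the filled data
    recon <- s$d[1] * s$u %*% t(s$v)
    err <- mean((recon[!missing] - x[!missing])^2) # error on observed cells
    filled[missing] <- recon[missing]              # "impute" the missing cells
    if (prev_err - err < tol) break                # stop once improvement stalls
    prev_err <- err
  }
  list(imputed = filled, loops = loop, error = err)
}

x <- outer(1:5, 1:4)   # an exactly rank-1 matrix
x[2, 3] <- NA
x[4, 1] <- NA
res <- impute_refit(x)
```

Observed cells are never overwritten; only the cells that were NA in the input change from loop to loop.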

The architecture uses both shared and cluster-specific layers to capture:

  • Shared patterns: Common relationships across all clusters

  • Specific patterns: Unique relationships within each cluster

Requirements

This function requires the Python ciss_vae package to be installed and accessible via reticulate. The package handles automatic device selection (CPU/GPU) based on availability.
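A quick way to confirm the requirement is met before calling run_cissvae is to ask reticulate whether the module can be imported. The sketch below uses reticulate::py_module_available; the wrapper ciss_vae_available is a hypothetical helper, and which Python environment reticulate picks up depends on your local configuration.

```r
# Hypothetical helper: TRUE if reticulate is installed and the Python
# module 'ciss_vae' can be imported from the active Python environment.
ciss_vae_available <- function() {
  requireNamespace("reticulate", quietly = TRUE) &&
    reticulate::py_module_available("ciss_vae")
}

if (!ciss_vae_available()) {
  message("Python module 'ciss_vae' not found in the active reticulate environment.")
}
```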

Performance Tips

  • Use GPU computation when available for faster training on large datasets

  • Adjust batch_size based on available memory (larger = faster but more memory)

  • Start with default hyperparameters and adjust based on validation performance

  • Use verbose=TRUE to monitor training progress on large datasets

See also

create_missingness_prop_matrix for creating missingness proportion matrices

performance_by_cluster for analyzing model performance using the returned dataset

Examples

if (FALSE) { # \dontrun{
# Basic usage with automatic clustering
result <- run_cissvae(
  data = my_data_with_missing,
  index_col = "sample_id"
)
imputed_data <- result$imputed

# Advanced usage with dataset for performance analysis
result <- run_cissvae(
  data = my_data,
  clusters = my_cluster_labels,
  hidden_dims = c(200, 150, 100),
  latent_dim = 20,
  epochs = 1000,
  return_history = TRUE,
  return_silhouettes = TRUE,
  return_dataset = TRUE,
  verbose = TRUE
)

# Access different outputs
imputed_data <- result$imputed
training_history <- result$history
cluster_quality <- result$silhouettes

# Use dataset for performance analysis
perf <- performance_by_cluster(
  original_data = my_data,
  model = result$model,
  dataset = result$dataset,
  clusters = my_cluster_labels
)

# Using pre-computed missingness matrix for clustering
prop_matrix <- create_missingness_prop_matrix(
  data = my_data, 
  index_col = "sample_id"
)
result <- run_cissvae(
  data = my_data,
  index_col = "sample_id",
  missingness_proportion_matrix = prop_matrix,
  scale_features = TRUE,
  return_dataset = TRUE
)

# Custom layer sharing patterns
result <- run_cissvae(
  data = my_data,
  hidden_dims = c(100, 80, 60),
  layer_order_enc = c("unshared", "shared", "shared"),
  layer_order_dec = c("shared", "shared", "unshared"),
  latent_shared = TRUE
)
} # }