This function wraps the Python run_cissvae function from the ciss_vae package, providing a complete pipeline for missing data imputation using a Cluster-Informed Shared and Specific Variational Autoencoder (CISS-VAE). The function handles data preprocessing and model training, and returns the imputed data along with optional model artifacts.

The CISS-VAE architecture uses cluster information to learn both shared and cluster-specific representations, enabling more accurate imputation by leveraging patterns within and across different data subgroups.

Usage

run_cissvae(
  data,
  index_col = NULL,
  val_proportion = 0.1,
  replacement_value = 0,
  columns_ignore = NULL,
  imputable_matrix = NULL,
  binary_feature_mask = NULL,
  print_dataset = TRUE,
  clusters = NULL,
  n_clusters = NULL,
  seed = 42,
  missingness_proportion_matrix = NULL,
  scale_features = FALSE,
  k_neighbors = 15L,
  leiden_resolution = 0.5,
  leiden_objective = "CPM",
  hidden_dims = c(150, 120, 60),
  latent_dim = 15,
  layer_order_enc = c("unshared", "unshared", "unshared"),
  layer_order_dec = c("shared", "shared", "shared"),
  latent_shared = FALSE,
  output_shared = FALSE,
  batch_size = 4000,
  epochs = 500,
  initial_lr = 0.01,
  decay_factor = 0.999,
  weight_decay = 0.001,
  beta = 0.001,
  device = NULL,
  max_loops = 100,
  patience = 2,
  epochs_per_loop = NULL,
  initial_lr_refit = NULL,
  decay_factor_refit = NULL,
  beta_refit = NULL,
  verbose = FALSE,
  return_model = TRUE,
  return_clusters = FALSE,
  return_silhouettes = FALSE,
  return_history = FALSE,
  return_dataset = FALSE,
  return_validation_dataset = FALSE,
  debug = FALSE
)

Arguments

data

A data.frame or matrix (samples × features) containing the data to impute. May contain NA values, which will be imputed.

index_col

Character. Name of column in data to treat as sample identifier. This column will be removed before training and re-attached to results. Default NULL.

val_proportion

Numeric. Fraction of non-missing entries to hold out for validation during training. Must be between 0 and 1. Default 0.1.
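To make the idea of a validation holdout concrete, the following base R sketch hides a fraction of the observed entries of a small random matrix. It is only an illustration under simple assumptions (one global proportion, uniform sampling over observed cells); the package's actual masking logic lives in the Python ciss_vae code and may differ, for example by sampling within clusters.

```r
# Illustrative only: hold out a fraction of the observed (non-NA) entries.
set.seed(42)
x <- matrix(rnorm(40), nrow = 10, ncol = 4)
x[sample(length(x), 8)] <- NA          # introduce some missing values

val_proportion <- 0.1
observed <- which(!is.na(x))           # linear indices of observed entries
n_val <- floor(val_proportion * length(observed))
val_idx <- sample(observed, n_val)     # entries reserved for validation

val_data <- matrix(NA_real_, nrow(x), ncol(x))
val_data[val_idx] <- x[val_idx]        # keep the true values for scoring

train <- x
train[val_idx] <- NA                   # hide the held-out values from training
```

Here `train` plays the role of the data the model sees, while `val_data` keeps the hidden true values so that imputations can be scored against them.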

replacement_value

Numeric. Fill value for masked entries during training. Default 0.0.

columns_ignore

Character or integer vector. Columns to exclude from validation set. Can specify by name or index. Default NULL.

imputable_matrix

Logical matrix indicating which entries are allowed to be imputed. Default NULL.

binary_feature_mask

Logical vector marking which columns are binary. Default NULL.

print_dataset

Logical. If TRUE, prints dataset summary information during processing. Default TRUE.

clusters

Optional vector or single-column data.frame of precomputed cluster labels for samples. If NULL, clustering will be performed automatically. Default NULL.

n_clusters

Integer. Number of clusters for KMeans clustering when clusters is NULL. If NULL, Leiden clustering is used instead. Default NULL.

seed

Integer. Random seed for reproducible results. Default 42.

missingness_proportion_matrix

Optional pre-computed missingness proportion matrix for biomarker-based clustering. If provided, clustering will be based on these proportions. Default NULL.
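As a rough picture of what a missingness proportion matrix can look like, the base R sketch below computes, for each sample, the fraction of missing values within each group of repeated measurements. The column names (`glucose_t1`, etc.) and the `_t<number>` suffix convention are assumptions made for this example; the layout expected by the package is defined by create_missingness_prop_matrix, whose interface is not reproduced here.

```r
# Toy data: two hypothetical biomarkers, each measured at two timepoints.
df <- data.frame(
  glucose_t1 = c(1.2, NA, 0.8),
  glucose_t2 = c(NA, NA, 1.1),
  insulin_t1 = c(3.0, 2.5, NA),
  insulin_t2 = c(3.1, 2.7, 2.9)
)

# Strip the assumed timepoint suffix to recover the biomarker name.
groups <- sub("_t[0-9]+$", "", names(df))

# One column per biomarker: per-sample proportion of missing measurements.
prop_matrix <- sapply(unique(groups), function(g) {
  rowMeans(is.na(df[, groups == g, drop = FALSE]))
})
```

Samples with similar rows in `prop_matrix` share a similar missingness pattern, which is the kind of signal a clustering step can pick up.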

scale_features

Logical. Whether to scale features when using missingness proportion matrix clustering. Default FALSE.

k_neighbors

Integer. Number of nearest neighbors for Leiden clustering. Defaults to 15.

leiden_resolution

Numeric. Resolution parameter for Leiden clustering. Defaults to 0.5.

leiden_objective

Character. Objective function for Leiden clustering. One of "CPM", "RB", or "Modularity". Default "CPM".

hidden_dims

Integer vector. Sizes of hidden layers in encoder/decoder. Length determines number of hidden layers. Default c(150, 120, 60).

latent_dim

Integer. Dimension of latent space representation. Default 15.

layer_order_enc

Character vector. Sharing pattern for encoder layers. Each element should be "shared" or "unshared". Length must match length(hidden_dims). Default c("unshared", "unshared", "unshared").

layer_order_dec

Character vector. Sharing pattern for decoder layers. Each element should be "shared" or "unshared". Length must match length(hidden_dims). Default c("shared", "shared", "shared").
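Because both layer-order vectors must line up with hidden_dims, it can help to check them before starting a long training run. The helper below, `check_layer_order()`, is hypothetical and not part of rCISSVAE; it only encodes the two constraints stated above.

```r
hidden_dims <- c(150, 120, 60)
layer_order_enc <- c("unshared", "shared", "unshared")
layer_order_dec <- c("shared", "unshared", "shared")

# Hypothetical helper (not exported by rCISSVAE): TRUE when the vector has
# one entry per hidden layer and every entry is "shared" or "unshared".
check_layer_order <- function(layer_order, hidden_dims) {
  length(layer_order) == length(hidden_dims) &&
    all(layer_order %in% c("shared", "unshared"))
}

enc_ok <- check_layer_order(layer_order_enc, hidden_dims)
dec_ok <- check_layer_order(layer_order_dec, hidden_dims)
```

A failing check, such as a two-element vector paired with three hidden layers, points to a mismatch that would otherwise surface only once the Python model is constructed.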

latent_shared

Logical. Whether latent space weights are shared across clusters. Default FALSE.

output_shared

Logical. Whether output layer weights are shared across clusters. Default FALSE.

batch_size

Integer. Mini-batch size for training. Larger values may improve training stability but require more memory. Default 4000.

epochs

Integer. Number of epochs for initial training phase. Default 500.

initial_lr

Numeric. Initial learning rate for optimizer. Default 0.01.

decay_factor

Numeric. Exponential decay factor for learning rate scheduling. Must be between 0 and 1. Default 0.999.
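To get a feel for how initial_lr and decay_factor interact, the sketch below evaluates an exponential schedule with the default values. It assumes the decay is applied once per epoch, which is a common convention but is not confirmed by this page.

```r
initial_lr <- 0.01
decay_factor <- 0.999
epochs <- 500

# Learning rate after 0, 1, ..., epochs decay steps (one step per epoch assumed)
lr <- initial_lr * decay_factor^(0:epochs)
lr_final <- lr[epochs + 1]   # roughly 0.006 after 500 steps
```

Under that assumption the defaults reduce the learning rate by about 40% over the initial training phase; a smaller decay_factor shrinks it much faster.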

weight_decay

Numeric. Weight decay (L2 penalty) used in the Adam optimizer. Default 0.001.

beta

Numeric. Weight for KL divergence term in VAE loss function. Controls regularization strength. Default 0.001.
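The role of beta is easiest to see in a generic beta-weighted VAE objective: a reconstruction term plus beta times the KL divergence between the approximate posterior and a standard normal prior. The function below, `vae_loss()`, is a hypothetical illustration using mean squared error for reconstruction; the exact loss in ciss_vae (for example how masked or binary entries are handled) is not specified on this page.

```r
# Hypothetical illustration of a beta-weighted VAE loss (not the package's code).
vae_loss <- function(x, x_hat, mu, logvar, beta = 0.001) {
  recon <- mean((x - x_hat)^2)                        # reconstruction error (MSE assumed)
  kl <- -0.5 * sum(1 + logvar - mu^2 - exp(logvar))   # KL(N(mu, sigma^2) || N(0, I))
  recon + beta * kl
}

x      <- c(1, 2, 3)
x_hat  <- c(1, 2, 4)
loss_a <- vae_loss(x, x_hat, mu = c(0, 0), logvar = c(0, 0))  # KL term is zero
loss_b <- vae_loss(x, x_hat, mu = c(1, 0), logvar = c(0, 0))  # KL term is 0.5
```

With the default beta of 0.001 the KL term contributes little, so the fit is driven mostly by reconstruction; larger values pull the latent distribution more strongly toward the prior.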

device

Character. Device specification for computation ("cpu" or "cuda"). If NULL, automatically selects best available device. Default NULL.

max_loops

Integer. Maximum number of impute-refit loops to perform. Default 100.

patience

Integer. Early stopping patience for refit loops. Training stops if validation loss doesn't improve for this many consecutive loops. Default 2.

epochs_per_loop

Integer. Number of epochs per refit loop. If NULL, uses same value as epochs. Default NULL.

initial_lr_refit

Numeric. Learning rate for refit loops. If NULL, uses same value as initial_lr. Default NULL.

decay_factor_refit

Numeric. Decay factor for refit loops. If NULL, uses same value as decay_factor. Default NULL.

beta_refit

Numeric. KL weight for refit loops. If NULL, uses same value as beta. Default NULL.

verbose

Logical. If TRUE, prints detailed progress information during training. Default FALSE.

return_model

Logical. If TRUE, returns the trained Python VAE model object. Default TRUE.

return_clusters

Logical. If TRUE, returns the vector of cluster assignments. Default FALSE.

return_silhouettes

Logical. If TRUE, returns silhouette scores for cluster quality assessment. Default FALSE.

return_history

Logical. If TRUE, returns training history as a data.frame containing loss values and metrics over epochs. Default FALSE.

return_dataset

Logical. If TRUE, returns the ClusterDataset object used during training (contains validation data, masks, etc.). Default FALSE.

return_validation_dataset

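Logical. If TRUE, returns the values held out for validation together with their imputed counterparts (val_data and val_imputed). Default FALSE.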

debug

Logical. If TRUE, additional metadata is returned for debugging. Default FALSE.

Value

A list containing imputed data and optional additional outputs:

imputed_dataset

data.frame of imputed data with same dimensions as input. Missing values are filled with model predictions. If index_col was provided, it is re-attached as the first column.

model

(if return_model=TRUE) Python CISSVAE model object. Can be used for further analysis or predictions.

cluster_dataset

(if return_dataset=TRUE) Python ClusterDataset object containing validation data, masks, normalization parameters, and cluster labels. Can be used with performance_by_cluster() and other analysis functions.

clusters

(if return_clusters=TRUE) Vector of cluster assignments.

silhouettes

(if return_silhouettes=TRUE) Numeric silhouette score measuring cluster separation quality.

training_history

(if return_history=TRUE) data.frame containing training history with columns for epoch, losses, and validation metrics.

val_data

(if return_validation_dataset=TRUE) data.frame containing values held aside for validation.

val_imputed

(if return_validation_dataset=TRUE) data.frame containing the imputed values for the entries held aside for validation.

Details

The CISS-VAE method works in two main phases:

  1. Initial Training: The model is trained on the original data with validation holdout to learn initial representations and imputation patterns.

  2. Impute-Refit Loops: The model iteratively imputes missing values and retrains on the updated dataset until convergence or maximum loops reached.
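The stopping rule for the impute-refit phase, governed by max_loops and patience, can be sketched as plain control flow. The validation losses below are made up for illustration, and the rule shown (stop after patience consecutive loops without a new best loss) is an assumption based on the argument descriptions, not the package's code.

```r
max_loops <- 100
patience <- 2

# Made-up validation losses, one per refit loop, for illustration only.
val_loss <- c(0.50, 0.42, 0.40, 0.41, 0.43, 0.39)

best_loss <- Inf
loops_without_improvement <- 0
loops_run <- 0

for (i in seq_len(min(max_loops, length(val_loss)))) {
  loops_run <- i
  if (val_loss[i] < best_loss) {
    best_loss <- val_loss[i]               # new best: reset the counter
    loops_without_improvement <- 0
  } else {
    loops_without_improvement <- loops_without_improvement + 1
  }
  if (loops_without_improvement >= patience) break   # early stop
}
```

In this toy run the loop stops after the fifth iteration, so the later value of 0.39 is never reached. This is the usual trade-off of a small patience: less compute, at the risk of stopping before a late improvement.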

The architecture uses both shared and cluster-specific layers to capture:

  • Shared patterns: Common relationships across all clusters

  • Specific patterns: Unique relationships within each cluster

Requirements

This function requires the Python ciss_vae package to be installed and accessible via reticulate.
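One way to confirm this requirement before calling run_cissvae() is to ask reticulate whether the module can be found. The sketch below uses reticulate::py_module_available(); wrapping it in tryCatch() and naming the result `cissvae_available` are choices made for this example.

```r
# Check whether the Python module is visible to reticulate.
# Falls back to FALSE if reticulate is missing or Python cannot be initialized.
cissvae_available <- tryCatch(
  requireNamespace("reticulate", quietly = TRUE) &&
    reticulate::py_module_available("ciss_vae"),
  error = function(e) FALSE
)

if (!cissvae_available) {
  message("Python module 'ciss_vae' not found; run_cissvae() will fail until it is installed.")
}
```

If the check returns FALSE, calling run_cissvae() is expected to fail with a ModuleNotFoundError for ciss_vae.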

Performance tips

  • If Leiden clustering yields too many clusters, consider increasing k_neighbors or reducing leiden_resolution.

  • Use GPU computation when available for faster training on large datasets. Use check_devices() to see what devices are available.

  • Adjust batch_size based on available memory (larger is faster but uses more memory).

  • Set verbose = TRUE to monitor training progress.

See also

create_missingness_prop_matrix for creating missingness proportion matrices

performance_by_cluster for analyzing model performance using the returned dataset

Examples

# \donttest{
## Requires a working Python environment via reticulate
## Examples are wrapped in try() to avoid failures on CRAN check systems
library(rCISSVAE)

data(df_missing)
data(clusters)

try({
dat = run_cissvae(
 data = df_missing,
 index_col = "index",
 val_proportion = 0.1, ## pass a vector for different proportions by cluster
 columns_ignore = c("Age", "Salary", "ZipCode10001", "ZipCode20002", "ZipCode30003"), 
 clusters = clusters$clusters, ## we have precomputed cluster labels so we pass them here
 epochs = 5,
 return_silhouettes = FALSE,
 return_history = TRUE,  # Get detailed training history
 verbose = FALSE,
 return_model = TRUE, ## Allows for plotting model schematic
 device = "cpu",  # Explicit device selection
 layer_order_enc = c("unshared", "shared", "unshared"),
 layer_order_dec = c("shared", "unshared", "shared")
)
})
#> Error in py_module_import(module, convert = convert) : 
#>   ModuleNotFoundError: No module named 'ciss_vae'
#> Run `reticulate::py_last_error()` for details.
# }