Wraps the Python run_cissvae function from the ciss_vae module. Handles an optional index_col and returns the imputed data, plus the fitted model and silhouette scores when requested.

Usage

run_cissvae(
  data,
  index_col = NULL,
  val_percent = 0.1,
  replacement_value = 0,
  columns_ignore = NULL,
  print_dataset = TRUE,
  clusters = NULL,
  n_clusters = NULL,
  cluster_selection_epsilon = 0.25,
  seed = 42,
  hidden_dims = c(150, 120, 60),
  latent_dim = 15,
  layer_order_enc = c("unshared", "unshared", "unshared"),
  layer_order_dec = c("shared", "shared", "shared"),
  latent_shared = FALSE,
  output_shared = FALSE,
  batch_size = 4000,
  return_model = TRUE,
  epochs = 500,
  initial_lr = 0.01,
  decay_factor = 0.999,
  beta = 0.001,
  max_loops = 100,
  patience = 2,
  epochs_per_loop = NULL,
  initial_lr_refit = NULL,
  decay_factor_refit = NULL,
  beta_refit = NULL,
  verbose = FALSE,
  return_silhouettes = FALSE
)

Arguments

data

A data.frame or matrix (samples × features), may contain NA.

index_col

Character. Column in data to treat as sample ID; removed before training and re-attached. Default NULL.

val_percent

Numeric fraction of non-missing entries to hold out. Default 0.1.

replacement_value

Numeric fill value for masked entries. Default 0.
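
To illustrate how val_percent and replacement_value might interact, here is a minimal base-R sketch. It is not the package's implementation: it assumes held-out cells are drawn uniformly at random from the observed cells, and that the fill value is applied to both originally missing and held-out cells. The helper name mask_for_validation is hypothetical.

```r
# Hypothetical helper (not part of the package): hold out a fraction of
# the observed cells for validation and fill every masked cell.
mask_for_validation <- function(x, val_percent = 0.1,
                                replacement_value = 0, seed = 42) {
  x <- as.matrix(x)
  set.seed(seed)
  observed <- which(!is.na(x))                    # linear indices of non-missing cells
  n_val <- floor(val_percent * length(observed))  # number of cells to hold out
  held_out <- observed[sample.int(length(observed), n_val)]

  masked <- x
  masked[held_out] <- NA                          # hide held-out cells from training

  filled <- masked
  filled[is.na(filled)] <- replacement_value      # assumed: fills original NAs and held-out cells

  list(filled = filled, held_out = held_out, truth = x[held_out])
}

x <- matrix(c(1, NA, 3, 4, 5, 6, NA, 8, 9, 10, 11, 12), nrow = 4)
res <- mask_for_validation(x, val_percent = 0.2)
length(res$held_out)  # 2 of the 10 observed cells are held out
```

The truth element keeps the original values of the held-out cells, which is what a validation error would be computed against.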

columns_ignore

Character or integer vector of columns to ignore. Default NULL.

print_dataset

Logical; if TRUE, prints dataset summary. Default TRUE.

clusters

Optional vector (or single-column data.frame) of precomputed cluster labels. Default NULL.
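
This reference does not say how precomputed labels should be derived. As one possibility, assuming the clusters are meant to group samples with the same missingness pattern (an assumption, not stated here), a label vector could be built in base R as follows:

```r
set.seed(42)

# Toy data: the first 5 rows miss column b, the last 5 rows miss column c.
df <- data.frame(a = rnorm(10), b = rnorm(10), c = rnorm(10))
df$b[1:5] <- NA
df$c[6:10] <- NA

# 1 where a value is missing, 0 where it is observed
miss <- 1 * is.na(as.matrix(df))

# Collapse each row's pattern to a string such as "010" and turn the
# distinct patterns into integer labels, one label per sample.
pattern <- apply(miss, 1, paste, collapse = "")
clusters <- as.integer(factor(pattern))

table(clusters)
```

The resulting vector has one entry per row of the data and could then be supplied as clusters = clusters.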

n_clusters

Integer number of clusters for KMeans, used when clusters is NULL. Default NULL.

cluster_selection_epsilon

Numeric epsilon for HDBSCAN. Default 0.25.

seed

Integer random seed. Default 42.

hidden_dims

Integer vector of hidden layer sizes. Default c(150,120,60).

latent_dim

Integer latent space dimension. Default 15.

layer_order_enc

Character vector for encoder layer sharing. Default c("unshared","unshared","unshared").

layer_order_dec

Character vector for decoder layer sharing. Default c("shared","shared","shared").
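
The defaults for layer_order_enc and layer_order_dec each have three entries, matching the three entries of hidden_dims. Assuming one "shared" or "unshared" flag is expected per hidden layer (an inference from the defaults, not a documented rule), a quick pre-flight check could look like this. The helper check_layer_order is hypothetical.

```r
# Hypothetical pre-flight check; not part of the package.
check_layer_order <- function(hidden_dims, layer_order) {
  stopifnot(
    length(layer_order) == length(hidden_dims),    # assumed: one flag per hidden layer
    all(layer_order %in% c("shared", "unshared"))  # the only values used in the defaults
  )
  invisible(TRUE)
}

hidden_dims <- c(150, 120, 60)
check_layer_order(hidden_dims, c("unshared", "unshared", "unshared"))
check_layer_order(hidden_dims, c("shared", "shared", "shared"))
```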

latent_shared

Logical; share latent weights? Default FALSE.

output_shared

Logical; share output weights? Default FALSE.

batch_size

Integer batch size. Default 4000.

return_model

Logical; if TRUE, returns Python model. Default TRUE.

epochs

Integer initial training epochs. Default 500.

initial_lr

Numeric initial learning rate. Default 0.01.

decay_factor

Numeric learning rate decay. Default 0.999.
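
How initial_lr and decay_factor combine is not spelled out here. Assuming an exponential schedule in which the learning rate is multiplied by decay_factor once per epoch (an assumption about the underlying training loop), the rate over the default run can be sketched as:

```r
initial_lr   <- 0.01
decay_factor <- 0.999
epochs       <- 500

# Learning rate at each epoch under the assumed schedule (epoch 0 = initial_lr)
epoch <- 0:epochs
lr <- initial_lr * decay_factor^epoch

# Fraction of the initial rate that remains after the full run
final_fraction <- lr[length(lr)] / initial_lr
round(final_fraction, 3)
```

Under this assumption the default settings leave roughly 61% of the initial learning rate after 500 epochs, so the decay is gentle.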

beta

Numeric KL weight. Default 0.001.

max_loops

Integer max refit loops. Default 100.

patience

Integer early-stopping patience. Default 2.

epochs_per_loop

Integer epochs per refit loop. Default NULL (uses epochs).

initial_lr_refit

Numeric LR for refit loops. Default NULL.

decay_factor_refit

Numeric decay for refit loops. Default NULL.

beta_refit

Numeric KL weight for refit loops. Default NULL.

verbose

Logical; if TRUE, prints progress. Default FALSE.

return_silhouettes

Logical; if TRUE, returns silhouette scores. Default FALSE.

Value

A list with elements:

  • imputed: data.frame of imputed values (with index_col re-attached).

  • model: Python VAE object (if return_model = TRUE).

  • silhouettes: numeric vector (if return_silhouettes = TRUE).
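
As a usage sketch, the call below shows how the three list elements might be accessed. The toy data and the choice of n_clusters = 2 and epochs = 50 are illustrative only. The call is guarded so the snippet still runs when the package providing run_cissvae (and its Python backend) is not loaded.

```r
set.seed(42)

# Toy data with an ID column and a few missing values
df <- data.frame(
  id = paste0("s", 1:20),
  x1 = rnorm(20),
  x2 = rnorm(20),
  x3 = rnorm(20)
)
df$x2[c(3, 7, 11)] <- NA
df$x3[c(5, 15)] <- NA

# Only call the wrapper if it is available in the current session
if (exists("run_cissvae", mode = "function")) {
  res <- run_cissvae(
    data = df,
    index_col = "id",          # removed before training, re-attached afterwards
    n_clusters = 2,
    epochs = 50,
    return_model = TRUE,
    return_silhouettes = TRUE
  )

  head(res$imputed)    # imputed data.frame
  res$silhouettes      # returned because return_silhouettes = TRUE
  class(res$model)     # Python VAE object, returned because return_model = TRUE
}
```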