Cluster Samples Based on Missingness Proportions — cluster_on_missing

Groups samples with similar patterns of missingness across features using either K-means clustering (when n_clusters is specified) or HDBSCAN (when n_clusters is NULL). This is useful for detecting cohorts with shared missing-data behavior (e.g., site/batch effects).
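
As a rough illustration of the K-means path, the sketch below clusters a small, made-up missingness-proportion matrix with base R's stats::kmeans(). It is not the package's implementation (which delegates to Python); the matrix values, the object names prop and fit, and the choice of nstart are assumptions made for the example.

```r
# Toy missingness-proportion matrix: 6 samples x 3 features, values in [0, 1].
# All numbers are invented for illustration.
prop <- rbind(
  s1 = c(0.00, 0.10, 0.00),
  s2 = c(0.05, 0.00, 0.00),
  s3 = c(0.00, 0.05, 0.10),
  s4 = c(0.90, 0.80, 1.00),
  s5 = c(1.00, 0.90, 0.85),
  s6 = c(0.95, 1.00, 0.90)
)
colnames(prop) <- c("A", "B", "C")

# K-means starts from random centers, so fix the seed for reproducibility
set.seed(1)
fit <- stats::kmeans(prop, centers = 2, nstart = 10)

# Samples with similar missingness profiles share a cluster label
split(rownames(prop), fit$cluster)
```

Here s1 to s3 (mostly complete) and s4 to s6 (mostly missing) fall into separate groups, which is the kind of cohort structure the function is meant to surface.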

Usage

cluster_on_missing_prop(
  prop_matrix,
  n_clusters = NULL,
  seed = NULL,
  min_cluster_size = NULL,
  cluster_selection_epsilon = 0.25,
  metric = "euclidean",
  scale_features = FALSE,
  handle_noise = "keep"
)

Arguments

prop_matrix: Matrix or data frame with samples in rows and features in columns; entries are missingness proportions in [0, 1]. Can be created with create_missingness_prop_matrix().
n_clusters: Integer; number of clusters for K-means. If NULL, HDBSCAN is used instead (default: NULL).
seed: Integer; random seed for K-means reproducibility (default: NULL).
min_cluster_size: Integer; HDBSCAN minimum cluster size. If NULL, the Python-side default is used (typically a function of the number of samples) (default: NULL).
cluster_selection_epsilon: Numeric; HDBSCAN cluster selection threshold (default: 0.25).
metric: Character; distance metric "euclidean" or "cosine" (default: "euclidean").
scale_features: Logical; whether to standardize feature columns before clustering samples (default: FALSE).
handle_noise: Character; how to handle HDBSCAN noise points (-1): "keep" (each noise sample gets its own new cluster ID), "separate" (all noise samples share one new ID), or "merge" (noise samples assigned to largest existing cluster) (default: "keep").
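
The three handle_noise modes described above can be sketched in base R as follows. relabel_noise() is a hypothetical helper written for this illustration, not a function exported by the package, and the choice to number new IDs starting just after the largest existing label is an assumption.

```r
# Hypothetical helper (not part of the package): relabel HDBSCAN-style
# noise points (-1) according to the documented handle_noise modes.
relabel_noise <- function(labels, handle_noise = c("keep", "separate", "merge")) {
  handle_noise <- match.arg(handle_noise)
  noise <- labels == -1L
  if (!any(noise)) return(labels)

  real <- labels[!noise]
  # Assumption: new IDs start right after the largest existing label
  next_id <- if (length(real) > 0) max(real) + 1L else 0L

  if (handle_noise == "keep") {
    # Each noise sample gets its own new cluster ID
    labels[noise] <- next_id + seq_len(sum(noise)) - 1L
  } else if (handle_noise == "separate") {
    # All noise samples share a single new ID
    labels[noise] <- next_id
  } else {
    # "merge": assign noise to the largest existing cluster
    if (length(real) == 0) stop("no clusters to merge noise into")
    sizes <- table(real)
    labels[noise] <- as.integer(names(sizes)[which.max(sizes)])
  }
  labels
}

labels <- c(0L, 0L, 0L, 1L, 1L, -1L, -1L)
relabel_noise(labels, "keep")      # 0 0 0 1 1 2 3
relabel_noise(labels, "separate")  # 0 0 0 1 1 2 2
relabel_noise(labels, "merge")     # 0 0 0 1 1 0 0
```

In every mode the result contains no negative labels, which matches the description of clusters_positive in the return value.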

Value

A list with:

clusters: Integer vector of cluster assignments per sample (may include -1 for HDBSCAN noise).
clusters_positive: Integer vector with all labels non-negative after applying handle_noise.
silhouette_score: Numeric silhouette score, or NULL if not computable.
sample_names: Character vector of sample names corresponding to clusters.
n_samples: Integer; number of samples (rows).
n_clusters_found: Integer; number of clusters found (excluding noise).
n_clusters_final: Integer; final number of clusters after noise handling.
n_noise: Integer; number of samples assigned to noise (HDBSCAN only).
handle_noise: The noise handling mode used.
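
To show how the count fields relate to the label vectors, the sketch below derives them by hand from a made-up clusters vector. This reflects one reading of the field descriptions above rather than the package's code; it assumes the "separate" noise mode and that the shared noise ID is the largest existing label plus one.

```r
# Made-up raw labels; -1 marks an HDBSCAN noise sample
clusters <- c(0L, 0L, 1L, 1L, 1L, -1L)
is_noise <- clusters == -1L

n_samples        <- length(clusters)
n_noise          <- sum(is_noise)
n_clusters_found <- length(unique(clusters[!is_noise]))  # noise excluded

# Assumption: handle_noise = "separate", so all noise shares one new ID
clusters_positive <- clusters
clusters_positive[is_noise] <- max(clusters[!is_noise]) + 1L
n_clusters_final <- length(unique(clusters_positive))

c(n_samples = n_samples, n_noise = n_noise,
  n_clusters_found = n_clusters_found, n_clusters_final = n_clusters_final)
```

With one noise sample and two real clusters, n_clusters_found is 2 while n_clusters_final is 3, because the noise group counts as a cluster only after noise handling.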

Examples

set.seed(123)
dat <- data.frame(
  sample_id = paste0("s", 1:12),
  # Two features measured at 3 timepoints each → proportions by feature per sample
  A_1 = c(NA, rnorm(11)), A_2 = c(NA, rnorm(11)), A_3 = rnorm(12),
  B_1 = rnorm(12),        B_2 = c(rnorm(10), NA, NA), B_3 = rnorm(12)
)
pm <- create_missingness_prop_matrix(dat, index_col = "sample_id",
                                     repeat_feature_names = c("A","B"))
res <- cluster_on_missing_prop(pm, n_clusters = 2, metric = "cosine",
                               scale_features = TRUE)
table(res$clusters_positive)
res$silhouette_score