Groups samples with similar patterns of missingness across features using either K-means clustering (when n_clusters is specified) or HDBSCAN (when n_clusters is NULL). This is useful for detecting cohorts with shared missing-data behavior (e.g., site/batch effects).
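As a rough illustration of the K-means path only, the sketch below clusters a small, made-up proportion matrix with base R's stats::kmeans(). It is a stand-in for intuition, not the function's own implementation; the toy matrix and object names are hypothetical.

```r
# Toy missingness-proportion matrix: 6 samples x 2 features (hypothetical data).
pm <- rbind(
  s1 = c(A = 0.00, B = 0.00),
  s2 = c(A = 0.00, B = 0.33),
  s3 = c(A = 0.00, B = 0.00),
  s4 = c(A = 0.67, B = 0.00),
  s5 = c(A = 1.00, B = 0.00),
  s6 = c(A = 0.67, B = 0.33)
)

set.seed(42)  # K-means starts from random centers, so fix the seed
km <- stats::kmeans(pm, centers = 2, nstart = 10)

# Samples with heavy missingness in feature A end up grouped together.
split(rownames(pm), km$cluster)
```

Here the samples that are mostly missing in A (s4 to s6) separate from those that are mostly complete (s1 to s3), which is the kind of cohort structure the function is meant to surface.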

Usage

cluster_on_missing_prop(
  prop_matrix,
  n_clusters = NULL,
  seed = NULL,
  min_cluster_size = NULL,
  cluster_selection_epsilon = 0.25,
  metric = "euclidean",
  scale_features = FALSE,
  handle_noise = "keep"
)

Arguments

prop_matrix

Matrix or data frame in which rows are samples, columns are features, and entries are missingness proportions in [0, 1]. Can be created with create_missingness_prop_matrix().
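To make the expected input concrete, the following base R sketch builds such a matrix by hand from hypothetical wide data with columns named feature_timepoint. It only illustrates what the entries mean and is not how create_missingness_prop_matrix() is implemented.

```r
# Hypothetical wide data: features A and B, three timepoints each.
dat <- data.frame(
  sample_id = c("s1", "s2", "s3"),
  A_1 = c(NA, 1.2, 0.4), A_2 = c(NA, 0.7, 1.1), A_3 = c(0.3, 0.9, NA),
  B_1 = c(2.1, 1.8, 2.4), B_2 = c(2.0, NA, 2.2), B_3 = c(1.9, 1.7, 2.5)
)

features <- c("A", "B")

# For each feature, the fraction of its timepoint columns that are NA per sample.
pm_manual <- sapply(features, function(f) {
  cols <- grep(paste0("^", f, "_"), names(dat), value = TRUE)
  rowMeans(is.na(dat[, cols, drop = FALSE]))
})
rownames(pm_manual) <- dat$sample_id

pm_manual
```

Sample s1 is missing two of its three A measurements, so its entry for A is 2/3; a sample with no missing values for a feature gets 0.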

n_clusters

Integer; number of clusters for K-means. If NULL, HDBSCAN is used instead (default: NULL).

seed

Integer; random seed for K-means reproducibility (default: NULL).

min_cluster_size

Integer; HDBSCAN minimum cluster size. If NULL, the Python default is used, which is typically a function of the number of samples (default: NULL).

cluster_selection_epsilon

Numeric; HDBSCAN cluster selection threshold. Clusters separated by less than this distance are merged rather than split further (default: 0.25).

metric

Character; distance metric, either "euclidean" or "cosine" (default: "euclidean").
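The two metrics can disagree noticeably on proportion data. The base R sketch below compares them on two hypothetical samples that have the same missingness pattern at different overall levels; it is an illustration of the metrics themselves, not of the function's internals.

```r
# Two hypothetical samples: b has the same pattern as a, at twice the level.
a <- c(0.2, 0.0, 0.4)
b <- c(0.4, 0.0, 0.8)

# Euclidean distance reflects the difference in magnitude.
euclidean <- sqrt(sum((a - b)^2))

# Cosine distance depends only on direction, so proportional rows are at distance 0.
cosine_dist <- 1 - sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))

c(euclidean = euclidean, cosine = cosine_dist)
```

Note that cosine distance is undefined for an all-zero row, so samples with no missingness in any feature may need special care when metric = "cosine".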

scale_features

Logical; whether to standardize feature columns before clustering samples (default: FALSE).
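What standardizing columns does can be sketched with base R's scale(). This is an assumption-laden illustration: the source does not say which variance estimator the function uses or how it treats zero-variance features, so the sketch simply drops constant columns before scaling.

```r
# Hypothetical proportion matrix; feature C is never missing (constant column).
pm <- cbind(
  A = c(0, 0, 0.67, 1),
  B = c(0, 0.33, 0, 0.33),
  C = c(0, 0, 0, 0)
)

# Constant columns would become NaN under scale(), so drop them first.
keep <- apply(pm, 2, sd) > 0
pm_scaled <- scale(pm[, keep, drop = FALSE])

# Each remaining column now has mean 0 and sample standard deviation 1.
round(colMeans(pm_scaled), 10)
apply(pm_scaled, 2, sd)
```

Scaling puts rarely missing and frequently missing features on the same footing, which matters mostly for the Euclidean metric.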

handle_noise

Character; how to handle HDBSCAN noise points (label -1): "keep" (each noise sample gets its own new cluster ID), "separate" (all noise samples share one new ID), or "merge" (noise samples are assigned to the largest existing cluster) (default: "keep").
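The three handle_noise modes can be sketched as a relabeling step over a label vector. The helper relabel_noise() below is hypothetical and written only from the descriptions above; the exact IDs the function assigns to noise samples are an assumption (here, new IDs continue from the largest existing label).

```r
# Hypothetical helper: relabel HDBSCAN noise (-1) according to a mode.
relabel_noise <- function(labels, mode = c("keep", "separate", "merge")) {
  mode <- match.arg(mode)
  noise <- labels == -1
  if (!any(noise)) return(labels)

  real <- labels[!noise]
  next_id <- if (length(real)) max(real) + 1L else 0L  # assumed numbering
  out <- labels

  if (mode == "keep") {
    # Each noise sample becomes its own singleton cluster.
    out[noise] <- next_id + seq_len(sum(noise)) - 1L
  } else if (mode == "separate") {
    # All noise samples share a single new cluster.
    out[noise] <- next_id
  } else {
    # Noise samples join the largest existing cluster.
    if (!length(real)) stop("no non-noise cluster to merge into")
    out[noise] <- as.integer(names(which.max(table(real))))
  }
  out
}

labels <- c(0L, 0L, 0L, 1L, 1L, -1L, -1L)
relabel_noise(labels, "keep")      # 0 0 0 1 1 2 3
relabel_noise(labels, "separate")  # 0 0 0 1 1 2 2
relabel_noise(labels, "merge")     # 0 0 0 1 1 0 0
```

In every mode the result contains no negative labels, which matches the description of clusters_positive in the returned list.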

Value

A list with:

  • clusters: Integer vector of cluster assignments per sample (may include -1 for HDBSCAN noise).

  • clusters_positive: Integer vector with all labels non-negative after applying handle_noise.

  • silhouette_score: Numeric silhouette score, or NULL if not computable.

  • sample_names: Character vector of sample names corresponding to clusters.

  • n_samples: Integer; number of samples (rows).

  • n_clusters_found: Integer; number of clusters found (excluding noise).

  • n_clusters_final: Integer; final number of clusters after noise handling.

  • n_noise: Integer; number of samples assigned to noise (HDBSCAN only).

  • handle_noise: The noise handling mode used.
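For readers unfamiliar with the silhouette score, the base R sketch below computes a mean silhouette over Euclidean distances. The helper silhouette_mean() is hypothetical, and returning NULL when fewer than two clusters exist is an assumption about why the score may not be computable; the function's own computation may differ (for example in metric or in how noise is treated).

```r
# Hypothetical helper: mean silhouette width from a numeric matrix and labels.
silhouette_mean <- function(x, labels) {
  ids <- unique(labels)
  if (length(ids) < 2) return(NULL)  # undefined with a single cluster
  d <- as.matrix(dist(x))            # pairwise Euclidean distances

  s <- vapply(seq_len(nrow(d)), function(i) {
    own <- labels == labels[i]
    if (sum(own) == 1) return(0)     # convention for singleton clusters
    # a: mean distance to the other members of the sample's own cluster
    a <- sum(d[i, own]) / (sum(own) - 1)
    # b: smallest mean distance to any other cluster
    b <- min(vapply(setdiff(ids, labels[i]),
                    function(k) mean(d[i, labels == k]), numeric(1)))
    (b - a) / max(a, b)
  }, numeric(1))

  mean(s)
}

# Two tight, well-separated groups give a score close to 1.
x <- rbind(c(0, 0), c(0, 0.1), c(1, 1), c(1, 0.9))
silhouette_mean(x, c(1, 1, 2, 2))
silhouette_mean(x, c(1, 1, 1, 1))  # NULL: only one cluster
```

Scores near 1 indicate compact, well-separated clusters; scores near 0 or below suggest overlapping or poorly assigned samples.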

Examples

set.seed(123)
dat <- data.frame(
  sample_id = paste0("s", 1:12),
  # Two features measured at 3 timepoints each → proportions by feature per sample
  A_1 = c(NA, rnorm(11)), A_2 = c(NA, rnorm(11)), A_3 = rnorm(12),
  B_1 = rnorm(12),        B_2 = c(rnorm(10), NA, NA), B_3 = rnorm(12)
)
pm <- create_missingness_prop_matrix(dat, index_col = "sample_id",
                                     repeat_feature_names = c("A","B"))
res <- cluster_on_missing_prop(pm, n_clusters = 2, metric = "cosine", scale_features = TRUE)
table(res$clusters_positive)
res$silhouette_score