Groups samples with similar patterns of missingness across features, using K-means clustering when n_clusters is specified and Leiden community detection when n_clusters is NULL. This is useful for detecting cohorts with shared missing-data behavior (e.g., site or batch effects).
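To make "patterns of missingness across features" concrete, the following is a minimal base-R sketch, not the package's implementation. It assumes a hypothetical toy data frame with two features (A, B) measured at three timepoints each, computes per-sample missingness proportions by hand, and uses stats::kmeans as a stand-in for the K-means path.

```r
# Hypothetical toy data: 6 samples, features A and B at 3 timepoints each.
dat <- data.frame(
  A_1 = c(NA, NA, NA, 1, 2, 3),
  A_2 = c(NA, NA, 1, 2, 3, 4),
  A_3 = c(NA, 1, 2, 3, 4, 5),
  B_1 = c(1, 2, 3, NA, NA, NA),
  B_2 = c(1, 2, 3, 4, NA, NA),
  B_3 = c(1, 2, 3, 4, 5, NA),
  row.names = paste0("s", 1:6)
)

# Per-sample proportion of missing timepoints for each feature (in [0, 1]).
prop_by_hand <- sapply(c("A", "B"), function(f) {
  cols <- grep(paste0("^", f, "_"), names(dat))
  rowMeans(is.na(dat[, cols, drop = FALSE]))
})

# Stand-in for the K-means path: base-R kmeans on the proportion matrix.
set.seed(1)
km <- stats::kmeans(prop_by_hand, centers = 2, nstart = 10)

# Sample IDs grouped by assigned cluster.
split(rownames(prop_by_hand), km$cluster)
```

Samples that are mostly missing feature A fall into one group and samples that are mostly missing feature B into the other, which is the kind of cohort structure the function is meant to surface.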

Usage

cluster_on_missing_prop(
  prop_matrix,
  n_clusters = NULL,
  seed = NULL,
  k_neighbors = NULL,
  leiden_resolution = 0.25,
  use_snn = TRUE,
  leiden_objective = "CPM",
  metric = "euclidean",
  scale_features = FALSE
)

Arguments

prop_matrix

Matrix or data frame with samples in rows and features in columns; entries are missingness proportions in [0, 1]. Can be created with create_missingness_prop_matrix().

n_clusters

Integer; number of clusters for K-means. If NULL, Leiden is used instead (default: NULL).

seed

Integer; random seed for K-means reproducibility (default: NULL).

k_neighbors

Integer; number of nearest neighbors used to build the sample graph for Leiden. If NULL, the Python default is used (default: NULL).

leiden_resolution

Numeric; Leiden resolution parameter. Higher values tend to produce more, smaller clusters (default: 0.25).

use_snn

Logical; whether to use shared nearest neighbors (default: TRUE).

leiden_objective

Character; Leiden optimization objective (default: "CPM").

metric

Character; distance metric. Options include: "euclidean", "cosine" (default: "euclidean").

scale_features

Logical; whether to standardize feature columns before clustering samples (default: FALSE).
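The effect of metric and scale_features can be illustrated in base R. This is a sketch under assumptions: pm_toy is a hypothetical matrix, and scale_features = TRUE is assumed to correspond to column-wise z-scoring; the Python backend may differ in detail.

```r
# Hypothetical proportion matrix: 4 samples x 3 features, values in [0, 1].
pm_toy <- matrix(
  c(0.0, 0.5, 1.0,
    0.0, 0.4, 0.8,
    1.0, 0.5, 0.0,
    0.9, 0.6, 0.1),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("s", 1:4), c("A", "B", "C"))
)

# Assumed equivalent of scale_features = TRUE: z-score each feature column.
# A constant column has zero variance and would yield NaN here.
pm_scaled <- scale(pm_toy)

# Euclidean distance between samples (rows).
d_euclid <- as.matrix(dist(pm_toy, method = "euclidean"))

# Cosine distance: 1 minus the cosine similarity of the row vectors.
row_norms <- sqrt(rowSums(pm_toy^2))
d_cosine <- 1 - (pm_toy %*% t(pm_toy)) / (row_norms %o% row_norms)
```

Samples s1 and s2 have proportional missingness profiles, so their cosine distance is essentially zero while their Euclidean distance is not. Cosine is therefore the more natural choice when the shape of the missingness profile matters more than its overall level.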

Value

A list with:

  • clusters: Integer vector of cluster assignments per sample.

  • silhouette_score: Numeric silhouette score, or NULL if not computable.
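For readers unfamiliar with the silhouette score, here is a minimal base-R sketch of the mean silhouette width. The helper mean_silhouette is hypothetical and is not part of the package; it returns NULL for a single cluster, on the assumption that this mirrors the "not computable" case above.

```r
# Mean silhouette width from a distance object and integer cluster labels.
mean_silhouette <- function(d, clusters) {
  d <- as.matrix(d)
  labels <- unique(clusters)
  if (length(labels) < 2) return(NULL)     # undefined for a single cluster
  s <- vapply(seq_along(clusters), function(i) {
    own <- clusters == clusters[i]
    if (sum(own) == 1) return(0)           # convention for singleton clusters
    # Mean distance to the other members of the same cluster (d[i, i] is 0).
    a <- sum(d[i, own]) / (sum(own) - 1)
    # Smallest mean distance to the members of any other cluster.
    b <- min(vapply(setdiff(labels, clusters[i]),
                    function(k) mean(d[i, clusters == k]), numeric(1)))
    (b - a) / max(a, b)
  }, numeric(1))
  mean(s)
}

# Two tight, well-separated groups give a score close to 1.
x <- rbind(c(0, 0), c(0, 0.1), c(1, 1), c(1, 0.9))
score <- mean_silhouette(dist(x), c(1L, 1L, 2L, 2L))
```

Scores near 1 indicate compact, well-separated clusters; scores near 0 or below suggest overlapping or poorly assigned samples.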

Examples

set.seed(123)

dat <- data.frame(
  sample_id = paste0("s", 1:12),
  # Two features measured at 3 timepoints each -> proportions by feature
  A_1 = c(NA, rnorm(11)),
  A_2 = c(NA, rnorm(11)),
  A_3 = rnorm(12),
  B_1 = rnorm(12),
  B_2 = c(rnorm(10), NA, NA),
  B_3 = rnorm(12)
)

pm <- create_missingness_prop_matrix(
  dat,
  index_col = "sample_id",
  repeat_feature_names = c("A", "B")
)

## cluster_on_missing_prop requires a working Python environment via reticulate
## Examples are wrapped in try() to avoid failures on CRAN check systems
try({
  res <- cluster_on_missing_prop(
    pm,
    n_clusters = 2,
    metric = "cosine",
    scale_features = TRUE
  )

  table(res$clusters)
  res$silhouette_score
})
#> Error in py_module_import(module, convert = convert) : 
#>   ModuleNotFoundError: No module named 'ciss_vae'
#> Run `reticulate::py_last_error()` for details.