Groups samples with similar patterns of missingness across features using either K-means clustering (when n_clusters is specified) or HDBSCAN (when n_clusters is NULL).
This is useful for detecting cohorts with shared missing-data behavior (e.g., site/batch effects).
Usage
cluster_on_missing_prop(
  prop_matrix,
  n_clusters = NULL,
  seed = NULL,
  min_cluster_size = NULL,
  cluster_selection_epsilon = 0.25,
  metric = "euclidean",
  scale_features = FALSE,
  handle_noise = "keep"
)
Arguments
- prop_matrix
  Matrix or data frame where rows are samples and columns are features; entries are missingness proportions in [0,1]. Can be created with create_missingness_prop_matrix().
- n_clusters
  Integer; number of clusters for KMeans. If NULL, uses HDBSCAN (default: NULL).
- seed
  Integer; random seed for KMeans reproducibility (default: NULL).
- min_cluster_size
  Integer; HDBSCAN minimum cluster size. If NULL, Python default is used (typically a function of the number of samples) (default: NULL).
- cluster_selection_epsilon
  Numeric; HDBSCAN cluster selection threshold (default: 0.25).
- metric
  Character; distance metric, "euclidean" or "cosine" (default: "euclidean").
- scale_features
  Logical; whether to standardize feature columns before clustering samples (default: FALSE).
- handle_noise
  Character; how to handle HDBSCAN noise points (-1): "keep" (each noise sample gets its own new cluster ID), "separate" (all noise samples share one new ID), or "merge" (noise samples assigned to largest existing cluster) (default: "keep").
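The three handle_noise modes can be illustrated with a small base-R sketch. relabel_noise() is a hypothetical helper, not a package function; starting new IDs at max(label) + 1 and breaking ties toward the lowest cluster ID are assumptions, so the IDs the package actually assigns may differ.

```r
# Hypothetical helper (not part of the package): relabel HDBSCAN noise (-1)
# following the three documented modes.
relabel_noise <- function(labels, handle_noise = c("keep", "separate", "merge")) {
  handle_noise <- match.arg(handle_noise)
  noise <- labels == -1L
  if (!any(noise)) return(labels)

  real <- labels[!noise]
  # Assumption: new cluster IDs continue after the largest existing ID
  next_id <- if (length(real) > 0) max(real) + 1L else 0L

  if (handle_noise == "keep") {
    # every noise sample becomes its own singleton cluster
    labels[noise] <- next_id + seq_len(sum(noise)) - 1L
  } else if (handle_noise == "separate") {
    # all noise samples share one new cluster
    labels[noise] <- next_id
  } else if (length(real) > 0) {
    # "merge": noise joins the largest existing cluster (ties go to the lowest ID)
    counts <- table(real)
    labels[noise] <- as.integer(names(counts)[which.max(counts)])
  } else {
    # Assumption: with no real cluster at all, everything becomes cluster 0
    labels[noise] <- 0L
  }
  labels
}

labels <- c(0L, 0L, 0L, 1L, 1L, -1L, -1L)
relabel_noise(labels, "keep")      # 0 0 0 1 1 2 3
relabel_noise(labels, "separate")  # 0 0 0 1 1 2 2
relabel_noise(labels, "merge")     # 0 0 0 1 1 0 0
```

Under "keep", every noise sample adds one to the final cluster count, which can inflate the number of clusters when many samples are labelled as noise.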
Value
A list with:
- clusters: Integer vector of cluster assignments per sample (may include -1 for HDBSCAN noise).
- clusters_positive: Integer vector with all labels non-negative after applying handle_noise.
- silhouette_score: Numeric silhouette score, or NULL if not computable.
- sample_names: Character vector of sample names corresponding to clusters.
- n_samples: Integer; number of samples (rows).
- n_clusters_found: Integer; number of clusters found (excluding noise).
- n_clusters_final: Integer; final number of clusters after noise handling.
- n_noise: Integer; number of samples assigned to noise (HDBSCAN only).
- handle_noise: The noise handling mode used.
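For reference, a silhouette score is the mean over samples of (b - a) / max(a, b), where a is the mean distance from a sample to the other members of its own cluster and b is the mean distance to the nearest other cluster. The base-R sketch below shows that calculation; mean_silhouette() is a hypothetical name, it uses Euclidean distance only, and returning NULL when there are fewer than two clusters (or as many clusters as samples) is an assumption about what "not computable" means here.

```r
# Hypothetical helper (not part of the package): mean silhouette width.
mean_silhouette <- function(x, labels) {
  ids <- unique(labels)
  # Assumption: the score is undefined unless 2 <= n_clusters <= n_samples - 1
  if (length(ids) < 2 || length(ids) >= nrow(x)) return(NULL)

  d <- as.matrix(dist(x))  # pairwise Euclidean distances between samples
  widths <- vapply(seq_len(nrow(x)), function(i) {
    same <- labels == labels[i]
    if (sum(same) == 1) return(0)  # singleton clusters score 0 by convention
    a <- sum(d[i, same]) / (sum(same) - 1)  # mean distance within own cluster
    b <- min(vapply(setdiff(ids, labels[i]),
                    function(k) mean(d[i, labels == k]), numeric(1)))
    (b - a) / max(a, b)
  }, numeric(1))
  mean(widths)
}

# Toy missingness proportions for four samples and two features
pm_toy <- matrix(c(0.6, 0.7, 0.0, 0.1,
                   0.0, 0.1, 0.3, 0.4),
                 ncol = 2, dimnames = list(paste0("s", 1:4), c("A", "B")))
score <- mean_silhouette(pm_toy, c(0, 0, 1, 1))  # well separated, close to 1
mean_silhouette(pm_toy, c(0, 0, 0, 0))           # NULL: only one cluster
```

Values near 1 indicate compact, well-separated clusters; values near 0 or below suggest overlapping or misassigned samples.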
Examples
set.seed(123)
dat <- data.frame(
  sample_id = paste0("s", 1:12),
  # Two features measured at 3 timepoints each → proportions by feature per sample
  A_1 = c(NA, rnorm(11)), A_2 = c(NA, rnorm(11)), A_3 = rnorm(12),
  B_1 = rnorm(12), B_2 = c(rnorm(10), NA, NA), B_3 = rnorm(12)
)
pm <- create_missingness_prop_matrix(dat, index_col = "sample_id",
                                     repeat_feature_names = c("A", "B"))
res <- cluster_on_missing_prop(pm, n_clusters = 2, metric = "cosine", scale_features = TRUE)
#> Error in py_get_attr(x, name, FALSE): AttributeError: module 'ciss_vae.utils.run_cissvae' has no attribute 'cluster_on_missing_prop'
#> Run `reticulate::py_last_error()` for details.
table(res$clusters_positive)
#> Error: object 'res' not found
res$silhouette_score
#> Error: object 'res' not found
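The call above depends on the Python backend being available. As a rough stand-in that runs in base R alone, the sketch below rebuilds a proportion matrix from the same toy data and clusters it with stats::kmeans, using the default settings (Euclidean distance, no feature scaling). It is not the package's implementation: prop_missing() and pm_sketch are hypothetical names, the per-feature proportion is assumed to match what create_missingness_prop_matrix() returns, and stats::kmeans may not give the same labels as the Python KMeans backend.

```r
set.seed(123)
dat <- data.frame(
  sample_id = paste0("s", 1:12),
  A_1 = c(NA, rnorm(11)), A_2 = c(NA, rnorm(11)), A_3 = rnorm(12),
  B_1 = rnorm(12), B_2 = c(rnorm(10), NA, NA), B_3 = rnorm(12)
)

# Assumption: the proportion for a feature is the fraction of its
# timepoint columns (e.g. A_1, A_2, A_3) that are NA for each sample
prop_missing <- function(prefix) {
  cols <- grep(paste0("^", prefix, "_"), names(dat))
  rowMeans(is.na(dat[, cols]))
}
pm_sketch <- cbind(A = prop_missing("A"), B = prop_missing("B"))
rownames(pm_sketch) <- dat$sample_id

# stats::kmeans stands in for the Python KMeans backend
set.seed(42)
km <- kmeans(pm_sketch, centers = 2, nstart = 10)
table(km$cluster)
```

In this toy data only s1 is missing values in feature A and only s11 and s12 in feature B, so pm_sketch contains just three distinct rows.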