
Clusters the rows of a data.frame or matrix on their pattern of missing values and returns the cluster labels together with a silhouette score.
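To illustrate what clustering on the pattern of missingness means, here is a minimal sketch in base R. It is not this function's implementation: the toy data are made up, and base-R kmeans() stands in for whichever backend is actually used.

```r
# Toy data with two distinct missingness patterns (hypothetical values).
toy <- data.frame(
  a = c(1, 2, NA, NA, 5, NA),
  b = c(NA, NA, 3, 4, NA, 6),
  c = c(1, 2, 3, 4, 5, 6)
)

# Indicator matrix: 1 where a value is missing, 0 where it is observed.
miss <- is.na(toy) * 1

# Cluster rows on the indicator matrix. kmeans() is an assumption made
# for illustration, not necessarily what cluster_on_missing() calls.
set.seed(1)
fit <- kmeans(miss, centers = 2)
labels <- fit$cluster
```

Rows 1, 2 and 5 share one pattern (b missing) and rows 3, 4 and 6 share the other (a missing), so each group ends up under a single label.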

Usage

cluster_on_missing(
  data,
  cols_ignore = NULL,
  n_clusters = NULL,
  seed = NULL,
  min_cluster_size = NULL,
  cluster_selection_epsilon = 0.25
)

Arguments

data

A data.frame or matrix (samples × features); may contain NA.

cols_ignore

Character vector of column names to ignore when clustering.
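As a rough sketch of the effect of cols_ignore, the snippet below drops the named columns before the missingness indicators are built. The data and the variable names (toy, keep, miss) are hypothetical, and the function's internals may differ.

```r
# Toy data: `id` is an identifier that should not influence clustering.
toy <- data.frame(
  id = 1:4,
  a = c(1, NA, 3, NA),
  b = c(NA, 2, NA, 4)
)

cols_ignore <- c("id")

# Keep every column that is not listed in cols_ignore.
keep <- setdiff(colnames(toy), cols_ignore)

# Missingness indicators for the remaining columns only.
miss <- is.na(toy[, keep, drop = FALSE]) * 1
```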

n_clusters

Integer; if provided, KMeans is run with this many clusters. If NULL, HDBSCAN is used instead.

seed

Integer; random seed, used for KMeans and for reproducibility in HDBSCAN.

min_cluster_size

Integer; minimum cluster size for HDBSCAN. If NULL, defaults to nrow(data) %/% 25.
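The default is an integer division, so it rounds down. A small worked example follows; the helper name default_min_cluster_size is hypothetical and only wraps the documented expression.

```r
# Hypothetical helper reproducing the documented default: nrow(data) %/% 25.
default_min_cluster_size <- function(n_rows) n_rows %/% 25

sizes <- c(
  default_min_cluster_size(1000),  # 40
  default_min_cluster_size(60),    # 2
  default_min_cluster_size(20)     # 0
)
```

For inputs with fewer than 25 rows the expression evaluates to 0, so it may be worth passing min_cluster_size explicitly for small data sets.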

cluster_selection_epsilon

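Numeric; HDBSCAN's cluster selection epsilon, a distance threshold below which clusters are merged rather than split further. Only relevant when n_clusters is NULL.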

Value

A list with components:

  • clusters — integer vector of cluster labels

  • silhouette — numeric silhouette score, or NA if not computable
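For readers unfamiliar with the silhouette score, the following base-R sketch computes a mean silhouette width from a distance matrix and a label vector, returning NA when fewer than two clusters are present. The function mean_silhouette is hypothetical and written only for illustration; how this package computes the score, and when exactly it returns NA, is not stated here.

```r
# Mean silhouette width from pairwise distances and integer labels.
# Returns NA when there are fewer than two clusters.
mean_silhouette <- function(d, labels) {
  d <- as.matrix(d)
  groups <- unique(labels)
  if (length(groups) < 2) return(NA_real_)
  widths <- vapply(seq_along(labels), function(i) {
    own <- labels == labels[i]
    if (sum(own) == 1) return(0)  # singleton clusters get width 0 by convention
    # a: mean distance to the other members of the own cluster (d[i, i] is 0).
    a <- sum(d[i, own]) / (sum(own) - 1)
    # b: smallest mean distance to the members of any other cluster.
    b <- min(vapply(setdiff(groups, labels[i]),
                    function(g) mean(d[i, labels == g]), numeric(1)))
    if (max(a, b) == 0) 0 else (b - a) / max(a, b)
  }, numeric(1))
  mean(widths)
}

# Two perfectly separated missingness patterns (toy indicator matrix).
miss <- rbind(c(0, 1), c(0, 1), c(1, 0), c(1, 0))
two_clusters <- mean_silhouette(dist(miss), c(1L, 1L, 2L, 2L))
one_cluster <- mean_silhouette(dist(miss), c(1L, 1L, 1L, 1L))
```

In this toy case the two-cluster labelling scores 1, the maximum, because rows within a cluster are identical, and the single-cluster labelling yields NA.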