From compound-ml
Find natural groups in data and label them with plain-language descriptions. Use when the user wants to segment customers, discover topics, group documents, or says 'cluster this', 'find segments', 'group these', or 'what topics are in this data'.
Slash command
/compound-ml:ml-cluster
Find natural groups in a dataset using unsupervised clustering, then label each group with a plain-language description. No ML expertise required — the algorithm selection, parameter tuning, and result interpretation are all handled automatically.
If no objective is provided, discover the most natural groupings in the data.
Check required packages:
python3 -c "import pandas; import sklearn; print('Core packages available')"
If pandas or sklearn is missing, report install instructions (for example, pip install pandas scikit-learn) and stop.
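The same check can be run from Python without importing the heavy packages. This is a minimal sketch, assuming hypothetical helpers named `missing_packages` and `install_hint`; the entries in `PIP_NAMES` are the published distribution names, which differ from the import names for sklearn and umap.

```python
from importlib.util import find_spec

# Import name -> pip distribution name.
PIP_NAMES = {
    "pandas": "pandas",
    "sklearn": "scikit-learn",
    "umap": "umap-learn",
    "hdbscan": "hdbscan",
}

def missing_packages(names):
    """Return the import names from `names` that cannot be found."""
    # find_spec locates a top-level module without importing it.
    return [name for name in names if find_spec(name) is None]

def install_hint(missing):
    """Build a pip command for the missing import names (hypothetical wording)."""
    return "pip install " + " ".join(PIP_NAMES.get(name, name) for name in missing)

missing_required = missing_packages(["pandas", "sklearn"])
if missing_required:
    print("Missing required packages. Try:", install_hint(missing_required))
else:
    print("Core packages available")
```

The helper takes any list of import names, so it could also be called with `["umap", "hdbscan"]`, where a non-empty result would only trigger a fallback rather than a stop.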
Check optional packages (non-blocking):
python3 -c "
try: import umap; print('umap: available')
except ImportError: print('umap: not available (will skip dimensionality reduction)')
try: import hdbscan; print('hdbscan: available')
except ImportError: print('hdbscan: not available (will use sklearn KMeans instead)')
"
Load and profile the data using the same approach as ml-explore Phase 2. Identify:
Write the loaded data profile to checkpoint: .ml-checkpoints/ml-cluster/<timestamp>/profile.json
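One way the checkpoint write might look is sketched below. The timestamp format, the helper names `new_checkpoint_dir` and `write_profile`, and the profile fields in the demo are all assumptions for illustration; the source only fixes the directory layout and the file name `profile.json`.

```python
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path

def new_checkpoint_dir(root=".ml-checkpoints/ml-cluster"):
    """Create and return a fresh <root>/<timestamp>/ directory."""
    # Assumed timestamp format; the source does not specify one.
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    path = Path(root) / stamp
    path.mkdir(parents=True, exist_ok=True)
    return path

def write_profile(checkpoint_dir, profile):
    """Write the data profile as JSON and return the file path."""
    target = Path(checkpoint_dir) / "profile.json"
    # default=str keeps non-JSON values (dates, numpy scalars) from raising.
    target.write_text(json.dumps(profile, indent=2, default=str), encoding="utf-8")
    return target

# Demo against a throwaway root so the sketch does not touch a real project.
demo_dir = new_checkpoint_dir(tempfile.mkdtemp())
demo_file = write_profile(demo_dir, {"rows": 3, "text_columns": ["review"]})
print(demo_file)
```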
Choose representation strategy based on data type:
For text data:
Detect embedding provider using the cascade defined in AGENTS.md:
all-MiniLM-L6-v2 model. Process locally. For datasets over 1000 rows, process in batches of 500 and report progress.
sklearn.feature_extraction.text.TfidfVectorizer with max_features=5000. Report to the user: "Using basic text analysis (TF-IDF). For higher quality results, install sentence-transformers."
For numeric data:
Use the numeric columns directly. Apply sklearn.preprocessing.StandardScaler to normalize features.
For mixed data:
Embed text columns and concatenate with scaled numeric features.
Write representations to checkpoint: .ml-checkpoints/ml-cluster/<timestamp>/representations.npy
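The batching rule for text data (batches of 500 once a dataset exceeds 1000 rows) can be sketched independently of the embedding model. In this sketch `embed_fn` is a hypothetical stand-in for whatever callable turns a list of strings into a list of vectors, and the progress message wording is an assumption.

```python
def iter_batches(items, batch_size=500):
    """Yield (start_index, batch) pairs that cover `items` in order."""
    for start in range(0, len(items), batch_size):
        yield start, items[start:start + batch_size]

def embed_in_batches(texts, embed_fn, batch_size=500, threshold=1000):
    """Embed texts, batching and reporting progress above `threshold` rows."""
    if len(texts) <= threshold:
        return list(embed_fn(texts))
    vectors = []
    for start, batch in iter_batches(texts, batch_size):
        vectors.extend(embed_fn(batch))
        done = min(start + batch_size, len(texts))
        print(f"Embedded {done}/{len(texts)} rows")
    return vectors

# Toy embedding (text length as a 1-d vector) standing in for a real model.
def fake_embed(batch):
    return [[float(len(text))] for text in batch]

demo_vectors = embed_in_batches(["abc"] * 1200, fake_embed)
print(len(demo_vectors))
```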
If UMAP is available AND the representation has more than 50 dimensions:
import umap
reducer = umap.UMAP(n_components=min(50, n_features), random_state=42)
reduced = reducer.fit_transform(representations)
Run this with timeout: 600000 — UMAP can be slow on large datasets.
If UMAP is not available, skip this step. Clustering will work on the raw representations (sklearn handles high-dimensional data).
Write reduced representations to checkpoint: .ml-checkpoints/ml-cluster/<timestamp>/reduced.npy
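The reduction rule can be read as a small decision function. The sketch below uses a hypothetical name, `plan_reduction`, and shows that because reduction only runs above 50 dimensions, `min(50, n_features)` always evaluates to 50 when the step is taken.

```python
def plan_reduction(n_features, umap_available, max_components=50):
    """Return the target dimensionality, or None when reduction is skipped."""
    if not umap_available or n_features <= max_components:
        return None  # cluster on the raw representations
    return min(max_components, n_features)

# all-MiniLM-L6-v2 produces 384-dimensional vectors; TF-IDF here has up to 5000.
for n_features in (12, 50, 384, 5000):
    print(n_features, plan_reduction(n_features, umap_available=True))
```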
Choose algorithm based on available packages:
If HDBSCAN is available (preferred):
import hdbscan
min_size = max(5, len(data) // 50) # heuristic: at least 2% of data per cluster
clusterer = hdbscan.HDBSCAN(min_cluster_size=min_size, min_samples=3)
labels = clusterer.fit_predict(reduced_or_raw)
HDBSCAN automatically determines the number of clusters and identifies noise points (label -1).
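To make the `min_cluster_size` heuristic and the noise label concrete, here is a short sketch with hypothetical helper names. Note that the floor of 5 means datasets under 250 rows get a minimum size larger than 2% of the data.

```python
from collections import Counter

def min_cluster_size(n_rows):
    """The heuristic above: about 2% of the rows, but never fewer than 5."""
    return max(5, n_rows // 50)

def summarize_labels(labels, noise_label=-1):
    """Count clusters and noise points in an HDBSCAN-style label list."""
    counts = Counter(int(label) for label in labels)
    noise = counts.pop(noise_label, 0)  # -1 marks points assigned to no cluster
    return {"n_clusters": len(counts), "noise": noise, "sizes": dict(counts)}

for n_rows in (100, 1000, 10000):
    print(n_rows, min_cluster_size(n_rows))
print(summarize_labels([0, 0, 1, -1, 1, 1]))
```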
If HDBSCAN is not available, use KMeans:
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
# Try k from 2 to min(10, sqrt(n_samples)) and keep the best-scoring run
best_k, best_score, best_labels = 2, -1, None
for k in range(2, min(11, int(len(data)**0.5) + 1)):
    km = KMeans(n_clusters=k, random_state=42, n_init=10)
    candidate = km.fit_predict(reduced_or_raw)
    score = silhouette_score(reduced_or_raw, candidate, sample_size=min(5000, len(data)), random_state=42)
    if score > best_score:
        best_k, best_score, best_labels = k, score, candidate
labels = best_labels  # labels from the best k, not the last k tried
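The search bounds in the loop are easy to misread, so the following sketch lists the values of k that actually get tried for a few dataset sizes. `candidate_ks` is a hypothetical helper that copies the `range` expression from the loop.

```python
def candidate_ks(n_samples):
    """Values of k the KMeans search tries: 2 up to min(10, floor(sqrt(n)))."""
    return list(range(2, min(11, int(n_samples ** 0.5) + 1)))

for n_samples in (3, 4, 25, 100, 10000):
    print(n_samples, candidate_ks(n_samples))
```

With fewer than 4 rows the list is empty and the loop body never runs, so a caller may want to guard against that case before clustering.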
Evaluate clustering quality:
Compute silhouette score. If the score is below 0.1, report:
"The data doesn't show distinct groups. This could mean the items are very similar to each other, or that the natural groupings don't align with the features available. Consider exploring the data with ml-explore first to understand its structure."
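When HDBSCAN has marked noise points, one common choice (an assumption here, not something the source specifies) is to drop them before scoring, since label -1 is not a real cluster. The sketch below prepares rows and labels for `sklearn.metrics.silhouette_score`, which requires at least 2 clusters and fewer clusters than rows, and applies the 0.1 threshold. Both helper names are hypothetical.

```python
def prepare_for_silhouette(rows, labels, noise_label=-1):
    """Drop noise rows; return (rows, labels), or None if scoring is impossible."""
    kept = [(row, label) for row, label in zip(rows, labels) if label != noise_label]
    kept_labels = [label for _, label in kept]
    n_clusters = len(set(kept_labels))
    # silhouette_score needs 2 <= n_clusters <= n_rows - 1.
    if n_clusters < 2 or n_clusters >= len(kept):
        return None
    return [row for row, _ in kept], kept_labels

def shows_distinct_groups(score, threshold=0.1):
    """True when the score clears the 'no distinct groups' threshold."""
    return score >= threshold

demo_rows = [[0.0], [0.1], [5.0], [5.1], [9.9]]
demo_labels = [0, 0, 1, 1, -1]
print(prepare_for_silhouette(demo_rows, demo_labels))
print(shows_distinct_groups(0.05))
```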
Write cluster labels to checkpoint: .ml-checkpoints/ml-cluster/<timestamp>/labels.json
For each cluster:
Then use the LLM to generate plain-language labels and descriptions. For each cluster, provide the sampled items and ask for:
If the user provided an objective, frame the labels in that context (e.g., "Budget-Conscious Shoppers" for customer segmentation, "Technical Support Issues" for ticket categorization).
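Gathering the sampled items for each cluster might look like the sketch below. The function name `sample_per_cluster` and the sample size of 5 are assumptions (the source does not say how many items to sample); the seed of 42 mirrors the `random_state` used elsewhere, and noise points are left out because they are reported separately as uncategorized.

```python
import random
from collections import defaultdict

def sample_per_cluster(items, labels, per_cluster=5, seed=42, noise_label=-1):
    """Return {cluster_label: sampled items}, skipping noise points."""
    groups = defaultdict(list)
    for item, label in zip(items, labels):
        if label != noise_label:
            groups[label].append(item)
    rng = random.Random(seed)  # fixed seed so reruns pick the same examples
    return {
        label: rng.sample(members, min(per_cluster, len(members)))
        for label, members in sorted(groups.items())
    }

demo_items = list("abcdefgh")
demo_item_labels = [0, 0, 0, 1, 1, -1, 0, 1]
demo_samples = sample_per_cluster(demo_items, demo_item_labels, per_cluster=2)
print(demo_samples)
```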
Format the output as:
## Clustering Results: [filename]
**Groups found:** [N]
**Method:** [HDBSCAN/KMeans] [with UMAP reduction / on raw features]
**Representations:** [sentence-transformers / TF-IDF / numeric features]
**Quality:** [Good/Fair/Poor] (silhouette score: [X])
### Group 1: [Label] ([N] items, [X]% of data)
[One-sentence description]
**Typical examples:**
- [Example 1 summary]
- [Example 2 summary]
- [Example 3 summary]
**What makes this group distinctive:**
- [Key characteristic 1]
- [Key characteristic 2]
### Group 2: [Label] ([N] items, [X]% of data)
...
[If HDBSCAN found noise points:]
### Uncategorized ([N] items, [X]% of data)
These items didn't fit clearly into any group. They may be unique or transitional between groups.
## Suggested Next Steps
- [Contextual suggestions based on results]
All descriptions must use plain language. Avoid terms like "centroid", "silhouette score", "dimensionality reduction" in user-facing output. The technical details (method, quality score) are included for reproducibility but explained simply.
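The group headings in the output format carry an item count and a share of the data, both of which follow directly from the label array. A minimal sketch, assuming a hypothetical `group_headings` helper and a `names` mapping from cluster label to the generated plain-language label:

```python
from collections import Counter

def group_headings(labels, names, noise_label=-1):
    """Render '### Group i: ...' headings with counts and percentages."""
    counts = Counter(labels)
    total = len(labels)
    lines = []
    real_clusters = sorted(label for label in counts if label != noise_label)
    for index, label in enumerate(real_clusters, start=1):
        n = counts[label]
        lines.append(f"### Group {index}: {names[label]} ({n} items, {n / total:.0%} of data)")
    if counts.get(noise_label):
        n = counts[noise_label]
        lines.append(f"### Uncategorized ({n} items, {n / total:.0%} of data)")
    return lines

demo_headings = group_headings(
    [0] * 6 + [1] * 3 + [-1],
    {0: "Budget-Conscious Shoppers", 1: "Frequent Returners"},  # made-up labels
)
print("\n".join(demo_headings))
```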
On start, check for recent checkpoints (less than 24 hours old) in .ml-checkpoints/ml-cluster/:
Each phase writes its output to the checkpoint directory before proceeding.
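Finding recent checkpoints could be sketched as follows. This version judges age by directory modification time, which is an assumption (the timestamp in the directory name could be parsed instead), and the helper names are hypothetical. The phase file names are the ones listed in the steps above.

```python
import tempfile
import time
from pathlib import Path

MAX_AGE_SECONDS = 24 * 60 * 60
PHASE_FILES = ["profile.json", "representations.npy", "reduced.npy", "labels.json"]

def recent_checkpoints(root=".ml-checkpoints/ml-cluster", now=None):
    """Return checkpoint directories under 24 hours old, newest first."""
    base = Path(root)
    if not base.is_dir():
        return []
    now = time.time() if now is None else now
    fresh = [
        path for path in base.iterdir()
        if path.is_dir() and now - path.stat().st_mtime < MAX_AGE_SECONDS
    ]
    return sorted(fresh, key=lambda path: path.stat().st_mtime, reverse=True)

def completed_phases(checkpoint_dir):
    """List which phase outputs already exist in a checkpoint directory."""
    return [name for name in PHASE_FILES if (Path(checkpoint_dir) / name).exists()]

# Demo against a throwaway directory.
demo_root = Path(tempfile.mkdtemp())
(demo_root / "run-a").mkdir()
(demo_root / "run-a" / "profile.json").write_text("{}", encoding="utf-8")
demo_recent = recent_checkpoints(demo_root)
print(demo_recent, completed_phases(demo_recent[0]))
```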
ml-explore
references/clustering-guide.md: Plain-language explanation of how clustering works
npx claudepluginhub milasaurus/compound-ml --plugin compound-ml