A single-pass, statistical clustering engine for Graph Wiring to build spectral Graph Laplacians.
kalman_clustering replaces traditional multi-trial K-Means (Lloyd's algorithm) with a Kalman Filter-based adaptive streaming algorithm. Instead of treating centroids as static points in geometric space, this library models them as Gaussian distributions (mean $\mu$, variance $\Sigma$).
This approach solves two critical bottlenecks in the manifold pipeline:
- Speed: Eliminates the expensive "guess and check" loop of optimal-k heuristics.
- Spectral-awareness: Constructs the Graph Laplacian using Statistical Overlap (Bhattacharyya Distance) rather than simple Euclidean proximity, preserving the true manifold structure.
```rust
use kalman_clustering::KalmanClusterer;

fn main() {
    let rows: Vec<Vec<f64>> = load_embeddings();
    let max_k = 256; // Upper bound on the number of clusters
    let n_items = rows.len();

    // Initialize the adaptive clusterer
    let mut clusterer = KalmanClusterer::new(max_k, n_items);

    // Single-pass clustering with automatic K selection
    clusterer.fit(&rows);
    println!("Auto-selected {} centroids", clusterer.centroids.len());

    // Access results
    for (i, centroid) in clusterer.centroids.iter().enumerate() {
        println!(
            "Centroid {}: mean={:?}, var={:?}, count={}",
            i, centroid.mean, centroid.variance, centroid.count
        );
    }
}
```

- 🚀 Single-Pass Learning: Adapts centroids dynamically as data streams in. No need to load the entire dataset into memory at once.
- 🧠 Automatic K-Selection: Dynamically spawns new centroids when data points fall outside the statistical confidence interval (Mahalanobis distance) of existing clusters.
- 🔗 Statistical Wiring: Connects the Graph Laplacian based on probability distribution overlap, ensuring spectral diffusion flows through data densities naturally.
- 📉 Variance Tracking: Each centroid tracks its own uncertainty. High-variance clusters naturally bond with neighbors, while low-variance (sharp) clusters remain distinct.
Every centroid is a Kalman Filter. When a data point is assigned to a centroid, we perform a Measurement Update:
- Predict: Increase variance slightly (Process Noise $Q$) to allow for drift.
- Gain: Compute the Kalman Gain based on current uncertainty.
- Correct: Move the mean $\mu$ towards the data; shrink the variance $\Sigma$.
For each incoming point $x$:

- Compute the Mahalanobis Distance to all centroids: $D_M(x) = \sqrt{ \sum_i \frac{(x_i - \mu_i)^2}{\sigma_i^2} }$
- Split: If $D_M > \text{Threshold}$ (typically $3\sigma$), the point is an outlier: spawn a new centroid.
- Merge: If $D_M \le \text{Threshold}$, update the nearest centroid's state.
Instead of Euclidean distance, we build the Graph Laplacian using Bhattacharyya Distance. This connects centroids that statistically overlap, not just those that are close.
Result: the Laplacian is wired by distribution overlap rather than raw distance:

`Manifold = Laplacian(Transpose(Centroids))`
| Feature | Standard K-Means | Kalman Centroids |
|---|---|---|
| Complexity | $O(n \cdot k \cdot i \cdot t)$ (i iterations, t trials) | $O(n \cdot k)$ (single pass) |
| K-Selection | Slow heuristic (Calinski-Harabasz) | Automatic & Dynamic |
| Metric | Euclidean Only | Mahalanobis (Adaptive) |
| Laplacian | Geometric Proximity | Statistical Overlap |
| Memory | Batch (All Data) | Streaming (Row by Row) |