A single-pass, statistical clustering engine for Graph Wiring to build spectral Graph Laplacians.
kalman_clustering replaces traditional multi-trial K-Means (Lloyd's algorithm) with a Kalman Filter-based adaptive streaming algorithm. Instead of treating centroids as static points in geometric space, this library models them as Gaussian distributions (mean $\mu$, variance $\Sigma$).
This approach solves two critical bottlenecks in the manifold pipeline:
- Speed: Eliminates the expensive "guess and check" loop of optimal-k heuristics.
- Spectral-awareness: Constructs the Graph Laplacian using Statistical Overlap (Bhattacharyya Distance) rather than simple Euclidean proximity, preserving the true manifold structure.
```rust
use kalman_clustering::KalmanClusterer;

fn main() {
    let rows: Vec<Vec<f64>> = load_embeddings();
    let max_k = 256; // Upper bound on the number of clusters
    let n_items = rows.len();

    // Initialize the adaptive clusterer
    let mut clusterer = KalmanClusterer::new(max_k, n_items);

    // Single-pass clustering with automatic K selection
    clusterer.fit(&rows);
    println!("Auto-selected {} centroids", clusterer.centroids.len());

    // Access results
    for (i, centroid) in clusterer.centroids.iter().enumerate() {
        println!(
            "Centroid {}: mean={:?}, var={:?}, count={}",
            i, centroid.mean, centroid.variance, centroid.count
        );
    }
}
```

- 🚀 Single-Pass Learning: Adapts centroids dynamically as data streams in. No need to load the entire dataset into memory at once.
- 🧠 Automatic K-Selection: Dynamically spawns new centroids when data points fall outside the statistical confidence interval (Mahalanobis distance) of existing clusters.
- 🔗 Statistical Wiring: Connects the Graph Laplacian based on probability distribution overlap, ensuring spectral diffusion flows through data densities naturally.
- 📉 Variance Tracking: Each centroid tracks its own uncertainty. High-variance clusters naturally bond with neighbors, while low-variance (sharp) clusters remain distinct.
Every centroid is a Kalman Filter. When a data point is assigned to a centroid, we perform a Measurement Update:
- Predict: Increase variance slightly (Process Noise $Q$) to allow for drift.
- Gain: Compute the Kalman Gain based on current uncertainty.
- Correct: Move the mean $\mu$ towards the data; shrink the variance $\Sigma$.
For each incoming point $x$:

- Compute the Mahalanobis Distance to all centroids: $D_M(x) = \sqrt{ \sum_i \frac{(x_i - \mu_i)^2}{\sigma_i^2} }$
- Split: If $D_M > \text{Threshold}$ (typically $3\sigma$), the point is an outlier: spawn a new centroid.
- Merge: If $D_M \le \text{Threshold}$, update the nearest centroid's state.
Instead of Euclidean distance, we build the Graph Laplacian using Bhattacharyya Distance. This connects centroids that statistically overlap, not just those that are close.
Result: the Laplacian is wired by distribution overlap rather than raw distance:

`Manifold = Laplacian(Transpose(Centroids))`
| Feature | Standard K-Means | Kalman Centroids |
|---|---|---|
| Complexity | $O(n \cdot k \cdot i \cdot t)$ (i iterations, t trials) | $O(n \cdot k)$ (single pass) |
| K-Selection | Slow heuristic (Calinski-Harabasz) | Automatic & Dynamic |
| Metric | Euclidean Only | Mahalanobis (Adaptive) |
| Laplacian | Geometric Proximity | Statistical Overlap |
| Memory | Batch (All Data) | Streaming (Row by Row) |