Clustering algorithms using stochastic analysis and ensemble techniques.
fushing_mcassey(st_mat, max_visits=5, time_quantile_cutoff=0.95, index=None)
Given a square column-stochastic matrix
(that is, one with np.sum(st_mat,axis=1)=[1 ... 1]
) describing the strength
of the relationship between pairs of items,
determines an Aggregation
of the items using
the regulated random walk approach of Fushing and McAssey.
The algorithm is inherently random
and highly unstable as a single-shot approach,
but may be used in an ensemble to determine a
useful similarity matrix.
Suppose st_mat
is given by the Markov matrix \(\mathbf{T}\).
A regulated random walk is taken using \(\mathbf{T}\) as the initial
transition probabilities, and modifying these probabilities
to remove from circulation any node which has been visited
at least max_visits
times (this prevents the walk from
being stuck in a cluster for too long). The time between removals
is recorded; the highest values (determined by time_quantile_cutoff
)
determine the number of clusters (it is interpreted that a sudden, long
removal time after many short removal times indicates
one has left a highly-explored cluster and entered an unexplored one).
A node which was removed and for which \(>50\%\) of its visits prior to removal were in a particular time-interval is placed in the cluster associated with that time interval; all other nodes remain unclustered.
This algorithm will not return useful results after a single run, but if an ensemble of runs is collected it may be used to derive a similarity matrix, based on how often two nodes are in a cluster together over the many runs.
Arguments | Type | Description | |
---|---|---|---|
st_mat |
np.ndarray |
A square stochastic matrix describing a Markov dynamic. | |
max_visits |
Keyword | int |
The maximum number of visits to a node before it is removed in the regulated random walk. |
time_quantile_cutoff |
Keyword | float |
The quantile of the length of time between node removals, which is used to determine the number of clusters. |
index |
Keyword | Index |
The Index which labels the indices of st_mat , and which will be the item set of the returned Aggregation . |
First we will generate a sample dataset with two concentric rings:
import stoclust.visualization as viz
import stoclust as sc
import numpy as np
import pandas as pd
n1 = 200
n2 = 50
samples = np.concatenate([
sc.examples.gen_disk(
num_samples=n1,
rad1=1.3,rad2=1.6
),
sc.examples.gen_disk.gen_disk(
num_samples=n2,
rad1=0,rad2=0.2
)
])
agg = sc.Aggregation(
pd.Index(np.arange(n1+n2)),
pd.Index(['Outer','Inner']),
{
0:np.arange(0,n1),
1:np.arange(n1,n1+n2)
}
)
viz.scatter2D(
samples[:,0],samples[:,1],
agg=agg, mode='markers',
text=agg.items.elements.astype(str),
hoverinfo='text',
layout = {
'margin':dict(l=10, r=10, t=10, b=10)
}
)
The Fushing-McAssey algorithm is not designed to be only used once, like other clustering algorithms in this package; rather, it is intended to be used with the ensemble feature.
The following is also an example of applying
the stoclust.ensemble.random_clustering
method,
which applies the given clustering method a given number
of times in order to generate an ensemble of
block matrices, which can then be averaged to give
an overall similarity matrix, which we visualize.
dist = sc.distance.euclid(samples)
T = sc.utils.stoch(
np.exp(-dist/0.05) - np.eye(dist.shape[0])
)
ens = sc.ensemble.random_clustering(
T,
clustering_method = sc.clustering.fushing_mcassey,
ensemble_size = 500
)
fig = viz.heatmap(
np.average(ens,axis=0)-np.eye(ens.shape[1])
)
fig.show()
This similarity matrix shows a clear separation between the first 200 indices (the outer ring) and the final 50 indices (the inner ring). Normalizing this similarity matrix and using the subleading eigenvalue for spectral clustering gives the appropriate division:
import scipy.linalg as la
TT = sc.utils.stoch(
np.average(ens,axis=0)-np.eye(ens.shape[1])
)
new_agg = sc.clustering.shi_malik(
TT,
eig_thresh=np.sort(la.eig(TT)[0])[-2]
)
fig = viz.scatter2D(
samples[:,0],samples[:,1],
agg=new_agg, mode='markers',
text=new_agg.items.elements.astype(str),
hoverinfo='text',
layout = {
'margin':dict(l=10, r=10, t=10, b=10)
}
)
fig.show()
H. Fushing and M. P. McAssey, “Time, temperature, and data cloud geometry,” in Phys. Rev. E, vol. 82, 061110, Dec. 2010, doi: 10.1103/PhysRevE.82.061110.