shi_malik

↳ clustering

shi_malik(st_mat, eig_thresh=0.95, tol=0, index=None)

Given a square column-stochastic matrix (that is, one with np.sum(st_mat,axis=1)=[1 ... 1]) describing the strength of the relationship between pairs of items, determines an Aggregation of the items using the spectral approach of Shi and Malik.

A column-stochastic matrix \(\mathbf{T}\) will always have a leading eigenvalue of 1 and a leading uniform right-eigenvector, \(\mathbf{u}=(1,...,1)\), which is a fixed point of the map:

\[\sum_{j}T_{ij} u_j = u_i\]

If \(\mathbf{T}\) has no disconnected components then u is the unique fixed point (up to a constant scaling) and the sub-leading eigenvalue is strictly less than one; otherwise, the eigenvalue 1 is degenerate. In the first case, if the sub-leading eigenvalue is close to 1, then the sub-leading right-eigenvector \(\mathbf{y}\) may be used to partition the indices into two slowly-decaying communities.

The Shi-Malik algorithm is recursive, taking the sub-leading eigenvector of \(\mathbf{T}\) (as long as the corresponding eigenvalue is above a threshold), using it to bipartition the indices, and then repeating these steps on the partitions with a reweighted matrix. This implementation cuts the vector \(\mathbf{T}\) by value, by default into components \(y_i>0\) and \(y_i<=0\), because of the orthogonality relationship

\[\left<\mathbf{y}\right>_{\pi} = \sum_i y_i \pi_i = 0\]

which indicates that the mean value of \(\mathbf{y}\) under the stationary distribution \(\pi\)(left-eigenvector of \(\mathbf{T}\)) must always be zero, making this a value of significance.

The algorithm halts when no community has a sub-leading eigenvector above the threshold, and the final partitioning is returned as an Aggregation.

Arguments

Arguments		Type	Description
`st_mat`		`np.ndarray`	A square column-stochastic matrix describing a Markov dynamic.
`eig_thresh`	Keyword	`float`	The smallest value the subleading eigenvalue may have to continue the recursion.
`cut`	Keyword	`float`	The value used to “cut” the subleading eigenvector into two clusters.
`index`	Keyword	`Index`	The `Index` which labels the indices of `st_mat`, and which will be the item set of the returned `Aggregation`.

Example

First we will generate a sample dataset with two touching circles:

import stoclust.visualization as viz
import stoclust as sc
import numpy as np
import pandas as pd

n = 500

samples = np.concatenate([
    sc.examples.gen_disk(num_samples=n,rad1=0,rad2=1),
    sc.examples.gen_disk(num_samples=n,rad1=0,rad2=1)+
    np.array([2,0])
])

agg = sc.Aggregation(
    pd.Index(np.arange(samples.shape[0])),
    pd.Index(['Left','Right']),
    {
        0:np.arange(0,n),
        1:np.arange(n,2*n)
    }
)

Next, we apply the Shi-Malik algorithm with eig_thresh = 0.98. This results in a crisp division into two circles.

dist = sc.distance.euclid(samples)

T = sc.utils.stoch(
    np.exp(-dist/0.05) - np.eye(dist.shape[0])
)

new_agg = clust.shi_malik(T,eig_thresh=0.98)

fig = viz.scatter2D(
    samples[:,0],samples[:,1],
    agg=new_agg, mode='markers',
    text=new_agg.items.elements.astype(str),
    hoverinfo='text,name',
    layout = {
        'margin':dict(l=10, r=10, t=10, b=10)
    }
)

fig.show()

However, note that lowering the threshold even a little to 0.95 results in the appearance of spurious sub-clusters:

dist = sc.distance.euclid(samples)

T = sc.utils.stoch(
    np.exp(-dist/0.05) - np.eye(dist.shape[0])
)

# Note the new value of eig_thresh!
new_agg = clust.shi_malik(T,eig_thresh=0.95)

fig = viz.scatter2D(
    samples[:,0],samples[:,1],
    agg=new_agg, mode='markers',
    text=new_agg.items.elements.astype(str),
    hoverinfo='text,name',
    layout = {
        'margin':dict(l=10, r=10, t=10, b=10)
    }
)

fig.show()

Reference

Jianbo Shi and J. Malik, “Normalized cuts and image segmentation,” in IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 22, no. 8, pp. 888-905, Aug. 2000, doi: 10.1109/34.868688.

stoclust

Home

Documentation

shi_malik

Arguments

Example

Reference