Clustering algorithms using stochastic analysis and ensemble techniques.
split_by_quantiles(vec, quantiles=0.95, index = None)
Like split_by_vals
, but cuts the vector at specific quantiles
rather than rigid values. Assumes right-continuity of the
cumulative distribution function.
For a vector \(v_i\) let
\[F(x) = \frac{\left|\{i: v_i \leq x\}\right|}{\mathrm{len}( \mathbf{v})}\]be the ratio of components that are less than or equal to \(x\). (This is effectively the cumulative distribution function of \(\mathbf{v}\)’s values.) Given value \(p\in [0,1]\), define the quantile function as
\[Q(p) = \min\{x:p\leq F(x)\}\]The method split_by_quantiles
receives a monotonically increasing
list of quantiles, \((p_1,\dots,p_k)\), each between 0 and 1,
and returns
If any of these clusters are empty, they will be discarded and the clusters relabeled to begin at \(0\) and end at \(N-1\) where \(N\) is the final number of clusters.
Note the asymmetry in that clusters will always contain their upper bound
but not their lower bound (this is in line with the standard
right-continuity of the cumulative distribution function).
To change the orientation of this asymmetry, the user
can pass -vec
as the argument instead.
Arguments | Type | Description | |
---|---|---|---|
vec |
np.ndarray |
A one-dimensional array of values. | |
quantiles |
Keyword | float , list or np.ndarray |
Quantiles which will be used to divide the vector components. |
index |
Keyword | Index |
The Index which labels the indices of vec , and which will be the item set of the returned Aggregation . |
First we provide a direct example which shows the behavior as well as subtleties of the quantile function. We pass the vector \(\mathbf{v} = (0,0,1,1,2,2,3,3,4,4,5,5)\) and the quantiles \(\{0,0.5,0.9\}\):
import stoclust.clustering as clust
import numpy as np
vec = np.array([0,0,1,1,2,2,3,3,4,4])
clust.split_by_quantiles(vec,quantiles=[0.0,0.5,0.9])
Aggregation(
1: Index([2, 3, 4, 5])
0: Index([0, 1])
2: Index([6, 7, 8, 9])
)
Note first that the \(0.5\) quantile divides the vector into components with values \(\{0,1,2\}\) and \(\{3,4\}\). This is because
\[Q(0.5) = \min\{2,3,4\} = 2\]and because our clustering scheme will divide the vector into components less than or equal to \(Q(p)\) and greater than \(Q(p)\).
Next, the \(0\) quantile separates the lowest value \(0\) from the rest; this is because
\[Q(0) = \min\{0,1,2,3,4\} = 0\]and so a cluster is created for elements less than or equal to zero.
Finally, the \(0.9\) quantile has no effect—we have
\[Q(0.9) = \min\{4\} = 4\]and the cluster with elements greater than \(4\) is empty.
The second example is visual. We randomly generate a vector with 50 components, its values drawn from a uniform distribution and then squared. After quantile clustering and plotting the sorted values, we see that the quantiles \(\{0.2,0.7,0.8\}\) result in clusters with 10, 25, 5 and 10 elements, respectively, corresponding proportionally to the gaps in the quantiles. Though the upper quintile only contains \(20\%\) of the components, it covers over \(50\%\) of the range of values.
import stoclust.visualization as viz
vec = np.random.rand(50)**2
agg = clust.split_by_quantiles(vec,quantiles=[0.2,0.7,0.8])
rank = np.empty_like(vec)
rank[np.argsort(vec)] = np.arange(len(vec))
fig = viz.scatter2D(
rank,vec,agg=agg,mode='markers',
layout=dict(
title = 'Sorted vector components (quantile clustering)',
xaxis = dict(
title='Sorted rank'
),
yaxis = dict(
title='Value'
),
legend = dict(
title='Cluster'
)
)
)
fig.show()