Selected clustering algorithms
implementation details

This page gives a brief description and the source code of selected clustering algorithms, mostly from [Fränti2006]. The algorithms have been selected so that they represent the best ones in the class of iterative and hierarchical algorithms, or because they are popular due to their simplicity or other reasons. The EM algorithm for Gaussian mixture models (GMM) has also been included.

Method             Documentation     Implementation    Recommended parameters
Random             -                 C*                -c -n1
Repeated K-means   [DH73]            C*                -r10
SOM                [Nasra88]         C*                -b2 -i1000 -r10,1
PNN                [Fränti2000b]     C*, pseudocode    -
SPLIT              [Fränti1997a]     C*                -g0 -e6 -h2,0 -l
SM                 [Kauko1998]       C*                -e6 -h2,1 -l -d -g2 -i999 -a0,(size)**
RS                 [Fränti2000a]     C*, pseudocode    -t2500
GA                 [Fränti2000c]     C*, pseudocode    -z10 -n0 -g2
FCM                [Dunn1974]        C++               -g200 -f10
EM (GMM)           [McLachlan2001]   C++               -r 10,10,yes

* Compiling the VQ methods requires the modules package.
** Use the same value as the codebook size.
*** All methods also require the codebook size as a parameter (usually -s).

Random: A trivial method for modeling the data is to construct the codebook from K randomly chosen data vectors. The random codebook is used as the initial solution for the iterative algorithms described below, but also as a reference solution for measuring the quality of the clustering algorithms. A good clustering algorithm should produce a significantly better codebook than random selection.
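
For illustration, a minimal sketch in Python/NumPy of random codebook selection (the implementation linked above is in C; the function name and arguments here are illustrative only):

```python
import numpy as np

def random_codebook(X, K, seed=None):
    """Pick K distinct training vectors at random as the codebook."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(X), size=K, replace=False)
    return X[idx].copy()
```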

Repeated K-means: K-means [McQueen] starts from any initial solution, which is then iteratively improved by two optimization steps as long as improvement is achieved. The method is known as LBG (Linde-Buzo-Gray) or GLA (Generalized Lloyd Algorithm) in the vector quantization literature [LBG80]. Since the quality of K-means is sensitive to the initial solution, we apply the method repeatedly, each time starting from a new random initial solution [DH73]. The codebook providing the smallest MSE is retained as the final solution. We use the fast reduced-search variant of K-means [Kauko2000].
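
A compact Python/NumPy sketch of the repeated K-means idea follows (a plain textbook K-means is used here rather than the reduced-search variant of [Kauko2000]; the default repeats=10 mirrors the -r10 parameter above, but otherwise the names and defaults are illustrative):

```python
import numpy as np

def kmeans(X, codebook, max_iter=100):
    """Plain K-means: alternate partition and centroid steps until convergence."""
    for _ in range(max_iter):
        d = ((X[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        new_cb = np.array([X[labels == k].mean(axis=0) if np.any(labels == k)
                           else codebook[k] for k in range(len(codebook))])
        if np.allclose(new_cb, codebook):
            break
        codebook = new_cb
    return codebook

def mse(X, codebook):
    """Mean squared error of X quantized with the given codebook."""
    d = ((X[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    return d.min(axis=1).mean()

def repeated_kmeans(X, K, repeats=10, seed=0):
    """Restart K-means from new random codebooks; keep the lowest-MSE solution."""
    rng = np.random.default_rng(seed)
    candidates = []
    for _ in range(repeats):
        init = X[rng.choice(len(X), K, replace=False)]
        candidates.append(kmeans(X, init))
    return min(candidates, key=lambda cb: mse(X, cb))
```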

SOM: The self-organizing map [Nasra88] is a neural network approach to the clustering problem. The neurons in the network are connected with a 1-D or 2-D structure, and they correspond to the code vectors. Each feature vector is fed to the network by finding the nearest code vector. The best-matching code vector and its neighboring vectors in the network are updated by moving them towards the input vector. After the training set has been processed a predefined number of times, the neighborhood size is shrunk. The entire process is repeated until the neighborhood size reaches zero.
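
A simplified 1-D SOM sketch in Python/NumPy (the learning-rate schedule and epoch count are illustrative assumptions, not the parameters of the linked implementation):

```python
import numpy as np

def som_1d(X, K, epochs=5, lr=0.3, seed=0):
    """Train a 1-D self-organizing map whose K neurons act as code vectors."""
    rng = np.random.default_rng(seed)
    codebook = X[rng.choice(len(X), K, replace=False)].astype(float)
    radius = K // 2
    while radius >= 0:                    # repeat until the neighborhood shrinks to zero
        for _ in range(epochs):           # fixed number of passes over the training set
            for x in X[rng.permutation(len(X))]:
                winner = ((codebook - x) ** 2).sum(axis=1).argmin()
                lo, hi = max(0, winner - radius), min(K, winner + radius + 1)
                # Move the winner and its 1-D neighbors towards the input vector.
                codebook[lo:hi] += lr * (x - codebook[lo:hi])
        radius -= 1                       # shrink the neighborhood after each round
        lr *= 0.8                         # and decay the learning rate
    return codebook
```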

PNN: Pairwise nearest neighbor [Equitz89] [Fränti2000b] generates the codebook hierarchically. It starts by initializing each feature vector as a separate code vector. Two code vectors are merged at each step of the algorithm, and the process is repeated until the desired codebook size is reached. The code vectors to be merged are always the ones whose merge increases the distortion the least. We recommend the fast exact PNN method introduced in [Fränti2000b]. Its main benefits are simplicity of implementation and reasonably good quality of the clustering result.
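
The following Python/NumPy sketch shows the naive PNN, which searches all pairs at every step; the fast exact PNN of [Fränti2000b] computes the same result but avoids this brute-force search:

```python
import numpy as np

def pnn(X, K):
    """Naive pairwise nearest neighbor: start with every vector as its own cluster
    and repeatedly merge the pair whose merge increases distortion the least."""
    centroids = [x.astype(float) for x in X]
    sizes = [1] * len(X)
    while len(centroids) > K:
        best = (np.inf, 0, 1)
        for i in range(len(centroids)):
            for j in range(i + 1, len(centroids)):
                # Increase in squared-error distortion caused by merging i and j.
                cost = (sizes[i] * sizes[j] / (sizes[i] + sizes[j])
                        * ((centroids[i] - centroids[j]) ** 2).sum())
                if cost < best[0]:
                    best = (cost, i, j)
        _, i, j = best
        n = sizes[i] + sizes[j]
        centroids[i] = (sizes[i] * centroids[i] + sizes[j] * centroids[j]) / n
        sizes[i] = n
        del centroids[j], sizes[j]
    return np.array(centroids)
```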

SPLIT: The opposite, top-down approach starts with a single cluster containing all the feature vectors. New clusters are then created one at a time by dividing existing clusters. The splitting process is repeated until the desired number of clusters is reached. This approach usually requires much less computation than the PNN. We use the method in [Fränti1997a], which always selects the optimal hyperplane dividing the particular cluster along its principal axis, augmented with a local repartitioning phase at each division step. This SPLIT variant gives results comparable to those of the PNN with a much faster algorithm, at the cost of a significantly more complicated implementation.
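
A simplified Python/NumPy sketch of the splitting idea is shown below. It splits the highest-distortion cluster at its centroid along the principal axis; the optimal-hyperplane selection and the local repartitioning phase of [Fränti1997a] are omitted, and K is assumed not to exceed the number of distinct vectors:

```python
import numpy as np

def split_codebook(X, K):
    """Top-down splitting: repeatedly split a cluster along its principal axis
    until K clusters exist (simplified; no local repartitioning step)."""
    clusters = [X.astype(float)]
    while len(clusters) < K:
        # Pick the cluster with the largest total squared error.
        errors = [((c - c.mean(axis=0)) ** 2).sum() if len(c) > 1 else -1.0
                  for c in clusters]
        c = clusters.pop(int(np.argmax(errors)))
        # Principal axis = direction of largest variance within the cluster.
        centered = c - c.mean(axis=0)
        _, _, vt = np.linalg.svd(centered, full_matrices=False)
        side = centered @ vt[0] >= 0
        if side.all() or not side.any():
            side = np.arange(len(c)) % 2 == 0   # degenerate cluster: split arbitrarily
        clusters += [c[side], c[~side]]
    return np.array([c.mean(axis=0) for c in clusters])
```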

SM: Split-and-Merge [Kauko1998] is an iterative method that modifies the codebook by a series of split and merge operations. At every step, the code vectors to be split and merged are chosen as the ones that provide the best improvement (split) or the least increase (merge) in distortion. The method produces high-quality codebooks, but with a significantly more complex implementation than the other methods.
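
The sketch below illustrates the split-and-merge principle in Python/NumPy under strong simplifications: each round merges the cheapest pair of clusters and splits the highest-distortion cluster along its principal axis, and a round is kept only if the total distortion decreases. This is not the algorithm of [Kauko1998], only a rough illustration of the operations it is built from:

```python
import numpy as np

def _codebook_and_error(X, groups):
    cb = np.array([g.mean(axis=0) for g in groups])
    d = ((X[:, None, :] - cb[None, :, :]) ** 2).sum(axis=2)
    return cb, d.min(axis=1).sum()

def split_and_merge(X, init_codebook, rounds=50):
    """Simplified split-and-merge, keeping the cluster count fixed."""
    d = ((X[:, None, :] - init_codebook[None, :, :]) ** 2).sum(axis=2)
    labels = d.argmin(axis=1)
    groups = [X[labels == k] for k in range(len(init_codebook)) if np.any(labels == k)]
    codebook, best_err = _codebook_and_error(X, groups)
    for _ in range(rounds):
        cand = [g.copy() for g in groups]
        # Merge: the pair of clusters whose merge increases distortion the least.
        cents = [g.mean(axis=0) for g in cand]
        best = (np.inf, 0, 1)
        for i in range(len(cand)):
            for j in range(i + 1, len(cand)):
                cost = (len(cand[i]) * len(cand[j]) / (len(cand[i]) + len(cand[j]))
                        * ((cents[i] - cents[j]) ** 2).sum())
                if cost < best[0]:
                    best = (cost, i, j)
        _, i, j = best
        cand[i] = np.vstack([cand[i], cand[j]])
        del cand[j]
        # Split: the cluster with the largest squared error, along its principal axis.
        errs = [((g - g.mean(axis=0)) ** 2).sum() for g in cand]
        g = cand.pop(int(np.argmax(errs)))
        centered = g - g.mean(axis=0)
        _, _, vt = np.linalg.svd(centered, full_matrices=False)
        side = centered @ vt[0] >= 0
        if not (side.any() and (~side).any()):
            continue                            # degenerate cluster: skip this round
        cand += [g[side], g[~side]]
        new_cb, err = _codebook_and_error(X, cand)
        if err < best_err:                      # keep only improving rounds
            groups, codebook, best_err = cand, new_cb, err
    return codebook
```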

RS: The Random Swap [Fränti2000a] algorithm starts with a random codebook, which is then iteratively improved. At every step, a randomly selected code vector is tentatively re-allocated to a randomly chosen training vector. The new candidate codebook is fine-tuned by two iterations of K-means; the candidate is then evaluated and accepted if it improves on the previous solution. The algorithm is run for a fixed number of iterations. This trial-and-error approach is simple to implement and surprisingly effective. The efficiency of the method has been studied in [Fränti2008].
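
A minimal Python/NumPy sketch of random swap (the default iteration count mirrors the -t2500 parameter above; everything else is illustrative):

```python
import numpy as np

def random_swap(X, K, iterations=2500, seed=0):
    """Random swap: replace one random code vector with a random training vector,
    fine-tune with two K-means iterations, and accept the swap only if MSE improves."""
    rng = np.random.default_rng(seed)

    def two_kmeans_iters(cb):
        for _ in range(2):
            labels = ((X[:, None, :] - cb[None, :, :]) ** 2).sum(axis=2).argmin(axis=1)
            cb = np.array([X[labels == k].mean(axis=0) if np.any(labels == k)
                           else cb[k] for k in range(K)])
        return cb

    def mse(cb):
        return ((X[:, None, :] - cb[None, :, :]) ** 2).sum(axis=2).min(axis=1).mean()

    codebook = X[rng.choice(len(X), K, replace=False)].astype(float)
    best = mse(codebook)
    for _ in range(iterations):
        cand = codebook.copy()
        cand[rng.integers(K)] = X[rng.integers(len(X))]   # the tentative swap
        cand = two_kmeans_iters(cand)
        err = mse(cand)
        if err < best:
            codebook, best = cand, err
    return codebook
```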

FCM: Fuzzy C-means [Dunn1974] generalizes K-means to fuzzy clustering, in which data vectors can belong to several clusters at the same time, each with a given membership weight. Traditional K-means is then applied in the final step in order to obtain the centroids (codebook) from the fuzzy partitions.
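
A Python/NumPy sketch of the fuzzy C-means iteration is given below; the fuzziness exponent m=2 is an assumption, and the final K-means refinement mentioned above is omitted:

```python
import numpy as np

def fuzzy_cmeans(X, K, m=2.0, iters=100, seed=0, eps=1e-9):
    """Fuzzy C-means: alternate membership and centroid updates."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), K, replace=False)].astype(float)
    for _ in range(iters):
        # Membership of vector x_i in cluster k, based on relative distances.
        d = np.maximum(((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2), eps)
        u = 1.0 / (d ** (1.0 / (m - 1)))
        u /= u.sum(axis=1, keepdims=True)
        # Weighted centroid update.
        w = u ** m
        centroids = (w.T @ X) / w.sum(axis=0)[:, None]
    return centroids
```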

GA: The genetic algorithm [Fränti1997b] [Fränti2000c] generates a set of solutions (called the population) and evolves it using the survival-of-the-fittest principle. PNN is used as the crossover to generate new candidates, which are further fine-tuned by K-means. The algorithm has so far outperformed every competing clustering algorithm, and has done so for more than a decade. Slightly better results have been reported only by other (more complicated) GA variants [Kivijärvi2003] [Fränti2006]. It therefore serves as a reference point for how well one can do with this clustering model.
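
The following Python/NumPy sketch illustrates the overall structure only: random initial population, a PNN-style crossover that merges the union of two parent codebooks back down to K vectors, K-means fine-tuning, and elitist selection. The real crossover in [Fränti2000c] uses the parents' partitions (cluster sizes), which is simplified away here:

```python
import numpy as np

def _mse(X, cb):
    return ((X[:, None, :] - cb[None, :, :]) ** 2).sum(axis=2).min(axis=1).mean()

def _kmeans_iters(X, cb, iters=2):
    for _ in range(iters):
        labels = ((X[:, None, :] - cb[None, :, :]) ** 2).sum(axis=2).argmin(axis=1)
        cb = np.array([X[labels == k].mean(axis=0) if np.any(labels == k)
                       else cb[k] for k in range(len(cb))])
    return cb

def _pnn_reduce(vectors, K):
    """Merge the combined parent code vectors pairwise down to K (crossover)."""
    cents, sizes = [v.astype(float) for v in vectors], [1] * len(vectors)
    while len(cents) > K:
        best = (np.inf, 0, 1)
        for i in range(len(cents)):
            for j in range(i + 1, len(cents)):
                cost = (sizes[i] * sizes[j] / (sizes[i] + sizes[j])
                        * ((cents[i] - cents[j]) ** 2).sum())
                if cost < best[0]:
                    best = (cost, i, j)
        _, i, j = best
        cents[i] = (sizes[i] * cents[i] + sizes[j] * cents[j]) / (sizes[i] + sizes[j])
        sizes[i] += sizes[j]
        del cents[j], sizes[j]
    return np.array(cents)

def genetic_clustering(X, K, pop_size=10, generations=20, seed=0):
    """Simplified GA: PNN-style crossover of two parents, K-means fine-tuning,
    survival of the fittest."""
    rng = np.random.default_rng(seed)
    pop = [X[rng.choice(len(X), K, replace=False)].astype(float) for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=lambda cb: _mse(X, cb))                 # fittest first
        offspring = []
        for _ in range(pop_size):
            a, b = rng.choice(len(pop) // 2, size=2, replace=False)   # two good parents
            child = _pnn_reduce(np.vstack([pop[a], pop[b]]), K)
            offspring.append(_kmeans_iters(X, child))
        pop = sorted(pop + offspring, key=lambda cb: _mse(X, cb))[:pop_size]
    return pop[0]
```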

GMM: The Expectation-Maximization (EM) algorithm [Bishop96] [McLachlan2001] is used for training a Gaussian Mixture Model (GMM) as described in [Reynolds95]. An initial guess is made for the model parameters, and the solution is then improved by two optimization steps, similarly as in K-means. Since the EM algorithm is also sensitive to the initialization, it is recommended to start it repeatedly from a new random solution, which is first fine-tuned by K-means. The result providing the highest likelihood is retained as the final model. In the implementation, diagonal covariance matrices are used instead of full covariance matrices. This is the more common variant in speech and speaker recognition applications, because it requires much less training data and is computationally more efficient, as full covariance matrices would require matrix inversions.
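
A Python/NumPy sketch of EM for a diagonal-covariance GMM follows; the random initialization here skips the recommended K-means fine-tuning and repeated restarts, and all names and defaults are illustrative:

```python
import numpy as np

def em_gmm_diag(X, K, iters=100, seed=0, min_var=1e-6):
    """EM training of a Gaussian mixture with diagonal covariances."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    means = X[rng.choice(n, K, replace=False)].astype(float)
    variances = np.tile(X.var(axis=0) + min_var, (K, 1))
    weights = np.full(K, 1.0 / K)
    for _ in range(iters):
        # E-step: responsibilities of each component for each vector (log domain).
        log_p = (-0.5 * (((X[:, None, :] - means[None]) ** 2) / variances[None]
                         + np.log(2 * np.pi * variances[None])).sum(axis=2)
                 + np.log(weights)[None, :])
        log_norm = np.logaddexp.reduce(log_p, axis=1, keepdims=True)
        resp = np.exp(log_p - log_norm)
        # M-step: re-estimate weights, means and diagonal variances.
        nk = resp.sum(axis=0) + 1e-12
        weights = nk / n
        means = (resp.T @ X) / nk[:, None]
        variances = (resp.T @ (X ** 2)) / nk[:, None] - means ** 2 + min_var
    return weights, means, variances, log_norm.sum()   # parameters and log-likelihood
```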

References

[Bishop96] C.M. Bishop, Neural Networks for Pattern Recognition, Oxford University Press, New York, 1996.
[DH73] R.O. Duda and P.E. Hart, Pattern Classification and Scene Analysis, John Wiley and Sons, New York, 1973.
[Dunn1974] J.C. Dunn, "A fuzzy relative of the ISODATA process and its use in detecting compact well-separated clusters", Journal of Cybernetics, 3 (3), 32-57, 1974.
[Equitz89] W.H. Equitz, "A New Vector Quantization Clustering Algorithm", IEEE Trans. on Acoustics, Speech, and Signal Processing, 37 (10), 1568-1575, 1989.
[Fränti1997a] P. Fränti, T. Kaukoranta and O. Nevalainen, "On the Splitting Method for Vector Quantization Codebook Generation", Optical Engineering, 36(11), 3043-3051, 1997.
[Fränti1997b] P. Fränti, J. Kivijärvi, T. Kaukoranta, O. Nevalainen, "Genetic algorithms for large scale clustering problem", The Computer Journal, 40 (9), 547-554, 1997.
[Fränti2000a] P. Fränti and J. Kivijärvi, "Randomized Local Search Algorithm for the Clustering Problem", Pattern Analysis and Applications, 3(4), pp. 358-369, 2000.
[Fränti2000b] P. Fränti, T. Kaukoranta, D.-F. Shen and K.-S. Chang, "Fast and Memory Efficient Implementation of the Exact PNN", IEEE Trans. on Image Processing, 9(5), pp. 773-777, 2000.
[Fränti2000c] P. Fränti, "Genetic algorithm with deterministic crossover for vector quantization", Pattern Recognition Letters, 21 (1), 61-68, 2000.
[Fränti2006] P. Fränti and O. Virmajoki, "Iterative shrinking method for clustering problems", Pattern Recognition, 39 (5), 761-765, May 2006.
[Fränti2008] P. Fränti, O. Virmajoki and V. Hautamäki, "Probabilistic clustering by random swap algorithm", IAPR Int. Conf. on Pattern Recognition (ICPR'08), Tampa, Florida, USA, December 2008.
[Kauko1998] T. Kaukoranta, P. Fränti and O. Nevalainen, "Iterative Split-and-Merge Algorithm for VQ Codebook Generation", Optical Engineering, 37(10), pp. 2726-2732, 1998.
[Kauko2000] T. Kaukoranta, P. Fränti and O. Nevalainen, "A fast exact GLA based on code vector activity detection", IEEE Transactions on Image Processing, 9 (8), 1337-1342, August 2000.
[Kivijärvi2003] J. Kivijärvi, P. Fränti and O. Nevalainen, "Self-adaptive genetic algorithm for clustering", Journal of Heuristics, 9 (2), 113-129, March 2003.
[LBG80] Y. Linde, A. Buzo and R.M. Gray, "An Algorithm for Vector Quantizer Design". IEEE Trans. on Communications, 28(1), pp. 84-95, 1980.
[McLachlan2001] G. McLachlan and D. Peel, Finite Mixture Models, John Wiley & Sons, Brisbane, 2001.
[McQueen] J.B. McQueen, "Some methods for classification and analysis of multivariate observations". 5th Berkeley Symposium on Mathematical Statistics and Probability, pp. 281-297, Berkeley, 1967.
[Nasra88] N.M. Nasrabadi and Y. Feng, "Vector Quantization of Images Based upon the Kohonen Self-Organization Feature Maps", Neural Networks, 1, p. 518, 1988.
[Reynolds95] D.A. Reynolds and R.C. Rose, "Robust Text-Independent Speaker Identification Using Gaussian Mixture Speaker Models", IEEE Trans. Speech and Audio Processing, 3(1), pp. 72-83, 1995.
