Canopy clustering algorithm

The canopy clustering algorithm is an unsupervised pre-clustering algorithm, often used as a preprocessing step for the K-means algorithm or hierarchical clustering.

It is intended to speed up clustering operations on large data sets, where using another algorithm directly may be impractical due to the size of the data set.

The algorithm proceeds as follows:

  • Cheaply partition the data into overlapping subsets (called "canopies"), as sketched below
  • Perform the more expensive clustering, but only within these canopies
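
As a rough illustration, the following Python sketch shows one common formulation of the canopy-formation step. It assumes two user-chosen distance thresholds T1 > T2 and a cheap, approximate distance function; the names used here (build_canopies, cheap_distance) are illustrative and not taken from the original paper.

  import random

  def build_canopies(points, cheap_distance, t1, t2):
      """Group points into overlapping canopies using a cheap distance measure."""
      candidates = list(points)   # points still eligible to seed or join new canopies
      canopies = []
      while candidates:
          # Pick an arbitrary remaining point as the canopy center.
          center = random.choice(candidates)
          # Every remaining point within the loose threshold T1 joins this canopy.
          canopy = [p for p in candidates if cheap_distance(center, p) < t1]
          canopies.append(canopy)
          # Points within the tight threshold T2 are removed entirely,
          # so they cannot seed or join any further canopy.
          candidates = [p for p in candidates if cheap_distance(center, p) >= t2]
      return canopies

In this sketch a point may end up in more than one canopy; the subsequent, more expensive clustering step then only compares points that share a canopy.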

Since the algorithm uses distance functions and requires the specification of distance thresholds, its applicability to high-dimensional data is limited by the curse of dimensionality. Only when a cheap and approximate (low-dimensional) distance function is available will the resulting canopies preserve the clusters produced by K-means.

The method first appeared in a paper by Andrew McCallum, Kamal Nigam and Lyle Ungar.

Benefits

  • The number of instances of training data that must be compared at each step is reduced
  • There is some evidence that the resulting clusters are improved[1]

References

  • McCallum, A.; Nigam, K.; Ungar, L. H. (2000). "Efficient Clustering of High Dimensional Data Sets with Application to Reference Matching". Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 169–178. doi:10.1145/347090.347123