CURE algorithm - Revision history

David Pal: Improve formatting

2025-03-29T18:03:11Z

Improve formatting

← Previous revision		Revision as of 18:03, 29 March 2025
Line 15:		Line 15:
	To avoid the problems with non-uniform sized or shaped clusters, CURE employs a [[hierarchical clustering]] algorithm that adopts a [[wikt:middle ground\|middle ground]] between the centroid based and all point extremes. In CURE, a constant number c of well scattered points of a cluster are chosen and they are shrunk towards the centroid of the cluster by a fraction α. The scattered points after shrinking are used as representatives of the cluster. The clusters with the closest pair of representatives are the clusters that are merged at each step of CURE's hierarchical clustering algorithm. This enables CURE to correctly identify the clusters and makes it less sensitive to outliers.		To avoid the problems with non-uniform sized or shaped clusters, CURE employs a [[hierarchical clustering]] algorithm that adopts a [[wikt:middle ground\|middle ground]] between the centroid based and all point extremes. In CURE, a constant number c of well scattered points of a cluster are chosen and they are shrunk towards the centroid of the cluster by a fraction α. The scattered points after shrinking are used as representatives of the cluster. The clusters with the closest pair of representatives are the clusters that are merged at each step of CURE's hierarchical clustering algorithm. This enables CURE to correctly identify the clusters and makes it less sensitive to outliers.

	Running time is O(''n~~''<sup>~~2~~</sup>~~ log ''n''), making it rather expensive, and [[Computational complexity theory\|space complexity]] is O(''n'').		Running time is <math>O(n^2 \log n)</math>, making it rather expensive, and [[Computational complexity theory\|space complexity]] is <math>O(n)</math>.

	The algorithm cannot be directly applied to large databases because of the high runtime complexity. Enhancements address this requirement.		The algorithm cannot be directly applied to large databases because of the high runtime complexity. Enhancements address this requirement.

	* Random sampling : [[Sampling (statistics)\|random sampling]] supports large data sets. Generally the [[random sample]] fits in [[Primary storage\|main memory]]. The random sampling involves a [[Trade-off\|trade off]] between accuracy and efficiency.		* Random sampling: [[Sampling (statistics)\|random sampling]] supports large data sets. Generally the [[random sample]] fits in [[Primary storage\|main memory]]. The random sampling involves a [[Trade-off\|trade off]] between accuracy and efficiency.
	* Partitioning : The basic idea is to partition the [[sample space]] into ''p'' partitions. Each partition contains ''n/p'' elements. The first pass partially clusters each partition until the final number of clusters reduces to ''n/pq'' for some constant q ≥ 1. A second clustering pass on ''n/q'' partially clusters partitions. For the second pass only the representative points are stored since the merge procedure only requires representative points of previous clusters before computing the representative points for the merged cluster. Partitioning the input reduces the execution times.		* Partitioning: The basic idea is to partition the [[sample space]] into ''p'' partitions. Each partition contains ''n/p'' elements. The first pass partially clusters each partition until the final number of clusters reduces to ''n/pq'' for some constant q ≥ 1. A second clustering pass on ''n/q'' partially clusters partitions. For the second pass only the representative points are stored since the merge procedure only requires representative points of previous clusters before computing the representative points for the merged cluster. Partitioning the input reduces the execution times.
	* Labeling data on disk : Given only representative points for ''k'' clusters, the remaining data points are also assigned to the clusters. For this a fraction of randomly selected representative points for each of the ''k'' clusters is chosen and data point is assigned to the cluster containing the representative point closest to it.		* Labeling data on disk: Given only representative points for ''k'' clusters, the remaining data points are also assigned to the clusters. For this a fraction of randomly selected representative points for each of the ''k'' clusters is chosen and data point is assigned to the cluster containing the representative point closest to it.
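The labeling step described in the last bullet can be sketched roughly as follows. This is a minimal illustration, not the article's or any library's implementation: the function name `label_points` and the dictionary layout are assumptions, and the random selection of a fraction of the representatives is assumed to have happened before the call.

```python
import math

def label_points(points, cluster_reps):
    """Assign each point to the cluster owning its nearest representative.

    cluster_reps is a hypothetical mapping: cluster id -> list of
    representative points (tuples of coordinates).
    """
    labels = []
    for p in points:
        # Distance from p to a cluster = distance to its closest representative.
        nearest = min(
            cluster_reps,
            key=lambda cid: min(math.dist(p, r) for r in cluster_reps[cid]),
        )
        labels.append(nearest)
    return labels

reps = {"a": [(0.0, 0.0), (1.0, 1.0)], "b": [(10.0, 10.0)]}
labels = label_points([(2.0, 2.0), (9.0, 9.0)], reps)
```

Because only the representatives are held in memory, the remaining points can be streamed from disk one at a time.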

	== Pseudocode ==		== Pseudocode ==
Line 27:		Line 27:
	'''CURE (no. of points,''k'')'''		'''CURE (no. of points,''k'')'''

	Input : A set of points S		Input: A set of points S

	Output : ''k'' clusters		Output: ''k'' clusters

	* For every cluster u (each input point), in u.mean and u.rep store the mean of the points in the cluster and a set of ''c'' representative points of the cluster (initially ''c'' = 1 since each cluster has one data point). Also u.closest stores the cluster closest to u.		* For every cluster u (each input point), in u.mean and u.rep store the mean of the points in the cluster and a set of ''c'' representative points of the cluster (initially ''c'' = 1 since each cluster has one data point). Also u.closest stores the cluster closest to u.
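How the representative points u.rep might be computed can be sketched as below. The sketch assumes a farthest-point heuristic for choosing "well scattered" points, and the names `centroid` and `representatives` are illustrative rather than taken from the article.

```python
import math

def centroid(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))

def representatives(points, c, alpha):
    """Pick up to c scattered points, then shrink them toward the mean by alpha."""
    mean = centroid(points)
    scattered = []
    for _ in range(min(c, len(points))):
        candidates = [p for p in points if p not in scattered]
        if not candidates:
            break
        if scattered:
            # Later picks: the point farthest from everything chosen so far.
            pick = max(candidates,
                       key=lambda p: min(math.dist(p, q) for q in scattered))
        else:
            # First pick: the point farthest from the cluster mean.
            pick = max(candidates, key=lambda p: math.dist(p, mean))
        scattered.append(pick)
    # Shrinking: move each chosen point a fraction alpha of the way to the mean.
    return [tuple(x + alpha * (m - x) for x, m in zip(p, mean))
            for p in scattered]

square = [(0.0, 0.0), (4.0, 0.0), (0.0, 4.0), (4.0, 4.0)]
reps = representatives(square, c=2, alpha=0.5)
```

With alpha = 0 the representatives are the scattered points themselves, and with alpha = 1 they all collapse onto the centroid, which corresponds to the two extremes the article contrasts.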

VulcanSphere: Importing Wikidata short description: "Data clustering algorithm" (Shortdesc helper)

2022-04-29T22:09:28Z

Importing Wikidata short description: "Data clustering algorithm" (Shortdesc helper)

← Previous revision		Revision as of 22:09, 29 April 2022
Line 1:		Line 1:
			{{Short description\|Data clustering algorithm}}
	{{Machine learning\|Clustering}}		{{Machine learning\|Clustering}}
	'''CURE''' (Clustering Using REpresentatives) is an efficient [[data clustering]] algorithm for large [[database]]s{{citation needed\|date=May 2018}}. Compared with [[K-means clustering]] it is more [[robust statistics\|robust]] to [[outlier]]s and able to identify clusters having non-spherical shapes and size variances.		'''CURE''' (Clustering Using REpresentatives) is an efficient [[data clustering]] algorithm for large [[database]]s{{citation needed\|date=May 2018}}. Compared with [[K-means clustering]] it is more [[robust statistics\|robust]] to [[outlier]]s and able to identify clusters having non-spherical shapes and size variances.

Cnwilliams: Disambiguated: middle ground → wikt:middle ground

2021-10-19T09:02:06Z

Disambiguated: middle ground → wikt:middle ground

← Previous revision		Revision as of 09:02, 19 October 2021
Line 12:		Line 12:
	== CURE clustering algorithm ==		== CURE clustering algorithm ==

	To avoid the problems with non-uniform sized or shaped clusters, CURE employs a [[hierarchical clustering]] algorithm that adopts a [[middle ground~~]]{{Disambiguation needed~~\|~~date=October~~ ~~2021}}~~ between the centroid based and all point extremes. In CURE, a constant number c of well scattered points of a cluster are chosen and they are shrunk towards the centroid of the cluster by a fraction α. The scattered points after shrinking are used as representatives of the cluster. The clusters with the closest pair of representatives are the clusters that are merged at each step of CURE's hierarchical clustering algorithm. This enables CURE to correctly identify the clusters and makes it less sensitive to outliers.		To avoid the problems with non-uniform sized or shaped clusters, CURE employs a [[hierarchical clustering]] algorithm that adopts a [[wikt:middle ground\|middle ground]] between the centroid based and all point extremes. In CURE, a constant number c of well scattered points of a cluster are chosen and they are shrunk towards the centroid of the cluster by a fraction α. The scattered points after shrinking are used as representatives of the cluster. The clusters with the closest pair of representatives are the clusters that are merged at each step of CURE's hierarchical clustering algorithm. This enables CURE to correctly identify the clusters and makes it less sensitive to outliers.

	Running time is O(''n''<sup>2</sup> log ''n''), making it rather expensive, and [[Computational complexity theory\|space complexity]] is O(''n'').		Running time is O(''n''<sup>2</sup> log ''n''), making it rather expensive, and [[Computational complexity theory\|space complexity]] is O(''n'').

GoingBatty: v2.04 - Disambiguation help needed for one link - Middle ground, 1 to be fixed - Middle ground

2021-10-02T05:07:39Z

v2.04 - Disambiguation help needed for one link - Middle ground, 1 to be fixed - Middle ground

← Previous revision		Revision as of 05:07, 2 October 2021
Line 12:		Line 12:
	== CURE clustering algorithm ==		== CURE clustering algorithm ==

	To avoid the problems with non-uniform sized or shaped clusters, CURE employs a [[hierarchical clustering]] algorithm that adopts a [[middle ground]] between the centroid based and all point extremes. In CURE, a constant number c of well scattered points of a cluster are chosen and they are shrunk towards the centroid of the cluster by a fraction α. The scattered points after shrinking are used as representatives of the cluster. The clusters with the closest pair of representatives are the clusters that are merged at each step of CURE's hierarchical clustering algorithm. This enables CURE to correctly identify the clusters and makes it less sensitive to outliers.		To avoid the problems with non-uniform sized or shaped clusters, CURE employs a [[hierarchical clustering]] algorithm that adopts a [[middle ground]]{{Disambiguation needed\|date=October 2021}} between the centroid based and all point extremes. In CURE, a constant number c of well scattered points of a cluster are chosen and they are shrunk towards the centroid of the cluster by a fraction α. The scattered points after shrinking are used as representatives of the cluster. The clusters with the closest pair of representatives are the clusters that are merged at each step of CURE's hierarchical clustering algorithm. This enables CURE to correctly identify the clusters and makes it less sensitive to outliers.

	Running time is O(''n''<sup>2</sup> log ''n''), making it rather expensive, and [[Computational complexity theory\|space complexity]] is O(''n'').		Running time is O(''n''<sup>2</sup> log ''n''), making it rather expensive, and [[Computational complexity theory\|space complexity]] is O(''n'').

Hooman Mallahzadeh: Collapsing sidebar.

2021-07-12T04:55:12Z

Collapsing sidebar.

← Previous revision		Revision as of 04:55, 12 July 2021
Line 1:		Line 1:
	{{~~machine~~ learning ~~bar~~}}		{{Machine learning\|Clustering}}
	'''CURE''' (Clustering Using REpresentatives) is an efficient [[data clustering]] algorithm for large [[database]]s{{citation needed\|date=May 2018}}. Compared with [[K-means clustering]] it is more [[robust statistics\|robust]] to [[outlier]]s and able to identify clusters having non-spherical shapes and size variances.		'''CURE''' (Clustering Using REpresentatives) is an efficient [[data clustering]] algorithm for large [[database]]s{{citation needed\|date=May 2018}}. Compared with [[K-means clustering]] it is more [[robust statistics\|robust]] to [[outlier]]s and able to identify clusters having non-spherical shapes and size variances.

Ira Leviton: Fixed typos found with Wikipedia:Typo_Team/moss.

2020-01-26T23:07:20Z

Fixed typos found with Wikipedia:Typo_Team/moss.

← Previous revision		Revision as of 23:07, 26 January 2020
Line 24:		Line 24:
	== Pseudocode ==		== Pseudocode ==

	'''CURE(no. of points,''k'')'''		'''CURE (no. of points,''k'')'''

	Input : A set of points S		Input : A set of points S
Line 33:		Line 33:
	* All the input points are inserted into a [[Kd-tree\|k-d tree]] T		* All the input points are inserted into a [[Kd-tree\|k-d tree]] T
	* Treat each input point as separate cluster, compute u.closest for each u and then insert each cluster into the heap Q. (clusters are arranged in increasing order of distances between u and u.closest).		* Treat each input point as separate cluster, compute u.closest for each u and then insert each cluster into the heap Q. (clusters are arranged in increasing order of distances between u and u.closest).
	* While size(Q) > ''k''		* While size (Q) > ''k''
	* Remove the top element of Q(say u) and merge it with its closest cluster u.closest(say v) and compute the new representative points for the merged cluster w.		* Remove the top element of Q (say u) and merge it with its closest cluster u.closest (say v) and compute the new representative points for the merged cluster w.
	* Remove u and v from T and Q.		* Remove u and v from T and Q.
	* For all the clusters x in Q, update x.closest and relocate x		* For all the clusters x in Q, update x.closest and relocate x
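The merge loop above can be illustrated with a deliberately naive sketch. It replaces the k-d tree T and the heap Q with a brute-force search over all cluster pairs, so it does not reach the stated running time; the names `cure_naive`, `_reps` and `_rep_distance`, the default values of c and alpha, and the farthest-point heuristic for scattered points are all assumptions made for the example.

```python
import math
from itertools import combinations

def _reps(points, c, alpha):
    """Up to c scattered points (farthest-point heuristic), shrunk by alpha."""
    n = len(points)
    mean = tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))
    chosen = []
    for _ in range(min(c, n)):
        rest = [p for p in points if p not in chosen]
        if not rest:
            break
        if chosen:
            chosen.append(max(rest, key=lambda p: min(math.dist(p, q) for q in chosen)))
        else:
            chosen.append(max(rest, key=lambda p: math.dist(p, mean)))
    return [tuple(x + alpha * (m - x) for x, m in zip(p, mean)) for p in chosen]

def _rep_distance(u, v):
    """Distance between clusters = closest pair of their representatives."""
    return min(math.dist(p, q) for p in u["rep"] for q in v["rep"])

def cure_naive(points, k, c=4, alpha=0.5):
    # Each input point starts as its own cluster with itself as representative.
    clusters = [{"points": [p], "rep": [p]} for p in points]
    while len(clusters) > k:
        # Brute-force stand-in for "remove the top element of Q".
        u, v = min(combinations(clusters, 2), key=lambda pair: _rep_distance(*pair))
        merged = u["points"] + v["points"]
        w = {"points": merged, "rep": _reps(merged, c, alpha)}
        clusters = [x for x in clusters if x is not u and x is not v] + [w]
    return [x["points"] for x in clusters]

data = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
        (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
result = cure_naive(data, k=2)
```

The sketch recomputes the representatives of w from all merged points; the partitioned variant of CURE instead works from the stored representatives of u and v.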

Rotondus: + template

2018-07-07T13:33:37Z

+ template

← Previous revision		Revision as of 13:33, 7 July 2018
Line 1:		Line 1:
			{{machine learning bar}}
	'''CURE''' (Clustering Using REpresentatives) is an efficient [[data clustering]] algorithm for large [[database]]s{{citation needed\|date=May 2018}}. Compared with [[K-means clustering]] it is more [[robust statistics\|robust]] to [[outlier]]s and able to identify clusters having non-spherical shapes and size variances.		'''CURE''' (Clustering Using REpresentatives) is an efficient [[data clustering]] algorithm for large [[database]]s{{citation needed\|date=May 2018}}. Compared with [[K-means clustering]] it is more [[robust statistics\|robust]] to [[outlier]]s and able to identify clusters having non-spherical shapes and size variances.

AnomieBOT: Dating maintenance tags: {{Citation needed}}

2018-05-18T17:03:19Z

Dating maintenance tags: {{Citation needed}}

← Previous revision		Revision as of 17:03, 18 May 2018
Line 1:		Line 1:
	'''CURE''' (Clustering Using REpresentatives) is an efficient [[data clustering]] algorithm for large [[database]]s{{citation needed}}. Compared with [[K-means clustering]] it is more [[robust statistics\|robust]] to [[outlier]]s and able to identify clusters having non-spherical shapes and size variances.		'''CURE''' (Clustering Using REpresentatives) is an efficient [[data clustering]] algorithm for large [[database]]s{{citation needed\|date=May 2018}}. Compared with [[K-means clustering]] it is more [[robust statistics\|robust]] to [[outlier]]s and able to identify clusters having non-spherical shapes and size variances.

	== Drawbacks of traditional algorithms ==		== Drawbacks of traditional algorithms ==

Nbro at 16:31, 18 May 2018

2018-05-18T16:31:15Z

← Previous revision		Revision as of 16:31, 18 May 2018
Line 41:		Line 41:
	== Availability ==		== Availability ==
	* [https://github.com/annoviko/pyclustering pyclustering] open source library includes a Python and C++ implementation of CURE algorithm.		* [https://github.com/annoviko/pyclustering pyclustering] open source library includes a Python and C++ implementation of CURE algorithm.

			==See also==
			* [[k-means clustering]]
			* [[BFR algorithm]]

	== References ==		== References ==

Nbro at 16:28, 18 May 2018

2018-05-18T16:28:58Z

← Previous revision		Revision as of 16:28, 18 May 2018
Line 1:		Line 1:
	'''CURE''' (Clustering Using REpresentatives) is an efficient [[data clustering]] algorithm for large [[database]]s. Compared with [[K-means clustering]] it is more [[robust statistics\|robust]] to [[outlier]]s and able to identify clusters having non-spherical shapes and size variances.		'''CURE''' (Clustering Using REpresentatives) is an efficient [[data clustering]] algorithm for large [[database]]s{{citation needed}}. Compared with [[K-means clustering]] it is more [[robust statistics\|robust]] to [[outlier]]s and able to identify clusters having non-spherical shapes and size variances.

	== Drawbacks of traditional algorithms ==		== Drawbacks of traditional algorithms ==
