Automatic clustering algorithms - Revision history
Revision history for this page on the wiki · 2025-05-25T21:46:40Z · MediaWiki 1.45.0-wmf.2
https://en.wikipedia.org/w/index.php?action=history&feed=atom&title=Automatic_clustering_algorithms

Revision as of 12:02, 20 May 2025 · 140.105.167.53
https://en.wikipedia.org/w/index.php?title=Automatic_clustering_algorithms&diff=1291311071&oldid=prev

This edit appends a new section after the existing paragraph on automatically generated clustering algorithms (unchanged context, shown once):

In the automation of density-based cluster identification, research has also focused on artificially generating the algorithms themselves. For instance, an estimation of distribution algorithm (EDA) guarantees the generation of valid algorithms through a [[directed acyclic graph]] (DAG), in which nodes represent procedures (building blocks) and edges represent possible execution sequences between two nodes. The building blocks determine the EDA's alphabet, that is, the components from which any generated algorithm is assembled. In experimental results, the artificially generated clustering algorithms are compared with DBSCAN, a manually designed algorithm.<ref>{{Cite book |date=June 2012 |pages=1–7 |language=en-US |doi=10.1109/CEC.2012.6252874 |citeseerx=10.1.1.308.9977 |chapter=AutoClustering: An estimation of distribution algorithm for the automatic generation of clustering algorithms |title=2012 IEEE Congress on Evolutionary Computation |last1=Meiguins |first1=Aruanda S. G. |last2=Limao |first2=Roberto C. |last3=Meiguins |first3=Bianchi S. |last4=Junior |first4=Samuel F. S. |last5=Freitas |first5=Alex A. |isbn=978-1-4673-1509-8}}</ref>
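The generation step can be illustrated with a minimal sketch; the DAG and building-block names below are hypothetical stand-ins, not the blocks used by AutoClustering. Any walk along the edges from the start node to the end node yields a valid algorithm by construction:

<syntaxhighlight lang="python">
import random

# Toy DAG: nodes are building blocks (procedures), edges are allowed
# successors. The names are illustrative placeholders only.
DAG = {
    "start": ["normalize", "estimate_density"],
    "normalize": ["estimate_density"],
    "estimate_density": ["assign_points"],
    "assign_points": ["merge_clusters", "end"],
    "merge_clusters": ["end"],
}

def sample_algorithm(dag):
    """Random walk from 'start' to 'end'; every path is a valid algorithm."""
    node, steps = "start", []
    while node != "end":
        node = random.choice(dag[node])
        if node != "end":
            steps.append(node)
    return steps

# e.g. ['normalize', 'estimate_density', 'assign_points', 'merge_clusters']
print(sample_algorithm(DAG))
</syntaxhighlight>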
The appended section reads:

== AutoML for Clustering ==

Recent advancements in automated machine learning (AutoML) have extended to the domain of clustering, where systems are designed to automatically select preprocessing techniques, feature transformations, clustering algorithms, and validation strategies without human intervention. Unlike traditional clustering methods that rely on fixed pipelines and manual tuning, AutoML-based clustering frameworks dynamically search for the best-performing configurations based on internal clustering validation indices (CVIs) or other unsupervised metrics.

An implementation in this area is TPOT-Clustering,<ref>https://github.com/Mcamilo/tpot-clustering/tree/main</ref> an extension of the Tree-based Pipeline Optimization Tool (TPOT), which automates the process of building clustering pipelines using genetic programming. TPOT-Clustering explores combinations of data transformations, dimensionality reduction methods, clustering algorithms (e.g., k-means, DBSCAN, agglomerative clustering), and scoring functions to optimize clustering performance. It uses an evolutionary algorithm to search the space of possible pipelines, guided by internal scores such as the silhouette or Davies–Bouldin index.

AutoML for clustering is particularly useful in domains where the structure of the data is unknown and manual tuning is infeasible due to the high dimensionality or complexity of the feature space. These approaches are gaining popularity in areas such as image segmentation, customer segmentation, and bioinformatics, where unsupervised insights are critical.
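TPOT-Clustering's own interface is not reproduced here; the sketch below only illustrates the underlying idea of CVI-guided pipeline search, assuming scikit-learn components and plain enumeration in place of genetic programming:

<syntaxhighlight lang="python">
import itertools
from sklearn.cluster import AgglomerativeClustering, DBSCAN, KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

# Hypothetical, tiny search space; TPOT-Clustering evolves far richer pipelines.
scalers = [None, StandardScaler()]
reducers = [None, PCA(n_components=2)]
clusterers = [
    KMeans(n_clusters=4, n_init=10, random_state=0),
    DBSCAN(eps=0.8, min_samples=5),
    AgglomerativeClustering(n_clusters=4),
]

best_score, best_pipeline = -1.0, None
for scaler, reducer, clusterer in itertools.product(scalers, reducers, clusterers):
    Z = X
    if scaler is not None:
        Z = scaler.fit_transform(Z)
    if reducer is not None:
        Z = reducer.fit_transform(Z)
    labels = clusterer.fit_predict(Z)
    if len(set(labels)) < 2:  # silhouette needs at least two clusters
        continue
    # Internal CVI guides the search. Caveats glossed over for brevity:
    # noise points (-1) count as a cluster, and scores from differently
    # transformed feature spaces are not strictly comparable.
    score = silhouette_score(Z, labels)
    if score > best_score:
        best_score, best_pipeline = score, (scaler, reducer, clusterer)

print(best_score, best_pipeline)
</syntaxhighlight>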
class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>{{short description|Data processing algorithm}}</div></td> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>{{short description|Data processing algorithm}}</div></td> </tr> <tr> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br /></td> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br /></td> </tr> <tr> <td colspan="2" class="diff-empty diff-side-deleted"></td> <td class="diff-marker" data-marker="+"></td> <td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div>'''Automatic clustering algorithms''' are algorithms that can perform clustering without prior knowledge of data sets. In contrast with other [[cluster analysis]] techniques, automatic clustering algorithms can determine the optimal number of clusters even in the presence of noise and outlier points.&lt;ref&gt;[[Outlier]]&lt;/ref&gt;{{context needed|date=September 2021}}</div></td> </tr> <tr> <td class="diff-marker" data-marker="−"></td> <td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;"><div>== Background ==</div></td> <td colspan="2" class="diff-empty diff-side-added"></td> </tr> <tr> <td class="diff-marker" data-marker="−"></td> <td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;"><div>Clustering, a core technique in data mining and machine learning, is an unsupervised learning method that groups similar data points into clusters based on defined similarity measures, such as Euclidean distance or cosine similarity. Unlike supervised learning, which relies on labeled data to train models, clustering operates without predefined class labels, seeking to uncover natural patterns or structures within the data. A key challenge in clustering is determining the optimal number of clusters (often denoted as ''k''), as this value significantly influences the quality and interpretability of the results. 
Traditional algorithms, such as k-means, require users to specify ''k'' in advance, which can be problematic in real-world applications like market segmentation, image analysis, or bioinformatics, where the true number of clusters is unknown or data complexity obscures clear groupings.</div></td> <td colspan="2" class="diff-empty diff-side-added"></td> </tr> <tr> <td class="diff-marker" data-marker="−"></td> <td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;"><br /></td> <td colspan="2" class="diff-empty diff-side-added"></td> </tr> <tr> <td class="diff-marker" data-marker="−"></td> <td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;"><div>Automatic clustering algorithms address this challenge by autonomously estimating the number of clusters during the clustering process, eliminating the need for manual specification of ''k''. These algorithms employ techniques such as statistical criteria (e.g., Bayesian Information Criterion or Akaike Information Criterion), density-based approaches, or hierarchical splitting/merging to identify an optimal number of clusters, even in datasets with noise or outliers. By adapting to the data’s inherent structure, automatic clustering algorithms enhance the robustness and flexibility of unsupervised learning. Their ability to function without prior knowledge makes them invaluable for exploratory data analysis, large-scale data processing, and applications where human intervention is impractical.</div></td> <td colspan="2" class="diff-empty diff-side-added"></td> </tr> <tr> <td class="diff-marker" data-marker="−"></td> <td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;"><br /></td> <td colspan="2" class="diff-empty diff-side-added"></td> </tr> <tr> <td class="diff-marker" data-marker="−"></td> <td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;"><div>The evolution of automatic clustering algorithms marks a significant milestone in unsupervised learning, enabling more efficient and scalable analysis of complex datasets. 
These methods empower data-driven discovery by automating a critical aspect of the clustering process, making them essential tools in fields ranging from scientific research to commercial data analytics.</div></td> <td colspan="2" class="diff-empty diff-side-added"></td> </tr> <tr> <td class="diff-marker" data-marker="−"></td> <td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;"><br /></td> <td colspan="2" class="diff-empty diff-side-added"></td> </tr> <tr> <td class="diff-marker" data-marker="−"></td> <td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;"><div>== Types of Automatic Clustering Algorithms ==</div></td> <td colspan="2" class="diff-empty diff-side-added"></td> </tr> <tr> <td class="diff-marker" data-marker="−"></td> <td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;"><div>Automatic clustering algorithms are grouped by their approach to determining cluster numbers and grouping data. Density-based methods identify clusters as high-density regions separated by low-density areas, estimating cluster counts without predefined input. For example, DBSCAN groups points within a radius (ε) with sufficient neighbors (MinPts), marking outliers as noise, and excels with arbitrary shapes but falters with varying densities. OPTICS extends DBSCAN with hierarchical density analysis for better handling of density variations, while HDBSCAN selects stable clusters from a density hierarchy, ideal for complex datasets.</div></td> <td colspan="2" class="diff-empty diff-side-added"></td> </tr> <tr> <td class="diff-marker" data-marker="−"></td> <td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;"><br /></td> <td colspan="2" class="diff-empty diff-side-added"></td> </tr> <tr> <td class="diff-marker" data-marker="−"></td> <td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;"><div>Partitioning-based methods divide data by optimizing an objective function, automatically estimating the number of clusters (''k''). X-means extends k-means by splitting clusters and using the Bayesian Information Criterion (BIC) to select ''k'', assuming spherical clusters. 
G-means, similarly, splits clusters based on Gaussian distribution tests, fitting well-separated data.</div></td> <td colspan="2" class="diff-empty diff-side-added"></td> </tr> <tr> <td class="diff-marker" data-marker="−"></td> <td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;"><br /></td> <td colspan="2" class="diff-empty diff-side-added"></td> </tr> <tr> <td class="diff-marker" data-marker="−"></td> <td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;"><div>Hierarchical methods construct a cluster hierarchy, either merging (agglomerative) or splitting (divisive), and choose the optimal level using metrics like silhouette score. Agglomerative clustering merges clusters via linkage criteria but is computationally intensive, while BIRCH incrementally builds a tree for large datasets, balancing speed and accuracy.</div></td> <td colspan="2" class="diff-empty diff-side-added"></td> </tr> <tr> <td class="diff-marker" data-marker="−"></td> <td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;"><br /></td> <td colspan="2" class="diff-empty diff-side-added"></td> </tr> <tr> <td class="diff-marker" data-marker="−"></td> <td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;"><div>Model-based methods assume data arises from probability distributions, estimating both clusters and parameters. Gaussian Mixture Models (GMM) fit Gaussian distributions, selecting ''k'' with BIC or AIC, but struggle with non-Gaussian data. Variational Bayesian methods apply Bayesian inference for robust cluster estimation, reducing overfitting. 
Each approach suits specific data characteristics, such as cluster shape or noise levels, with selection depending on the dataset and application.</div></td> <td colspan="2" class="diff-empty diff-side-added"></td> </tr> <tr> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br /></td> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br /></td> </tr> <tr> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>== Centroid-based ==</div></td> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>== Centroid-based ==</div></td> </tr> </table> MrOllie https://en.wikipedia.org/w/index.php?title=Automatic_clustering_algorithms&diff=1290331489&oldid=prev Aasimayaz: Created a new section about automatic clustering algorithms 2025-05-14T04:25:53Z <p>Created a new section about automatic clustering algorithms</p> <table style="background-color: #fff; color: #202122;" data-mw="interface"> <col class="diff-marker" /> <col class="diff-content" /> <col class="diff-marker" /> <col class="diff-content" /> <tr class="diff-title" lang="en"> <td colspan="2" style="background-color: #fff; color: #202122; text-align: center;">← Previous revision</td> <td colspan="2" style="background-color: #fff; color: #202122; text-align: center;">Revision as of 04:25, 14 May 2025</td> </tr><tr> <td colspan="2" class="diff-lineno">Line 7:</td> <td colspan="2" class="diff-lineno">Line 7:</td> </tr> <tr> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br /></td> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br /></td> </tr> <tr> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>The evolution of automatic clustering algorithms marks a significant milestone in unsupervised learning, enabling more efficient and scalable analysis of complex datasets. 
== Types of Automatic Clustering Algorithms ==

Automatic clustering algorithms are grouped by how they determine the number of clusters and group the data. Density-based methods identify clusters as high-density regions separated by low-density areas, estimating cluster counts without predefined input. For example, DBSCAN groups points that have sufficient neighbors (MinPts) within a radius (ε), marking outliers as noise; it excels with arbitrarily shaped clusters but falters when densities vary. OPTICS extends DBSCAN with hierarchical density analysis to handle density variations better, while HDBSCAN selects stable clusters from a density hierarchy, making it well suited to complex datasets.
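A minimal DBSCAN example with scikit-learn; the parameter values are illustrative, and the point is that the number of clusters is inferred rather than supplied:

<syntaxhighlight lang="python">
# DBSCAN infers the number of clusters from density: eps is the radius,
# min_samples the neighbor threshold; points labeled -1 are noise.
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=300, noise=0.06, random_state=0)
labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)

n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print(f"clusters found: {n_clusters}, noise points: {np.sum(labels == -1)}")
</syntaxhighlight>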
Partitioning-based methods divide the data by optimizing an objective function while estimating the number of clusters (''k'') automatically. X-means extends k-means by splitting clusters and using the Bayesian Information Criterion (BIC) to select ''k'', assuming spherical clusters. G-means similarly splits clusters based on tests for Gaussian distribution and suits well-separated data.

Hierarchical methods construct a cluster hierarchy, either by merging (agglomerative) or splitting (divisive), and choose the optimal level using metrics such as the silhouette score. Agglomerative clustering merges clusters via linkage criteria but is computationally intensive, while BIRCH incrementally builds a tree for large datasets, balancing speed and accuracy.
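A minimal sketch of choosing the cut level of an agglomerative hierarchy by silhouette score, assuming scikit-learn and an arbitrary candidate range:

<syntaxhighlight lang="python">
# Scan candidate cluster counts (cut levels) and keep the silhouette-best one.
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=3, random_state=1)

scores = {}
for k in range(2, 8):
    labels = AgglomerativeClustering(n_clusters=k, linkage="ward").fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print(best_k, scores[best_k])  # typically 3 for three well-separated blobs
</syntaxhighlight>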
Model-based methods assume the data arise from probability distributions and estimate both the clusters and their parameters. Gaussian mixture models (GMMs) fit Gaussian distributions, selecting ''k'' with BIC or AIC, but struggle with non-Gaussian data. Variational Bayesian methods apply Bayesian inference for more robust cluster estimation with less overfitting.
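A minimal sketch of the BIC-based selection of ''k'' with a Gaussian mixture in scikit-learn; the candidate range is arbitrary:

<syntaxhighlight lang="python">
# Fit Gaussian mixtures over a range of component counts; keep the
# count that minimizes BIC (BIC is minimized, unlike silhouette).
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=400, centers=4, random_state=2)

bics = {}
for k in range(1, 9):
    gmm = GaussianMixture(n_components=k, n_init=3, random_state=0).fit(X)
    bics[k] = gmm.bic(X)

best_k = min(bics, key=bics.get)
print(best_k)  # typically 4
</syntaxhighlight>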
Each approach suits specific data characteristics, such as cluster shape or noise level, and the choice depends on the dataset and application.

Revision as of 04:23, 14 May 2025 · Aasimayaz: "grammatical fix in the background"
https://en.wikipedia.org/w/index.php?title=Automatic_clustering_algorithms&diff=1290331251&oldid=prev

This edit rewrites the three paragraphs of the == Background == section. The new wording:

Clustering, a core technique in data mining and machine learning, is an unsupervised learning method that groups similar data points into clusters based on defined similarity measures, such as Euclidean distance or cosine similarity. Unlike supervised learning, which relies on labeled data to train models, clustering operates without predefined class labels, seeking to uncover natural patterns or structures within the data. A key challenge in clustering is determining the optimal number of clusters (often denoted as ''k''), as this value significantly influences the quality and interpretability of the results. Traditional algorithms, such as k-means, require users to specify ''k'' in advance, which can be problematic in real-world applications like market segmentation, image analysis, or bioinformatics, where the true number of clusters is unknown or data complexity obscures clear groupings.
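The contrast described in this paragraph can be made concrete with scikit-learn: k-means needs ''k'' up front, whereas an automatic method such as HDBSCAN infers it (sklearn.cluster.HDBSCAN requires scikit-learn 1.3 or later). A minimal sketch:

<syntaxhighlight lang="python">
from sklearn.cluster import HDBSCAN, KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=3)

kmeans_labels = KMeans(n_clusters=3, n_init=10).fit_predict(X)  # k given
hdbscan_labels = HDBSCAN(min_cluster_size=10).fit_predict(X)    # k inferred

print(len(set(hdbscan_labels) - {-1}))  # number of clusters found
</syntaxhighlight>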
text-decoration: none;">clustering</del> <del style="font-weight: bold; text-decoration: none;">process.</del> <del style="font-weight: bold; text-decoration: none;">'''Automatic</del> <del style="font-weight: bold; text-decoration: none;">clustering</del> <del style="font-weight: bold; text-decoration: none;">algorithms</del>''' <del style="font-weight: bold; text-decoration: none;">are</del> algorithms <del style="font-weight: bold; text-decoration: none;">that</del> <del style="font-weight: bold; text-decoration: none;">can</del> <del style="font-weight: bold; text-decoration: none;">perform</del> <del style="font-weight: bold; text-decoration: none;">clustering</del> <del style="font-weight: bold; text-decoration: none;">without</del> <del style="font-weight: bold; text-decoration: none;">prior</del> <del style="font-weight: bold; text-decoration: none;">knowledge</del> <del style="font-weight: bold; text-decoration: none;">of</del> <del style="font-weight: bold; text-decoration: none;">data</del> <del style="font-weight: bold; text-decoration: none;">sets.</del> <del style="font-weight: bold; text-decoration: none;">In</del> <del style="font-weight: bold; text-decoration: none;">contrast</del> <del style="font-weight: bold; text-decoration: none;">with</del> <del style="font-weight: bold; text-decoration: none;">other</del> <del style="font-weight: bold; text-decoration: none;">[[cluster</del> <del style="font-weight: bold; text-decoration: none;">analysis]] techniques</del>, <del style="font-weight: bold; text-decoration: none;">automatic</del> <del style="font-weight: bold; text-decoration: none;">clustering</del> <del style="font-weight: bold; text-decoration: none;">algorithms</del> <del style="font-weight: bold; text-decoration: none;">can</del> <del style="font-weight: bold; text-decoration: none;">determine</del> <del style="font-weight: bold; text-decoration: none;">the</del> optimal number of clusters even in <del style="font-weight: bold; text-decoration: none;">the</del> <del style="font-weight: bold; text-decoration: none;">presence of</del> noise <del style="font-weight: bold; text-decoration: none;">and</del> <del style="font-weight: bold; text-decoration: none;">outlier points</del>.<del style="font-weight: bold; text-decoration: none;">&lt;ref&gt;[[Outlier]]&lt;/ref&gt;</del> <del style="font-weight: bold; text-decoration: none;">These</del> <del style="font-weight: bold; text-decoration: none;">methods</del> <del style="font-weight: bold; text-decoration: none;">incorporate</del> <del style="font-weight: bold; text-decoration: none;">strategies</del> <del style="font-weight: bold; text-decoration: none;">such</del> <del style="font-weight: bold; text-decoration: none;">as</del> <del style="font-weight: bold; text-decoration: none;">statistical model selection (e.g.</del>, <del style="font-weight: bold; text-decoration: none;">BIC</del> <del style="font-weight: bold; text-decoration: none;">or</del> <del style="font-weight: bold; text-decoration: none;">AIC),</del> <del style="font-weight: bold; text-decoration: none;">density</del> <del style="font-weight: bold; text-decoration: none;">estimation,</del> <del style="font-weight: bold; text-decoration: none;">or</del> <del style="font-weight: bold; text-decoration: none;">hierarchical</del> <del style="font-weight: bold; text-decoration: none;">merging/splitting</del> <del style="font-weight: bold; text-decoration: none;">to</del> <del style="font-weight: bold; text-decoration: none;">adaptively</del> <del 
style="font-weight: bold; text-decoration: none;">find a suitable number of clusters</del>. Their ability to <del style="font-weight: bold; text-decoration: none;">operate</del> without <del style="font-weight: bold; text-decoration: none;">manual</del> <del style="font-weight: bold; text-decoration: none;">input</del> makes them <del style="font-weight: bold; text-decoration: none;">particularly useful</del> for exploratory data analysis<del style="font-weight: bold; text-decoration: none;"> and</del> large-scale applications where <del style="font-weight: bold; text-decoration: none;">user</del> <del style="font-weight: bold; text-decoration: none;">supervision</del> is <del style="font-weight: bold; text-decoration: none;">limited</del>.</div></td> <td class="diff-marker" data-marker="+"></td> <td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div>Automatic clustering algorithms address this <ins style="font-weight: bold; text-decoration: none;">challenge</ins> by<ins style="font-weight: bold; text-decoration: none;"> autonomously</ins> estimating the number of clusters <ins style="font-weight: bold; text-decoration: none;">during</ins> <ins style="font-weight: bold; text-decoration: none;">the</ins> <ins style="font-weight: bold; text-decoration: none;">clustering</ins> <ins style="font-weight: bold; text-decoration: none;">process, eliminating</ins> the <ins style="font-weight: bold; text-decoration: none;">need</ins> <ins style="font-weight: bold; text-decoration: none;">for</ins> <ins style="font-weight: bold; text-decoration: none;">manual</ins> <ins style="font-weight: bold; text-decoration: none;">specification of</ins> ''<ins style="font-weight: bold; text-decoration: none;">k</ins>'<ins style="font-weight: bold; text-decoration: none;">'.</ins> <ins style="font-weight: bold; text-decoration: none;">These</ins> algorithms <ins style="font-weight: bold; text-decoration: none;">employ</ins> <ins style="font-weight: bold; text-decoration: none;">techniques</ins> <ins style="font-weight: bold; text-decoration: none;">such</ins> <ins style="font-weight: bold; text-decoration: none;">as</ins> <ins style="font-weight: bold; text-decoration: none;">statistical</ins> <ins style="font-weight: bold; text-decoration: none;">criteria</ins> <ins style="font-weight: bold; text-decoration: none;">(e.g.,</ins> <ins style="font-weight: bold; text-decoration: none;">Bayesian</ins> <ins style="font-weight: bold; text-decoration: none;">Information</ins> <ins style="font-weight: bold; text-decoration: none;">Criterion</ins> <ins style="font-weight: bold; text-decoration: none;">or</ins> <ins style="font-weight: bold; text-decoration: none;">Akaike</ins> <ins style="font-weight: bold; text-decoration: none;">Information</ins> <ins style="font-weight: bold; text-decoration: none;">Criterion),</ins> <ins style="font-weight: bold; text-decoration: none;">density-based</ins> <ins style="font-weight: bold; text-decoration: none;">approaches</ins>, <ins style="font-weight: bold; text-decoration: none;">or</ins> <ins style="font-weight: bold; text-decoration: none;">hierarchical</ins> <ins style="font-weight: bold; text-decoration: none;">splitting/merging</ins> <ins style="font-weight: bold; text-decoration: none;">to</ins> <ins style="font-weight: bold; text-decoration: none;">identify</ins> <ins style="font-weight: bold; text-decoration: none;">an</ins> optimal number of 
clusters<ins style="font-weight: bold; text-decoration: none;">,</ins> even in <ins style="font-weight: bold; text-decoration: none;">datasets</ins> <ins style="font-weight: bold; text-decoration: none;">with</ins> noise <ins style="font-weight: bold; text-decoration: none;">or</ins> <ins style="font-weight: bold; text-decoration: none;">outliers</ins>. <ins style="font-weight: bold; text-decoration: none;">By</ins> <ins style="font-weight: bold; text-decoration: none;">adapting</ins> <ins style="font-weight: bold; text-decoration: none;">to</ins> <ins style="font-weight: bold; text-decoration: none;">the</ins> <ins style="font-weight: bold; text-decoration: none;">data’s</ins> <ins style="font-weight: bold; text-decoration: none;">inherent</ins> <ins style="font-weight: bold; text-decoration: none;">structure</ins>, <ins style="font-weight: bold; text-decoration: none;">automatic</ins> <ins style="font-weight: bold; text-decoration: none;">clustering</ins> <ins style="font-weight: bold; text-decoration: none;">algorithms</ins> <ins style="font-weight: bold; text-decoration: none;">enhance</ins> <ins style="font-weight: bold; text-decoration: none;">the</ins> <ins style="font-weight: bold; text-decoration: none;">robustness</ins> <ins style="font-weight: bold; text-decoration: none;">and</ins> <ins style="font-weight: bold; text-decoration: none;">flexibility</ins> <ins style="font-weight: bold; text-decoration: none;">of</ins> <ins style="font-weight: bold; text-decoration: none;">unsupervised</ins> <ins style="font-weight: bold; text-decoration: none;">learning</ins>. Their ability to <ins style="font-weight: bold; text-decoration: none;">function</ins> without <ins style="font-weight: bold; text-decoration: none;">prior</ins> <ins style="font-weight: bold; text-decoration: none;">knowledge</ins> makes them <ins style="font-weight: bold; text-decoration: none;">invaluable</ins> for exploratory data analysis<ins style="font-weight: bold; text-decoration: none;">,</ins> large-scale<ins style="font-weight: bold; text-decoration: none;"> data processing, and</ins> applications where <ins style="font-weight: bold; text-decoration: none;">human</ins> <ins style="font-weight: bold; text-decoration: none;">intervention</ins> is <ins style="font-weight: bold; text-decoration: none;">impractical</ins>.</div></td> </tr> <tr> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br /></td> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br /></td> </tr> <tr> <td class="diff-marker" data-marker="−"></td> <td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;"><div>The <del style="font-weight: bold; text-decoration: none;">development</del> of automatic clustering algorithms <del style="font-weight: bold; text-decoration: none;">represents</del> <del style="font-weight: bold; text-decoration: none;">an</del> <del style="font-weight: bold; text-decoration: none;">important</del> <del style="font-weight: bold; text-decoration: none;">advancement</del> in unsupervised learning, <del 
style="font-weight: bold; text-decoration: none;">allowing for</del> more <del style="font-weight: bold; text-decoration: none;">autonomous</del> and data-driven discovery <del style="font-weight: bold; text-decoration: none;">processes</del> in <del style="font-weight: bold; text-decoration: none;">complex</del> <del style="font-weight: bold; text-decoration: none;">datasets.</del> <del style="font-weight: bold; text-decoration: none;">{{context</del> <del style="font-weight: bold; text-decoration: none;">needed|date=September</del> <del style="font-weight: bold; text-decoration: none;">2021}}</del></div></td> <td class="diff-marker" data-marker="+"></td> <td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div>The <ins style="font-weight: bold; text-decoration: none;">evolution</ins> of automatic clustering algorithms <ins style="font-weight: bold; text-decoration: none;">marks</ins> <ins style="font-weight: bold; text-decoration: none;">a</ins> <ins style="font-weight: bold; text-decoration: none;">significant</ins> <ins style="font-weight: bold; text-decoration: none;">milestone</ins> in unsupervised learning, <ins style="font-weight: bold; text-decoration: none;">enabling</ins> more <ins style="font-weight: bold; text-decoration: none;">efficient</ins> and<ins style="font-weight: bold; text-decoration: none;"> scalable analysis of complex datasets. These methods empower</ins> data-driven discovery <ins style="font-weight: bold; text-decoration: none;">by automating a critical aspect of the clustering process, making them essential tools</ins> in <ins style="font-weight: bold; text-decoration: none;">fields</ins> <ins style="font-weight: bold; text-decoration: none;">ranging</ins> <ins style="font-weight: bold; text-decoration: none;">from</ins> <ins style="font-weight: bold; text-decoration: none;">scientific</ins> <ins style="font-weight: bold; text-decoration: none;">research to commercial data analytics.</ins></div></td> </tr> <tr> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br /></td> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br /></td> </tr> <tr> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>== Centroid-based ==</div></td> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>== Centroid-based ==</div></td> </tr> </table> Aasimayaz https://en.wikipedia.org/w/index.php?title=Automatic_clustering_algorithms&diff=1290312723&oldid=prev Aasimayaz: changed the heading style 2025-05-14T01:24:57Z <p>changed the heading style</p> <table style="background-color: #fff; color: #202122;" data-mw="interface"> <col class="diff-marker" /> <col class="diff-content" /> <col 
class="diff-marker" /> <col class="diff-content" /> <tr class="diff-title" lang="en"> <td colspan="2" style="background-color: #fff; color: #202122; text-align: center;">← Previous revision</td> <td colspan="2" style="background-color: #fff; color: #202122; text-align: center;">Revision as of 01:24, 14 May 2025</td> </tr><tr> <td colspan="2" class="diff-lineno">Line 1:</td> <td colspan="2" class="diff-lineno">Line 1:</td> </tr> <tr> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>{{short description|Data processing algorithm}}</div></td> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>{{short description|Data processing algorithm}}</div></td> </tr> <tr> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br /></td> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br /></td> </tr> <tr> <td class="diff-marker" data-marker="−"></td> <td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;"><div><del style="font-weight: bold; text-decoration: none;">=</del>== <del style="font-weight: bold; text-decoration: none;">'''</del>Background<del style="font-weight: bold; text-decoration: none;">'''</del> <del style="font-weight: bold; text-decoration: none;">=</del>==</div></td> <td class="diff-marker" data-marker="+"></td> <td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div>== Background ==</div></td> </tr> <tr> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>In data mining and machine learning, clustering is an unsupervised learning technique used to group similar data points into clusters based on defined similarity metrics. Unlike supervised learning, where labeled data guides the model, clustering operates without prior knowledge of class labels, aiming instead to discover inherent groupings within the dataset. A central challenge in clustering is determining the optimal number of clusters (k). Traditional clustering algorithms like k-means require the number of clusters to be specified beforehand. 
However, in many real-world scenarios—such as customer segmentation, anomaly detection, or gene expression analysis—the appropriate value of ''k'' is not known a priori and may be highly sensitive to the structure of the data.</div></td> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>In data mining and machine learning, clustering is an unsupervised learning technique used to group similar data points into clusters based on defined similarity metrics. Unlike supervised learning, where labeled data guides the model, clustering operates without prior knowledge of class labels, aiming instead to discover inherent groupings within the dataset. A central challenge in clustering is determining the optimal number of clusters (k). Traditional clustering algorithms like k-means require the number of clusters to be specified beforehand. However, in many real-world scenarios—such as customer segmentation, anomaly detection, or gene expression analysis—the appropriate value of ''k'' is not known a priori and may be highly sensitive to the structure of the data.</div></td> </tr> <tr> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br /></td> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br /></td> </tr> </table> Aasimayaz https://en.wikipedia.org/w/index.php?title=Automatic_clustering_algorithms&diff=1290312693&oldid=prev Aasimayaz: Added a new about background to highlight the problem 2025-05-14T01:24:36Z <p>Added a new about background to highlight the problem</p> <table style="background-color: #fff; color: #202122;" data-mw="interface"> <col class="diff-marker" /> <col class="diff-content" /> <col class="diff-marker" /> <col class="diff-content" /> <tr class="diff-title" lang="en"> <td colspan="2" style="background-color: #fff; color: #202122; text-align: center;">← Previous revision</td> <td colspan="2" style="background-color: #fff; color: #202122; text-align: center;">Revision as of 01:24, 14 May 2025</td> </tr><tr> <td colspan="2" class="diff-lineno">Line 1:</td> <td colspan="2" class="diff-lineno">Line 1:</td> </tr> <tr> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>{{short description|Data processing algorithm}}</div></td> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>{{short description|Data processing algorithm}}</div></td> </tr> <tr> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br /></td> <td 
class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br /></td> </tr> <tr> <td colspan="2" class="diff-empty diff-side-deleted"></td> <td class="diff-marker" data-marker="+"></td> <td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div>=== '''Background''' ===</div></td> </tr> <tr> <td class="diff-marker" data-marker="−"></td> <td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #ffe49c; vertical-align: top; white-space: pre-wrap;"><div>'''Automatic clustering algorithms''' are algorithms that can perform clustering without prior knowledge of data sets. In contrast with other [[cluster analysis]] techniques, automatic clustering algorithms can determine the optimal number of clusters even in the presence of noise and outlier points.&lt;ref&gt;[[Outlier]]&lt;/ref&gt;{{context needed|date=September 2021}}</div></td> <td colspan="2" class="diff-empty diff-side-added"></td> </tr> <tr> <td colspan="2" class="diff-empty diff-side-deleted"></td> <td class="diff-marker" data-marker="+"></td> <td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div>In data mining and machine learning, clustering is an unsupervised learning technique used to group similar data points into clusters based on defined similarity metrics. Unlike supervised learning, where labeled data guides the model, clustering operates without prior knowledge of class labels, aiming instead to discover inherent groupings within the dataset. A central challenge in clustering is determining the optimal number of clusters (k). Traditional clustering algorithms like k-means require the number of clusters to be specified beforehand. However, in many real-world scenarios—such as customer segmentation, anomaly detection, or gene expression analysis—the appropriate value of ''k'' is not known a priori and may be highly sensitive to the structure of the data.</div></td> </tr> <tr> <td colspan="2" class="diff-empty diff-side-deleted"></td> <td class="diff-marker" data-marker="+"></td> <td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><br /></td> </tr> <tr> <td colspan="2" class="diff-empty diff-side-deleted"></td> <td class="diff-marker" data-marker="+"></td> <td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div>Automatic clustering algorithms are designed to address this limitation by estimating the number of clusters automatically as part of the clustering process. '''Automatic clustering algorithms''' are algorithms that can perform clustering without prior knowledge of data sets. 
In contrast with other [[cluster analysis]] techniques, automatic clustering algorithms can determine the optimal number of clusters even in the presence of noise and outlier points.&lt;ref&gt;[[Outlier]]&lt;/ref&gt; These methods incorporate strategies such as statistical model selection (e.g., BIC or AIC), density estimation, or hierarchical merging/splitting to adaptively find a suitable number of clusters. Their ability to operate without manual input makes them particularly useful for exploratory data analysis and large-scale applications where user supervision is limited.</div></td> </tr> <tr> <td colspan="2" class="diff-empty diff-side-deleted"></td> <td class="diff-marker" data-marker="+"></td> <td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><br /></td> </tr> <tr> <td colspan="2" class="diff-empty diff-side-deleted"></td> <td class="diff-marker" data-marker="+"></td> <td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div>The development of automatic clustering algorithms represents an important advancement in unsupervised learning, allowing for more autonomous and data-driven discovery processes in complex datasets. {{context needed|date=September 2021}}</div></td> </tr> <tr> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br /></td> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br /></td> </tr> <tr> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>== Centroid-based ==</div></td> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>== Centroid-based ==</div></td> </tr> </table> Aasimayaz https://en.wikipedia.org/w/index.php?title=Automatic_clustering_algorithms&diff=1289813445&oldid=prev Headbomb: /* Density-based */ | Altered template type. Add: isbn, title, chapter, authors 1-5. | Use this tool. Report bugs. | #UCB_Gadget 2025-05-11T01:11:00Z <p><span class="autocomment">Density-based: </span> | Altered template type. Add: isbn, title, chapter, authors 1-5. | <a href="/wiki/Wikipedia:UCB" class="mw-redirect" title="Wikipedia:UCB">Use this tool</a>. <a href="/wiki/Wikipedia:DBUG" class="mw-redirect" title="Wikipedia:DBUG">Report bugs</a>. 
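A minimal sketch of the "statistical model selection" strategy mentioned in the added text, assuming scikit-learn and a synthetic dataset (the candidate range 1–9 and all parameter values are illustrative, not from the cited sources): fit one Gaussian mixture per candidate k and keep the k with the lowest BIC.

```python
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

# Synthetic data whose true group count is hidden from the selection loop.
X, _ = make_blobs(n_samples=500, centers=4, cluster_std=0.8, random_state=0)

# Fit one Gaussian mixture per candidate k and score it with BIC;
# lower BIC means a better fit/complexity trade-off.
bic = {k: GaussianMixture(n_components=k, random_state=0).fit(X).bic(X)
       for k in range(1, 10)}
best_k = min(bic, key=bic.get)
print("BIC-selected number of clusters:", best_k)
```

AIC works the same way through the model's `.aic(X)` method; it penalizes model complexity less strongly than BIC.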
https://en.wikipedia.org/w/index.php?title=Automatic_clustering_algorithms&diff=1289813445&oldid=prev
Headbomb: /* Density-based */ | Altered template type. Add: isbn, title, chapter, authors 1-5. | Use this tool. Report bugs. | #UCB_Gadget
2025-05-11T01:11:00Z
  The density-based clustering algorithm uses autonomous machine learning that identifies patterns regarding geographical location and distance to a particular number of neighbors. It is considered autonomous because a priori knowledge on what is a cluster is not required.<ref>{{Cite web|url=http://pro.arcgis.com/en/pro-app/tool-reference/spatial-statistics/how-density-based-clustering-works.htm|title=How Density-based Clustering works—ArcGIS Pro {{!}} ArcGIS Desktop|website=pro.arcgis.com|language=en|access-date=2018-11-05}}</ref> This type of algorithm provides different methods to find clusters in the data. The fastest method is [[DBSCAN]], which uses a defined distance to differentiate between dense groups of information and sparser noise. Moreover, HDBSCAN can self-adjust by using a range of distances instead of a specified one. Lastly, the method [[OPTICS algorithm|OPTICS]] creates a reachability plot based on the distance from neighboring features to separate noise from clusters of varying density.
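All three methods named in the paragraph above have scikit-learn implementations, so a brief usage sketch is possible; the eps, min_samples, and min_cluster_size values below are arbitrary choices for a toy dataset, not recommended defaults.

```python
from sklearn.cluster import DBSCAN, HDBSCAN, OPTICS  # HDBSCAN needs scikit-learn >= 1.3
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=400, noise=0.05, random_state=0)

# DBSCAN: one fixed distance (eps) separates dense regions from sparse noise.
db = DBSCAN(eps=0.2, min_samples=5).fit(X)

# HDBSCAN: effectively sweeps a range of distances instead of a single eps.
hdb = HDBSCAN(min_cluster_size=10).fit(X)

# OPTICS: orders points by reachability distance; valleys in the
# reachability plot correspond to clusters of varying density.
opt = OPTICS(min_samples=10).fit(X)

for name, labels in [("DBSCAN", db.labels_), ("HDBSCAN", hdb.labels_),
                     ("OPTICS", opt.labels_)]:
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)  # -1 marks noise
    print(name, "found", n_clusters, "clusters")
```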
  These methods still require the user to provide the cluster center and cannot be considered automatic. The Automatic Local Density Clustering Algorithm (ALDC) is an example of the new research focused on developing automatic density-based clustering. ALDC works out local density and distance deviation of every point, thus expanding the difference between the potential cluster center and other points. This expansion allows the machine to work automatically. The machine identifies cluster centers and assigns the points that are left by their closest neighbor of higher density.
− ''<ref>{{Cite book|title=An algorithm for automatic recognition of cluster centers based on local density clustering - IEEE Conference Publication|date=May 2017 |pages=1347–1351 |language=en-US|doi=10.1109/CCDC.2017.7978726|isbn=978-1-5090-4657-7 |s2cid=23267464 }}</ref>''
+ ''<ref>{{Cite book |date=May 2017 |pages=1347–1351 |language=en-US|doi=10.1109/CCDC.2017.7978726|isbn=978-1-5090-4657-7 |s2cid=23267464 |chapter=An algorithm for automatic recognition of cluster centers based on local density clustering |title=2017 29th Chinese Control and Decision Conference (CCDC) |last1=Xuanzuo |first1=Ye |last2=Dinghao |first2=Li |last3=Xiongxiong |first3=He }}</ref>''
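Read literally, the rule described above (centers where local density and the distance to denser points are both large; remaining points inherit the label of their nearest denser neighbor) can be sketched as follows. This is an interpretive reconstruction with a fixed cutoff distance dc and a fixed number of centers, not the published ALDC implementation, which selects centers automatically.

```python
import numpy as np
from scipy.spatial.distance import cdist

def density_peak_sketch(X, dc=1.0, n_centers=3):
    """Toy density-peak clustering: rho = local density, delta = distance
    to the nearest point of higher density; large rho*delta marks centers."""
    d = cdist(X, X)
    rho = (d < dc).sum(axis=1) - 1              # neighbors within dc (minus self)
    order = np.argsort(-rho)                    # point indices, densest first
    delta = np.zeros(len(X))
    nearest_denser = np.zeros(len(X), dtype=int)
    delta[order[0]] = d[order[0]].max()         # densest point: use max distance
    nearest_denser[order[0]] = order[0]
    for rank in range(1, len(X)):
        p, denser = order[rank], order[:rank]
        j = denser[np.argmin(d[p, denser])]     # closest point of higher density
        delta[p], nearest_denser[p] = d[p, j], j
    centers = np.argsort(-(rho * delta))[:n_centers]
    labels = np.full(len(X), -1)
    labels[centers] = np.arange(n_centers)
    if labels[order[0]] == -1:                  # the densest point must seed a
        labels[order[0]] = 0                    # cluster; usually it is a center
    for p in order:                             # sweep in decreasing density, so
        if labels[p] == -1:                     # the denser neighbor is labeled
            labels[p] = labels[nearest_denser[p]]
    return labels

rng = np.random.default_rng(0)
X = rng.normal(size=(90, 2))
X[30:60] += 6; X[60:] += 12                     # three loose groups on a diagonal
print(density_peak_sketch(X, dc=2.0, n_centers=3)[:10])
```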
none;">10</del>.1<del style="font-weight: bold; text-decoration: none;">.1.308.9977</del>}}&lt;/ref&gt;</div></td> <td class="diff-marker" data-marker="+"></td> <td style="color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #a3d3ff; vertical-align: top; white-space: pre-wrap;"><div>In the automation of data density to identify clusters, research has also been focused on artificially generating the algorithms. For instance, the Estimation of Distribution Algorithms guarantees the generation of valid algorithms by the [[directed acyclic graph]] (DAG), in which nodes represent procedures (building block) and edges represent possible execution sequences between two nodes. Building Blocks determine the EDA's alphabet or, in other words, any generated algorithm. Clustering algorithms artificially generated are compared to DBSCAN, a manual algorithm, in experimental results.&lt;ref&gt;{{Cite <ins style="font-weight: bold; text-decoration: none;">book </ins>|<ins style="font-weight: bold; text-decoration: none;">date</ins>=<ins style="font-weight: bold; text-decoration: none;">June </ins>2012 <ins style="font-weight: bold; text-decoration: none;">|pages=1–7</ins> <ins style="font-weight: bold; text-decoration: none;">|language=en-US|doi=10.1109/CEC.2012.6252874|citeseerx=10.1.1.308.9977</ins> |<ins style="font-weight: bold; text-decoration: none;">chapter</ins>=AutoClustering: An estimation of distribution algorithm for the automatic generation of clustering algorithms <ins style="font-weight: bold; text-decoration: none;">|title=2012</ins> IEEE <ins style="font-weight: bold; text-decoration: none;">Congress on Evolutionary Computation</ins> |<ins style="font-weight: bold; text-decoration: none;">last1</ins>=<ins style="font-weight: bold; text-decoration: none;">Meiguins</ins> <ins style="font-weight: bold; text-decoration: none;">|first1=Aruanda S. G.</ins> |<ins style="font-weight: bold; text-decoration: none;">last2</ins>=<ins style="font-weight: bold; text-decoration: none;">Limao</ins> |<ins style="font-weight: bold; text-decoration: none;">first2</ins>=<ins style="font-weight: bold; text-decoration: none;">Roberto C. 
</ins>|<ins style="font-weight: bold; text-decoration: none;">last3</ins>=<ins style="font-weight: bold; text-decoration: none;">Meiguins |first3=Bianchi S</ins>.<ins style="font-weight: bold; text-decoration: none;"> |last4=Junior |first4=Samuel F</ins>.<ins style="font-weight: bold; text-decoration: none;"> S</ins>.<ins style="font-weight: bold; text-decoration: none;"> </ins>|<ins style="font-weight: bold; text-decoration: none;">last5</ins>=<ins style="font-weight: bold; text-decoration: none;">Freitas |first5=Alex A</ins>.<ins style="font-weight: bold; text-decoration: none;"> |isbn=978-</ins>1<ins style="font-weight: bold; text-decoration: none;">-4673-1509-8 </ins>}}&lt;/ref&gt;</div></td> </tr> <tr> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br /></td> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><br /></td> </tr> <tr> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>== References ==</div></td> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>== References ==</div></td> </tr> </table> Headbomb https://en.wikipedia.org/w/index.php?title=Automatic_clustering_algorithms&diff=1281321886&oldid=prev LooksGreatInATurtleNeck: There was a Script warning on the page from a cite journal template, "Cite journal requires |journal=", fixed by adding a journal= field & filling it in 2025-03-19T17:30:22Z <p>There was a Script warning on the page from a cite journal template, &quot;Cite journal requires |journal=&quot;, fixed by adding a journal= field &amp; filling it in</p> <table style="background-color: #fff; color: #202122;" data-mw="interface"> <col class="diff-marker" /> <col class="diff-content" /> <col class="diff-marker" /> <col class="diff-content" /> <tr class="diff-title" lang="en"> <td colspan="2" style="background-color: #fff; color: #202122; text-align: center;">← Previous revision</td> <td colspan="2" style="background-color: #fff; color: #202122; text-align: center;">Revision as of 17:30, 19 March 2025</td> </tr><tr> <td colspan="2" class="diff-lineno">Line 26:</td> <td colspan="2" class="diff-lineno">Line 26:</td> </tr> <tr> <td class="diff-marker"></td> <td style="background-color: #f8f9fa; color: #202122; font-size: 88%; border-style: solid; border-width: 1px 1px 1px 4px; border-radius: 0.33em; border-color: #eaecf0; vertical-align: top; white-space: pre-wrap;"><div>These methods still require the user to provide the cluster center and cannot be considered automatic. The Automatic Local Density Clustering Algorithm (ALDC) is an example of the new research focused on developing automatic density-based clustering. ALDC works out local density and distance deviation of every point, thus expanding the difference between the potential cluster center and other points. 
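To make the DAG encoding concrete, here is a toy rendition, far simpler than the AutoClustering system in the cited paper: each node is a building-block procedure, each edge an allowed execution sequence, and any START-to-END walk through the DAG is a valid generated algorithm. A real EDA learns a probability distribution over such walks; this sketch merely samples them uniformly and keeps the best-scoring pipeline. All block names and parameters here are illustrative assumptions.

```python
import random
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

# DAG of building blocks: node -> nodes that may legally follow it.
dag = {"START": ["std", "minmax"],
       "std": ["kmeans", "dbscan"], "minmax": ["kmeans", "dbscan"],
       "kmeans": ["END"], "dbscan": ["END"]}
run = {"std": lambda Z: StandardScaler().fit_transform(Z),
       "minmax": lambda Z: MinMaxScaler().fit_transform(Z),
       "kmeans": lambda Z: KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(Z),
       "dbscan": lambda Z: DBSCAN(eps=0.3, min_samples=5).fit_predict(Z)}

def sample_algorithm(rng):
    """Walk the DAG from START to END; every walk is a valid algorithm."""
    node, path = "START", []
    while (node := rng.choice(dag[node])) != "END":
        path.append(node)
    return path

rng, best = random.Random(0), None
for _ in range(10):
    path = sample_algorithm(rng)
    Z = X
    for block in path[:-1]:                  # preprocessing blocks
        Z = run[block](Z)
    labels = run[path[-1]](Z)                # final clustering block
    if len(set(labels)) > 1:                 # silhouette needs >= 2 labels
        score = silhouette_score(Z, labels)
        if best is None or score > best[0]:
            best = (score, path)
print("Best generated algorithm:", best)
```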
https://en.wikipedia.org/w/index.php?title=Automatic_clustering_algorithms&diff=1281321886&oldid=prev
LooksGreatInATurtleNeck: There was a Script warning on the page from a cite journal template, "Cite journal requires |journal=", fixed by adding a journal= field & filling it in
2025-03-19T17:30:22Z
− <ref>{{Cite journal|title=AutoClustering: An estimation of distribution algorithm for the automatic generation of clustering algorithms - IEEE Conference Publication|date=June 2012 |pages=1–7 |language=en-US|doi=10.1109/CEC.2012.6252874|citeseerx=10.1.1.308.9977}}</ref>
+ <ref>{{Cite journal|journal=2012 IEEE Congress on Evolutionary Computation|title=AutoClustering: An estimation of distribution algorithm for the automatic generation of clustering algorithms - IEEE Conference Publication|date=June 2012 |pages=1–7 |language=en-US|doi=10.1109/CEC.2012.6252874|citeseerx=10.1.1.308.9977}}</ref>

https://en.wikipedia.org/w/index.php?title=Automatic_clustering_algorithms&diff=1281321179&oldid=prev
LooksGreatInATurtleNeck: There was a Script warning on the page from a cite book template, "Category:CS1 maint: date and year", fixed by removing the redundant year= field as date= was already set
2025-03-19T17:24:21Z
  Automated selection of ''k'' in a [[K-means clustering|''K''-means clustering algorithm]], one of the most used centroid-based clustering algorithms, is still a major problem in machine learning. The most accepted solution to this problem is the [[Elbow method (clustering)|elbow method]]. It consists of running ''k''-means clustering to the data set with a range of values, calculating the sum of squared errors for each, and plotting them in a line chart. If the chart looks like an arm, the best value of ''k'' will be on the "elbow".<ref>{{Cite web|url=https://bl.ocks.org/rpgove/0060ff3b656618e9136b|title=Using the elbow method to determine the optimal number of clusters for k-means clustering|website=bl.ocks.org|access-date=2018-11-12}}</ref>
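As a sketch of the procedure just described (scikit-learn assumed; the dataset and the 1–9 candidate range are illustrative), the inertia_ attribute of a fitted KMeans is exactly the sum of squared errors one would plot:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=4, random_state=0)

# Sum of squared errors (inertia) per candidate k; drawn as a line chart,
# the bend ("elbow") in this curve suggests the best k.
sse = {k: KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
       for k in range(1, 10)}
for k, v in sse.items():
    print(k, round(v, 1))
```

Reading off the elbow is left to the analyst, which is part of why fully automated selection of ''k'' remains an open problem.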
  Another method that modifies the ''k''-means algorithm for automatically choosing the optimal number of clusters is the ''G''-means algorithm. It was developed from the hypothesis that a subset of the data follows a Gaussian distribution. Thus, ''k'' is increased until each ''k''-means center's data is Gaussian. This algorithm only requires the standard statistical significance level as a parameter and does not set limits for the covariance of the data.
− <ref>{{cite conference |url=https://proceedings.neurips.cc/paper/2003/file/234833147b97bb6aed53a8f4f1c7a7d8-Paper.pdf |title=Learning the k in k-means |last1=Hamerly |first1=Greg |last2=Elkan |first2=Charles |date=9 December 2003 |year=2003 |conference=Proceedings of the 16th International Conference on Neural Information Processing Systems |conference-url=https://dl.acm.org/doi/proceedings/10.5555/2981345 |editor=Sebastian Thrun |editor2=Lawrence K Saul |editor3=Bernhard H Schölkopf|publisher=MIT Press |archive-url=https://web.archive.org/web/20221016235553/https://proceedings.neurips.cc/paper/2003/file/234833147b97bb6aed53a8f4f1c7a7d8-Paper.pdf |archive-date=16 October 2022 |location=Whistler, British Columbia, Canada |pages=281–288 |access-date=3 November 2022 |quote= |language=en-us }}</ref>
+ <ref>{{cite conference |url=https://proceedings.neurips.cc/paper/2003/file/234833147b97bb6aed53a8f4f1c7a7d8-Paper.pdf |title=Learning the k in k-means |last1=Hamerly |first1=Greg |last2=Elkan |first2=Charles |date=9 December 2003 |conference=Proceedings of the 16th International Conference on Neural Information Processing Systems |conference-url=https://dl.acm.org/doi/proceedings/10.5555/2981345 |editor=Sebastian Thrun |editor2=Lawrence K Saul |editor3=Bernhard H Schölkopf|publisher=MIT Press |archive-url=https://web.archive.org/web/20221016235553/https://proceedings.neurips.cc/paper/2003/file/234833147b97bb6aed53a8f4f1c7a7d8-Paper.pdf |archive-date=16 October 2022 |location=Whistler, British Columbia, Canada |pages=281–288 |access-date=3 November 2022 |quote= |language=en-us }}</ref>
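A condensed sketch of that loop, as an interpretive reading of the description above rather than the authors' reference code: scipy's Anderson-Darling normality test stands in for the statistical test, at roughly the 5% significance level, and instead of splitting only the failing centers the sketch simply retries with one more center.

```python
import numpy as np
from scipy.stats import anderson
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

def center_is_gaussian(points, crit_index=2):      # index 2 ~ 5% level in scipy
    """Split the points in two and Anderson-Darling-test their 1-D
    projection onto the axis joining the two sub-centers."""
    if len(points) < 8:                            # too few points to test
        return True
    km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
    v = km.cluster_centers_[0] - km.cluster_centers_[1]
    proj = points @ v / (v @ v)                    # project onto split axis
    res = anderson(proj)                           # normality test
    return res.statistic < res.critical_values[crit_index]

def g_means_sketch(X, max_k=15):
    """Grow k until every center's points pass the Gaussianity test.
    (Simplification: re-run k-means with k+1 rather than splitting only
    the failing centers, as the original algorithm does.)"""
    for k in range(1, max_k + 1):
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
        if all(center_is_gaussian(X[labels == c]) for c in range(k)):
            return k
    return max_k

X, _ = make_blobs(n_samples=600, centers=3, random_state=1)
print("G-means-style estimate of k:", g_means_sketch(X))
```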
https://en.wikipedia.org/w/index.php?title=Automatic_clustering_algorithms&diff=1268056454&oldid=prev
Citation bot: Altered template type. Add: isbn, pages, date. | Use this bot. Report bugs. | Suggested by Abductive | Category:Clustering criteria | #UCB_Category 12/20
2025-01-08T00:01:12Z
− ''<ref>{{Cite journal|title=An algorithm for automatic recognition of cluster centers based on local density clustering - IEEE Conference Publication|language=en-US|doi=10.1109/CCDC.2017.7978726|s2cid=23267464 }}</ref>''
+ ''<ref>{{Cite book|title=An algorithm for automatic recognition of cluster centers based on local density clustering - IEEE Conference Publication|date=May 2017 |pages=1347–1351 |language=en-US|doi=10.1109/CCDC.2017.7978726|isbn=978-1-5090-4657-7 |s2cid=23267464 }}</ref>''
− <ref>{{Cite journal|title=AutoClustering: An estimation of distribution algorithm for the automatic generation of clustering algorithms - IEEE Conference Publication|language=en-US|doi=10.1109/CEC.2012.6252874|citeseerx=10.1.1.308.9977}}</ref>
+ <ref>{{Cite journal|title=AutoClustering: An estimation of distribution algorithm for the automatic generation of clustering algorithms - IEEE Conference Publication|date=June 2012 |pages=1–7 |language=en-US|doi=10.1109/CEC.2012.6252874|citeseerx=10.1.1.308.9977}}</ref>