
Density-based clustering validation: Difference between revisions

From Wikipedia, the free encyclopedia
m Curb Safe Charmer moved page User:Giuseppe Sabino/sandbox to Draft:Density-Based Clustering Validation: Preferred location for AfC submissions
clean up (DraftCleaner)
Line 6: Line 6:
{{User sandbox}}
{{User sandbox}}


[[File:DBCV clustering evaluation.png|thumb|500px|In each graph, an increasing level of noise is introduced to the initial data, which consist of two well-defined semicircles. As the noise increases and thus the overlap between the two groups, the value of the DBCV index progressively decreases.Image released under MIT license<ref name = felsiq>GitHub

[[File:DBCV clustering evaluation.png|thumb|500px|In each graph, an increasing level of noise is introduced to the initial data, which consist of two well-defined semicircles. As the noise increases and thus the overlap between the two groups, the value of the DBCV index progressively decreases.Image released under MIT license <ref name = felsiq>GitHub
FelSiq/DBCV Fast Density-Based Clustering Validation (DBCV) Python
FelSiq/DBCV Fast Density-Based Clustering Validation (DBCV) Python
package -- https://github.com/FelSiq/DBCV</ref>]]
package -- https://github.com/FelSiq/DBCV</ref>]]



'''Density-Based Clustering Validation (DBCV)''' is a metric designed to assess the quality of clustering solutions, particularly for density-based clustering algorithms like [[DBSCAN]], [[Mean shift]], and [[OPTICS]].
'''Density-Based Clustering Validation (DBCV)''' is a metric designed to assess the quality of clustering solutions, particularly for density-based clustering algorithms like [[DBSCAN]], [[Mean shift]], and [[OPTICS]].
This metric is particularly suited for identifying concave and nested clusters, where traditional metrics such as the [[Silhouette (clustering)|Silhouette coefficient]], [[Davies–Bouldin index]], or [[Calinski–Harabasz index]] often struggle to provide meaningful evaluations.
This metric is particularly suited for identifying concave and nested clusters, where traditional metrics such as the [[Silhouette (clustering)|Silhouette coefficient]], [[Davies–Bouldin index]], or [[Calinski–Harabasz index]] often struggle to provide meaningful evaluations.


Unlike traditional validation measures, which often rely on compact and well-separated clusters, DBCV index evaluates how well clusters are defined in terms of local density variations and structural coherence.
Unlike traditional validation measures, which often rely on compact and well-separated clusters, DBCV index evaluates how well clusters are defined in terms of local density variations and structural coherence.


This metric was introduced in 2014 by by David Moulavi and colleagues in their work <ref name = Moulavi>{{Cite
This metric was introduced in 2014 by by David Moulavi and colleagues in their work.<ref name = Moulavi>{{Cite
| last = Moulavi
| last = Moulavi
| first = Davoud
| first = Davoud
Line 27: Line 25:
| publisher = SIAM
| publisher = SIAM
| url = https://www.dbs.ifi.lmu.de/~zimek/publications/SDM2014/DBCV.pdf
| url = https://www.dbs.ifi.lmu.de/~zimek/publications/SDM2014/DBCV.pdf
}}</ref>. It utilizes density connectivity principles to quantify clustering structures, making it especially effective at detecting arbitrarily shaped clusters in concave datasets, where traditional metrics may be less reliable.
}}</ref> It utilizes density connectivity principles to quantify clustering structures, making it especially effective at detecting arbitrarily shaped clusters in concave datasets, where traditional metrics may be less reliable.


The DBCV index has been employed in bioinformatics analysis <ref name="Di Giovanni">{{Cite
The DBCV index has been employed in bioinformatics analysis,<ref name="Di Giovanni">{{Cite
| last= Di Giovanni
| last= Di Giovanni
| first= Daniele
| first= Daniele
Line 37: Line 35:
| doi = 10.3390/genes14020313
| doi = 10.3390/genes14020313
| url = https://www.mdpi.com/2073-4425/14/2/313
| url = https://www.mdpi.com/2073-4425/14/2/313
}}</ref>, ecology analysis <ref name="Poutaraud">{{Cite
}}</ref> ecology analysis,<ref name="Poutaraud">{{Cite
| last= Poutaraud
| last= Poutaraud
| first= Joachim
| first= Joachim
| year= 2024
| year= 2024
| title= Meta-Embedded Clustering (MEC): A new method for improving clustering quality in unlabeled bird sound datasets
| title= Meta-Embedded Clustering (MEC): A new method for improving clustering quality in unlabeled bird sound datasets
Line 47: Line 45:
| doi = 10.1016/j.ecoinf.2024.102687
| doi = 10.1016/j.ecoinf.2024.102687
| url = https://www.sciencedirect.com/science/article/pii/S1574954124002292
| url = https://www.sciencedirect.com/science/article/pii/S1574954124002292
}}</ref>, techno-economic analysis <ref name="Shim">{{Cite
}}</ref> techno-economic analysis,<ref name="Shim">{{Cite
| last= Shim
| last= Shim
| first= Jaehyun
| first= Jaehyun
| year= 2022
| year= 2022
| title= Techno-economic analysis of micro-grid system design through climate region clustering
| title= Techno-economic analysis of micro-grid system design through climate region clustering
Line 57: Line 55:
| doi = 10.1016/j.enconman.2022.116411
| doi = 10.1016/j.enconman.2022.116411
| url = https://www.sciencedirect.com/science/article/abs/pii/S019689042201189X
| url = https://www.sciencedirect.com/science/article/abs/pii/S019689042201189X
}}</ref>, and health informatics analysis <ref name="Martinez">{{Cite
}}</ref> and health informatics analysis<ref name="Martinez">{{Cite
| last= Martínez
| last= Martínez
| first= Rubén Yáñez
| first= Rubén Yáñez
| year= 2023
| year= 2023
| title= Spanish Corpora of tweets about COVID-19 vaccination for automatic stance detection
| title= Spanish Corpora of tweets about COVID-19 vaccination for automatic stance detection
Line 67: Line 65:
| doi = 10.1016/j.ipm.2023.103294
| doi = 10.1016/j.ipm.2023.103294
| url = https://www.sciencedirect.com/science/article/pii/S0306457323000316
| url = https://www.sciencedirect.com/science/article/pii/S0306457323000316
}}</ref> as well as in numerous other fields <ref name=Beer">{{Cite
}}</ref> as well as in numerous other fields<ref name=Beer">{{Cite
| last= Beer
| last= Beer
| first= Anna
| first= Anna
Line 75: Line 73:
| doi = 10.48550/arXiv.2503.00127
| doi = 10.48550/arXiv.2503.00127
| url = https://arxiv.org/abs/2503.00127
| url = https://arxiv.org/abs/2503.00127
}}</ref>
}}</ref>
<ref name="Veigel">{{Cite
<ref name="Veigel">{{Cite
| last= Veigel
| last= Veigel
| first= Nadja
| first= Nadja
| year= 2025
| year= 2025
| title= Content analysis of multi-annual time series of flood-related Twitter (X) data
| title= Content analysis of multi-annual time series of flood-related Twitter (X) data
Line 91: Line 89:
DBCV index evaluates clustering structures by analyzing the relationships between data points within and across clusters. Given a dataset <math>X = {x_1,x_2,...,x_n}</math>, a density-based algorithm partitions it into ''K '' clusters <math>{C_1,C_2,...,C_n}</math>. Each point belongs to a specific cluster, denoted as <math>Cluster(X_i)</math>
DBCV index evaluates clustering structures by analyzing the relationships between data points within and across clusters. Given a dataset <math>X = {x_1,x_2,...,x_n}</math>, a density-based algorithm partitions it into ''K '' clusters <math>{C_1,C_2,...,C_n}</math>. Each point belongs to a specific cluster, denoted as <math>Cluster(X_i)</math>


A key concept in DBCV index is the notion of density-connected paths<ref>{{Cite
A key concept in DBCV index is the notion of density-connected paths.<ref>{{Cite
| last = Ester
| last = Ester
| first = M.
| first = M.
Line 106: Line 104:
| doi = 10.1007/978-0-387-39940-9_605
| doi = 10.1007/978-0-387-39940-9_605
| url = https://doi.org/10.1007/978-0-387-39940-9_605
| url = https://doi.org/10.1007/978-0-387-39940-9_605
}}</ref>. Two points within the same cluster are considered density-connected if there exists a sequence of intermediate points linking them, where each consecutive pair meets a predefined density criterion. The '''density-based distance''' between two points is determined by identifying the optimal path that minimizes the maximum local reachability distance along its trajectory.
}}</ref> Two points within the same cluster are considered density-connected if there exists a sequence of intermediate points linking them, where each consecutive pair meets a predefined density criterion. The '''density-based distance''' between two points is determined by identifying the optimal path that minimizes the maximum local reachability distance along its trajectory.


DBCV index extends the [[Silhouette (clustering)|Silhouette coefficient]] by redefining cluster cohesion and separation using density-based distances:
DBCV index extends the [[Silhouette (clustering)|Silhouette coefficient]] by redefining cluster cohesion and separation using density-based distances:



* '''Within-cluster density distance''' measures how closely a point is related to other members of its cluster:
* '''Within-cluster density distance''' measures how closely a point is related to other members of its cluster:



<math>
<math>
Line 118: Line 114:
</math>
</math>


* '''Nearest-cluster density distance''' quantifies how far a point is from the closest external cluster:
* '''Nearest-cluster density distance''' quantifies how far a point is from the closest external cluster:



<math>
<math>
b_i = \min_{{C \neq C_{\text{cluster}(x_i)} \atop C \in \{C_1,\dots,C_k\}}}
b_i = \min_{{C \neq C_{\text{cluster}(x_i)} \atop C \in \{C_1,\dots,C_k\}}}
\left( \frac{1}{|C|} \sum_{x_j \in C} d_{\text{density}}(x_i, x_j) \right).
\left( \frac{1}{|C|} \sum_{x_j \in C} d_{\text{density}}(x_i, x_j) \right).
</math>
</math>



Using these measures, the '''DBCV index''' is computed as:
Using these measures, the '''DBCV index''' is computed as:
Line 140: Line 134:
* 0: Ambiguous clustering structure.
* 0: Ambiguous clustering structure.
* -1: Poorly formed clusters or incorrect assignments.
* -1: Poorly formed clusters or incorrect assignments.



By leveraging density-based distances instead of traditional [[Euclidean distance|Euclidean measures]], DBCV index provides a more robust evaluation of clustering performance in datasets with irregular or non-spherical distributions<ref name = Moulavi />
By leveraging density-based distances instead of traditional [[Euclidean distance|Euclidean measures]], DBCV index provides a more robust evaluation of clustering performance in datasets with irregular or non-spherical distributions<ref name = Moulavi />
Line 147: Line 140:
== Implementations ==
== Implementations ==


* [https://github.com/christopherjenness/DBCV Python DBCV Implementation by Christopher Jennes]
* Python DBCV Implementation by Christopher Jennes<ref>https://github.com/christopherjenness/DBCV</ref>
* Python DBCV Implementation by Felipe Silva<ref>https://github.com/FelSiq/DBCV</ref>

* R DBCV Implementation<ref>https://doi.org/10.32614/CRAN.package.DBCVindex</ref>
* [https://github.com/FelSiq/DBCV Python DBCV Implementation by Felipe Silva]

* [https://doi.org/10.32614/CRAN.package.DBCVindex R DBCV Implementation]


== See also ==
== See also ==
Line 165: Line 156:
<references/>
<references/>


[[Category:Cluster analysis]]
[[:Category:Cluster analysis]]

Revision as of 16:39, 14 April 2025


Each graph adds more noise to the initial data, which consist of two well-defined semicircles. As the noise increases, and with it the overlap between the two groups, the value of the DBCV index progressively decreases. Image released under MIT license.[1]

Density-Based Clustering Validation (DBCV) is a metric designed to assess the quality of clustering solutions, particularly those produced by density-based clustering algorithms such as DBSCAN, Mean shift, and OPTICS. It is well suited to evaluating concave and nested clusters, where traditional metrics such as the Silhouette coefficient, Davies–Bouldin index, or Calinski–Harabasz index often struggle to provide meaningful evaluations.

Unlike traditional validation measures, which often assume compact and well-separated clusters, the DBCV index evaluates how well clusters are defined in terms of local density variations and structural coherence.

This metric was introduced by Davoud Moulavi and colleagues in a paper presented at the 2014 SIAM International Conference on Data Mining.[2] It uses density connectivity principles to quantify clustering structures, making it especially effective at evaluating arbitrarily shaped clusters in concave datasets, where traditional metrics may be less reliable.

The DBCV index has been employed in bioinformatics,[3] ecology,[4] techno-economic analysis,[5] and health informatics,[6] as well as in other fields.[7][8]

Definition

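The DBCV index evaluates clustering structures by analyzing the relationships between data points within and across clusters. Given a dataset <math>X = \{x_1, x_2, \dots, x_n\}</math>, a density-based algorithm partitions it into K clusters <math>\{C_1, C_2, \dots, C_K\}</math>. Each point <math>x_i</math> belongs to a specific cluster, denoted <math>\text{cluster}(x_i)</math>.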

A key concept in the DBCV index is the notion of density-connected paths.[9] Two points within the same cluster are considered density-connected if there exists a sequence of intermediate points linking them, where each consecutive pair meets a predefined density criterion. The density-based distance between two points is found by taking, among all paths that connect them, the path whose largest local reachability distance is smallest; that largest value is the distance.
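The "smallest possible largest step" rule described above is a minimax path problem. The following is only a minimal sketch of that idea, not the procedure of any particular DBCV implementation: it assumes a symmetric matrix of pairwise reachability weights is already available (the source does not define how those weights are computed, so the matrix `w` and the function name `minimax_path_distances` are hypothetical), and it uses a Floyd–Warshall-style update in which the cost of a path is its largest edge rather than the sum of its edges.

```python
import numpy as np


def minimax_path_distances(weights):
    """All-pairs minimax path cost.

    For every pair (i, j), returns the smallest achievable value of the
    largest edge weight over all paths connecting i and j.
    `weights` is assumed to be a symmetric, non-negative square matrix.
    """
    d = np.array(weights, dtype=float)
    np.fill_diagonal(d, 0.0)
    n = d.shape[0]
    for k in range(n):
        # A path i -> k -> j costs the larger of its two legs.
        via_k = np.maximum(d[:, k][:, None], d[k, :][None, :])
        # Keep the detour through k only where it lowers the cost.
        d = np.minimum(d, via_k)
    return d


# Hypothetical weights for four points lying roughly on a chain 0-1-2-3.
w = np.array([
    [0.0, 1.0, 5.0, 9.0],
    [1.0, 0.0, 2.0, 7.0],
    [5.0, 2.0, 0.0, 3.0],
    [9.0, 7.0, 3.0, 0.0],
])
dist = minimax_path_distances(w)
```

In this toy matrix the direct weight between points 0 and 3 is 9, but the chain 0-1-2-3 never uses an edge heavier than 3, so the path-based distance between them is 3. This is why two points at opposite ends of an elongated or curved cluster can still be close in the density-based sense.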

The DBCV index extends the Silhouette coefficient by redefining cluster cohesion and separation using density-based distances:

  • Within-cluster density distance measures how closely a point is related to other members of its cluster:

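  • Nearest-cluster density distance quantifies how far a point is from the closest external cluster:

<math>
b_i = \min_{{C \neq C_{\text{cluster}(x_i)} \atop C \in \{C_1,\dots,C_K\}}}
\left( \frac{1}{|C|} \sum_{x_j \in C} d_{\text{density}}(x_i, x_j) \right).
</math>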

Using these measures, the DBCV index is computed as:

Explanation

DBCV index values range between -1 and +1:

  • +1: Strongly cohesive and well-separated clusters.
  • 0: Ambiguous clustering structure.
  • -1: Poorly formed clusters or incorrect assignments.

By using density-based distances instead of traditional Euclidean measures, the DBCV index provides a more robust evaluation of clustering performance in datasets with irregular or non-spherical distributions.[2]
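To make the -1 to +1 range concrete, here is a rough sketch of a Silhouette-style score computed from a precomputed density-distance matrix. Several parts are assumptions rather than the article's definition: the within-cluster term is taken as the mean distance to the other members of the point's own cluster, the per-point values are combined as (b - a) / max(a, b) and then averaged, and the function name `silhouette_style_score` is hypothetical. Only the nearest-cluster term follows the formula given in the Definition section (the minimum, over the other clusters, of the mean distance to that cluster).

```python
import numpy as np


def silhouette_style_score(dist, labels):
    """Average of (b_i - a_i) / max(a_i, b_i) over all points.

    dist   : square matrix of pairwise (density-based) distances
    labels : cluster label of each point; at least two clusters expected
    """
    dist = np.asarray(dist, dtype=float)
    labels = np.asarray(labels)
    clusters = np.unique(labels)
    if len(clusters) < 2:
        raise ValueError("at least two clusters are required")

    scores = []
    for i in range(len(labels)):
        own = labels == labels[i]
        own[i] = False  # assumption: a point is not compared with itself
        if not own.any():
            scores.append(0.0)  # assumption: singleton clusters score 0
            continue
        # Assumed cohesion term: mean distance to the rest of the own cluster.
        a_i = dist[i, own].mean()
        # Separation term: closest other cluster by mean distance.
        b_i = min(dist[i, labels == c].mean()
                  for c in clusters if c != labels[i])
        denom = max(a_i, b_i)
        scores.append((b_i - a_i) / denom if denom > 0 else 0.0)
    return float(np.mean(scores))


# Toy distances: points 0,1 are close, points 2,3 are close, groups are far.
d = np.array([
    [0.0, 1.0, 4.0, 4.0],
    [1.0, 0.0, 4.0, 4.0],
    [4.0, 4.0, 0.0, 1.0],
    [4.0, 4.0, 1.0, 0.0],
])
good = silhouette_style_score(d, [0, 0, 1, 1])  # labels match the groups
bad = silhouette_style_score(d, [0, 1, 0, 1])   # labels split the groups
```

With labels that match the two groups, every point is much closer to its own cluster than to the other one and the score is positive; with labels that split each group, the score turns negative, matching the interpretation of the range given above.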

Implementations

  • Python DBCV Implementation by Christopher Jenness[10]
  • Python DBCV Implementation by Felipe Silva[11]
  • R DBCV Implementation[12]

See also

References

  1. ^ GitHub FelSiq/DBCV Fast Density-Based Clustering Validation (DBCV) Python package -- https://github.com/FelSiq/DBCV
  2. ^ a b Moulavi, Davoud (2014), "Density-based clustering validation" (PDF), Proceedings of the 2014 SIAM International Conference on Data Mining, SIAM: 839–847, doi:10.1137/1.9781611973440.96
  3. ^ Di Giovanni, Daniele (2023), "Using machine learning to explore shared genetic pathways and possible endophenotypes in autism spectrum disorder", Genes, doi:10.3390/genes14020313
  4. ^ Poutaraud, Joachim (2024), "Meta-Embedded Clustering (MEC): A new method for improving clustering quality in unlabeled bird sound datasets", Ecological Informatics, Elsevier: 102687, doi:10.1016/j.ecoinf.2024.102687
  5. ^ Shim, Jaehyun (2022), "Techno-economic analysis of micro-grid system design through climate region clustering", Energy Conversion and Management, Elsevier: 116411, doi:10.1016/j.enconman.2022.116411
  6. ^ Martínez, Rubén Yáñez (2023), "Spanish Corpora of tweets about COVID-19 vaccination for automatic stance detection", Information Processing & Management, Elsevier: 103294, doi:10.1016/j.ipm.2023.103294
  7. ^ Beer, Anna (2025), "DISCO: Internal Evaluation of Density-Based Clustering", arXiv preprint arXiv:2503.00127, doi:10.48550/arXiv.2503.00127
  8. ^ Veigel, Nadja (2025), "Content analysis of multi-annual time series of flood-related Twitter (X) data", Natural Hazards and Earth System Sciences, Copernicus Publications, Göttingen, Germany: 879–891, doi:10.5194/nhess-25-879-2025
  9. ^ Ester, M. (2009), Liu, L.; Özsu, M.T. (eds.), "Density-based Clustering", Encyclopedia of Database Systems, Boston, MA: Springer, doi:10.1007/978-0-387-39940-9_605, ISBN 978-0-387-35544-3
  10. ^ https://github.com/christopherjenness/DBCV
  11. ^ https://github.com/FelSiq/DBCV
  12. ^ https://doi.org/10.32614/CRAN.package.DBCVindex

Category:Cluster analysis