Unicode collation algorithm: Difference between revisions

Revision as of 01:39, 28 January 2024

The Unicode collation algorithm (UCA) is an algorithm defined in Unicode Technical Standard #10. It is a customizable method for producing binary sort keys from strings of text in any writing system and language that Unicode can represent. These keys can then be compared efficiently, byte by byte, to collate or sort the strings according to the rules of the language, with options for ignoring case, accents, and other distinctions.^[1]
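
The key-based comparison described above can be illustrated with a minimal sketch. It assumes a tiny hand-made table of three-level weights (`TOY_TABLE`, whose values are invented for illustration and are not taken from any Unicode data file) and a hypothetical helper `toy_sort_key`. It follows only the general shape of sort-key formation, in which non-zero weights are collected level by level with a zero separator between levels, and omits most of what the full algorithm does.

```python
import unicodedata

# Hypothetical collation element table: character -> (primary, secondary, tertiary).
# The weights are invented for this sketch; they are NOT real DUCET values.
TOY_TABLE = {
    "a": (0x0101, 0x0020, 0x0002),
    "A": (0x0101, 0x0020, 0x0008),
    "b": (0x0102, 0x0020, 0x0002),
    "B": (0x0102, 0x0020, 0x0008),
    "c": (0x0103, 0x0020, 0x0002),
    "C": (0x0103, 0x0020, 0x0008),
    "\u0301": (0x0000, 0x0024, 0x0002),  # combining acute accent: no primary weight
}


def toy_sort_key(text, levels=3):
    """Build a binary key: per-level non-zero weights, separated by 0x0000."""
    # Decompose precomposed letters so that accents become separate characters.
    elements = [TOY_TABLE[ch] for ch in unicodedata.normalize("NFD", text)]
    key = bytearray()
    for level in range(levels):
        if level > 0:
            key += b"\x00\x00"  # level separator
        for element in elements:
            weight = element[level]
            if weight != 0:  # zero weights are ignored at this level
                key += weight.to_bytes(2, "big")
    return bytes(key)


words = ["cab", "Cab", "c\u00e1b", "ca", "b"]
# Plain bytes comparison of the keys yields the collation order.
print(sorted(words, key=toy_sort_key))
# ['b', 'ca', 'cab', 'Cab', 'cáb']

# Using only the first level ignores both case and accents.
print(toy_sort_key("Cab", levels=1) == toy_sort_key("c\u00e1b", levels=1))
# True
```

In this sketch, base letters decide the order first, accents break ties next, and case breaks any remaining ties, which is why limiting the number of levels has the effect of ignoring accents or case.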

Unicode Technical Standard #10 also specifies the Default Unicode Collation Element Table (DUCET), a data file that defines a default collation ordering. The DUCET can be customized for different languages.^[1]^[2] Some such customizations can be found in the Unicode Common Locale Data Repository (CLDR).^[3]
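
The idea of customizing a default ordering for a language can be sketched as follows. The tables and names here (`DEFAULT_TABLE`, `SWEDISH_LIKE_TAILORING`, `make_key`) are hypothetical and use invented two-level weights rather than real DUCET or CLDR data. The example relies only on the well-known convention that Swedish treats "ö" as a separate letter after "z", whereas a default ordering treats it as "o" with an accent.

```python
import string

# Hypothetical default table: letter -> (primary, secondary). Invented weights.
DEFAULT_TABLE = {ch: (i + 1, 0) for i, ch in enumerate(string.ascii_lowercase)}
# By default, "ö" shares the primary weight of "o" and differs only at level two.
DEFAULT_TABLE["\u00f6"] = (DEFAULT_TABLE["o"][0], 1)

# Hypothetical tailoring: give "ö" its own primary weight after "z" (26).
SWEDISH_LIKE_TAILORING = {"\u00f6": (27, 0)}


def make_key(table):
    """Return a key function comparing all primaries first, then all secondaries."""
    def key(text):
        elements = [table[ch] for ch in text]
        primaries = tuple(p for p, _ in elements)
        secondaries = tuple(s for _, s in elements)
        return (primaries, secondaries)
    return key


default_key = make_key(DEFAULT_TABLE)
# A tailoring overrides selected entries and leaves the rest of the table as is.
tailored_key = make_key({**DEFAULT_TABLE, **SWEDISH_LIKE_TAILORING})

words = ["zon", "\u00f6l", "ost", "ol"]
print(sorted(words, key=default_key))
# ['ol', 'öl', 'ost', 'zon']
print(sorted(words, key=tailored_key))
# ['ol', 'ost', 'zon', 'öl']
```

Only the entry for "ö" changes between the two orderings; everything else is inherited from the default table, which is the general pattern that tailoring follows.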

An open-source implementation of UCA is included with the International Components for Unicode (ICU).^[4]^[5] ICU supports tailoring, and the collation tailorings from CLDR are included in ICU.^[6]^[2]

References

[1] Whistler, Ken; Scherer, Markus; Davis, Mark (2022-08-26). "UTS #10: Unicode Collation Algorithm". Unicode. Retrieved 2023-08-16.
[2] Hosken, Martin (2021-09-23). Unicode Sort Tailoring: Tutorial (PDF) (1.3 ed.). SIL Writing Systems Technology. pp. 2–3. Retrieved 2023-08-16.
[3] "CLDR Releases/Downloads". Unicode CLDR. Retrieved 2023-08-16.
[4] "ICU - International Components for Unicode". Unicode. Retrieved 2023-08-16.
[5] "Collations". SyBooks Online. Retrieved 2023-08-16.
[6] "Customization". ICU Documentation. Retrieved 2023-08-16.

External links

Unicode Collation Algorithm: Unicode Technical Standard #10
Mimer SQL Unicode Collation Charts

Tools

ICU Locale Explorer: an online demonstration of the Unicode Collation Algorithm using International Components for Unicode (not working as of 2023-08-16).
An ICU collation demo (not working as of 2023-08-16).
msort: a sort program that provides an unusual level of flexibility in defining collations and extracting keys.

@@ Line 1: / Line 1: @@
 The '''Unicode collation algorithm''' ('''UCA''') is an algorithm defined in Unicode Technical Report #10, which is a customizable method to produce binary keys from [[String (computer science)|strings]] representing text in any [[writing system]] and [[language]] that can be represented with [[Unicode]]. These keys can then be efficiently byte-by-byte compared in order to [[collate]] or sort them according to the rules of the language, with options for ignoring case, accents, etc.<ref name=":0">{{Cite web |last=Whistler |first=Ken |last2=Scherer |first2=Markus |last3=Davis |first3=Mark |author-link3=Mark Davis (Unicode) |date=2022-08-26 |title=UTS #10: Unicode Collation Algorithm |url=https://www.unicode.org/reports/tr10/ |access-date=2023-08-16 |website=[[Unicode]]}}</ref>
-Unicode Technical Report #10 also specifies the ''Default Unicode Collation Element Table'' (DUCET), this data file specifies a default collation ordering, the DUCET is customizable for different languages.<ref name=":0" /><ref name=":1">{{Cite book |last=Hosken |first=Martin |url=https://scriptsource.org/cms/scripts/render_download.php?format=file&media_id=..%2Fsites%2Fs%2Fmedia%2Fdatabase%2Fssproto%2Fentries%2Fpn%2Frn%2Fpnrnlhkrq9_sort_tutorial.pdf&filename=sort_tutorial.pdf |title=Unicode Sort Tailoring: Tutorial |date=2021-09-23 |publisher=[[SIL International|SIL Writing Systems Technology]] |edition=1.3 |pages=2-3 |format=PDF |access-date=2023-08-16}}</ref> Some such customizations can be found in the Unicode [[Common Locale Data Repository]] (CLDR).<ref>{{Cite web |title=CLDR Releases/Downloads |url=https://cldr.unicode.org/index/downloads |access-date=2023-08-16 |website=[[Common Locale Data Repository|Unicode CLDR]] |language=}}</ref>
+Unicode Technical Report #10 also specifies the ''Default Unicode Collation Element Table'' (DUCET), this data file specifies a default collation ordering, the DUCET is customizable for different languages.<ref name=":0" /><ref name=":1">{{Cite book |last=Hosken |first=Martin |url=https://scriptsource.org/cms/scripts/render_download.php?format=file&media_id=..%2Fsites%2Fs%2Fmedia%2Fdatabase%2Fssproto%2Fentries%2Fpn%2Frn%2Fpnrnlhkrq9_sort_tutorial.pdf&filename=sort_tutorial.pdf |title=Unicode Sort Tailoring: Tutorial |date=2021-09-23 |publisher=[[SIL International|SIL Writing Systems Technology]] |edition=1.3 |pages=2–3 |format=PDF |access-date=2023-08-16}}</ref> Some such customizations can be found in the Unicode [[Common Locale Data Repository]] (CLDR).<ref>{{Cite web |title=CLDR Releases/Downloads |url=https://cldr.unicode.org/index/downloads |access-date=2023-08-16 |website=[[Common Locale Data Repository|Unicode CLDR]] |language=}}</ref>
 An open source implementation of UCA is included with the [[International Components for Unicode]], ICU.<ref>{{Cite web |title=ICU - International Components for Unicode |url=https://icu.unicode.org/home |access-date=2023-08-16 |website=[[Unicode]]}}</ref><ref>{{Cite web |title=Collations |url=https://infocenter.sybase.com/help/index.jsp?topic=/com.sybase.help.sqlanywhere.12.0.1/dbadmin/natlang-s-7003956.html |access-date=2023-08-16 |website=SyBooks Online}}</ref> ICU supports tailoring, and the collation tailorings from CLDR are included in ICU.<ref>{{Cite web |title=Customization |url=https://unicode-org.github.io/icu/userguide/collation/customization/ |access-date=2023-08-16 |website=ICU Documentation |language=}}</ref><ref name=":1" />
-== References ==
-<references />
 ==See also==
@@ Line 13: / Line 10: @@
 * [[European ordering rules]] (EOR)
 * [[Common Locale Data Repository]] (CLDR)
+== References ==
+<references />
 ==External links==
@@ Line 28: / Line 28: @@
 [[Category:Unicode algorithms|Collation]]
 [[Category:Collation]]
 {{algorithm-stub}}