Unicode collation algorithm
Revision as of 16:40, 18 October 2024
The Unicode collation algorithm (UCA) is a customizable method, defined in Unicode Technical Standard #10, for producing binary sort keys from strings of text in any writing system and language that can be represented with Unicode. These keys can then be compared efficiently, byte by byte, to collate or sort the strings according to the rules of the language, with options for ignoring case, accents, etc.[1]
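The idea of multi-level binary keys can be illustrated with a minimal Python sketch. It is not the UCA itself: the weight table is hypothetical (ASCII letters and two combining marks only, with made-up single-byte weights rather than DUCET values), and the names `MARK_WEIGHTS`, `collation_elements`, and `sort_key` are invented for this illustration. It only shows how primary (letter), secondary (accent), and tertiary (case) weights can be laid out so that a plain byte comparison yields the intended order, and how a lower strength ignores accents or case.

```python
import unicodedata

# Hypothetical secondary weights for two combining marks (not DUCET values).
MARK_WEIGHTS = {"\u0301": 2, "\u0302": 3}  # combining acute, combining circumflex


def collation_elements(text):
    """Map text to (primary, secondary, tertiary) weight triples."""
    elements = []
    # NFD splits "é" into "e" + combining acute, so marks get their own element.
    for ch in unicodedata.normalize("NFD", text):
        if ch in MARK_WEIGHTS:
            # Accents carry no primary weight, only a secondary one.
            elements.append((0, MARK_WEIGHTS[ch], 1))
        elif ch.isascii() and ch.isalpha():
            primary = ord(ch.lower()) - ord("a") + 1   # a=1 ... z=26
            tertiary = 2 if ch.isupper() else 1        # case difference
            elements.append((primary, 1, tertiary))
        # Any other character is ignored by this toy table.
    return elements


def sort_key(text, strength=3):
    """Build a byte string: all primaries, then secondaries, then tertiaries."""
    elements = collation_elements(text)
    levels = []
    for level in range(strength):
        # Zero weights are left out; a zero byte separates the levels.
        levels.append(bytes(e[level] for e in elements if e[level] != 0))
    return b"\x00".join(levels)


words = ["côte", "Cote", "coté", "cote"]
print(sorted(words, key=sort_key))  # ['cote', 'Cote', 'coté', 'côte']
```

With `strength=1` all four words produce the same key (accents and case are ignored), and with `strength=2` only the accent differences remain.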
Unicode Technical Standard #10 also specifies the Default Unicode Collation Element Table (DUCET), a data file that defines a default collation ordering. The DUCET can be customized (tailored) for different languages,[1][2] and some such tailorings can be found in the Unicode Common Locale Data Repository (CLDR).[3]
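How a tailoring overrides a default ordering can be sketched in Python as follows. This is an illustration under stated assumptions, not the DUCET or the CLDR data format: the default table is a made-up one in which accented Latin letters share the primary weight of their base letter, and the helper `make_key_function` and the table `SWEDISH_LIKE` are hypothetical. The tailoring reflects the Swedish alphabet, where å, ä, and ö are separate letters that follow z.

```python
import unicodedata


def make_key_function(tailoring=None):
    """Return a primary-level key function, optionally with overrides."""
    tailoring = tailoring or {}

    def key(text):
        weights = []
        # Lowercase and compose first so "Ä" and "A" + diaeresis both become "ä".
        for ch in unicodedata.normalize("NFC", text.lower()):
            if ch in tailoring:
                # The tailoring takes precedence over the default table.
                weights.append(tailoring[ch])
                continue
            # Toy default: an accented letter sorts with its base letter.
            base = unicodedata.normalize("NFD", ch)[0]
            if "a" <= base <= "z":
                weights.append(ord(base) - ord("a") + 1)  # a=1 ... z=26
            # Characters outside this toy table are ignored.
        return bytes(weights)

    return key


# Hypothetical tailoring: three extra letters placed after z (weight 26).
SWEDISH_LIKE = {"å": 27, "ä": 28, "ö": 29}

words = ["zon", "ärlig", "apa"]
print(sorted(words, key=make_key_function()))              # ['apa', 'ärlig', 'zon']
print(sorted(words, key=make_key_function(SWEDISH_LIKE)))  # ['apa', 'zon', 'ärlig']
```

The same input list sorts differently depending only on the table in use, which is the effect a language-specific tailoring is meant to achieve.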
An open-source implementation of the UCA is included with the International Components for Unicode (ICU).[4][5] ICU supports tailoring and includes the collation tailorings from CLDR.[6][2]
References
- Whistler, Ken; Scherer, Markus; Davis, Mark (2022-08-26). "UTS #10: Unicode Collation Algorithm". Unicode. Retrieved 2023-08-16.
- Hosken, Martin (2021-09-23). Unicode Sort Tailoring: Tutorial (PDF) (1.3 ed.). SIL Writing Systems Technology. pp. 2–3. Retrieved 2023-08-16.
- "CLDR Releases/Downloads". Unicode CLDR. Retrieved 2023-08-16.
- "ICU - International Components for Unicode". Unicode. Retrieved 2023-08-16.
- "Collations". SyBooks Online. Retrieved 2023-08-16.
- "Customization". ICU Documentation. Retrieved 2023-08-16.
External links
- Unicode Collation Algorithm: Unicode Technical Standard #10
- Mimer SQL Unicode Collation Charts
Tools
- ICU Locale Explorer: an online demonstration of the Unicode Collation Algorithm using International Components for Unicode. As of 2023-08-16, it was not working.
- An ICU collation demo. As of 2023-08-16, it was not working.
- msort: a sort program that provides an unusual level of flexibility in defining collations and extracting keys.