Proposal to add `jyutping` as a new collation type

Description

I would like to propose adding the value jyutping (Cantonese romanization) to the list of collation types, together with the Cantonese pronunciations of Chinese (Han) characters in `collation/zh` and `collation/zh-Hant`.

Background

I have been informed by Dr Ken Lunde, the Chair of the CJK & Unihan Working Group and the Convenor of IRG, that there is a plan to raise the status of the `kCantonese` Unihan property from provisional to informative. While there is already a pinyin collation type that utilizes the informative `kMandarin` property, no dedicated value currently exists for Cantonese phonetic collation.

Furthermore, jyutping is already a registered BCP 47 variant subtag.

The addition of jyutping as a new collation type will streamline searching and sorting Chinese text by its Cantonese pronunciation, eliminating the need for users to be familiar with Mandarin pronunciation or to count the strokes of each character. This will benefit software applications and databases that handle Chinese content, as well as Cantonese speakers in Hong Kong, Macau, Guangdong province and communities worldwide.

Proposal

Once the kCantonese property values have been reviewed by the team led by Prof HOU Xingquan, we will liaise with CLDR-TC to elevate the property status to informative. At that time, the data from the Unihan database should be incorporated into CLDR as the jyutping collation.

That is all.

Activity

graphemecluster December 15, 2024 at 11:33 PM

Indeed, specifying the jyutping variant subtag and u-co-jyutping together is meaningless (since u-co-jyutping sorts Han characters, not romanization); I mentioned the subtag only as an FYI and for comparison.


In Hong Kong and neighbouring regions, it is a common practice to read aloud Written Chinese (as well as Literary Chinese) passages using Cantonese readings of individual characters, especially as a teaching methodology. Therefore, IMO, it might be advantageous to incorporate this collation type into the ancestor zh macrolanguage, enabling its use across all individual languages (particularly yue and cmn) under it.

Markus Scherer December 5, 2024 at 10:44 PM

sgtm – I think what we need:

  • kCantonese data in good shape / informative status

  • sorting rules for the romanized form, if it differs from the default Latin-script sort order – see https://cldr.unicode.org/index/cldr-spec/collation-guidelines

  • a subset of the Han characters that are commonly used in Cantonese; most implementations don’t carry pinyin/stroke/zhuyin/… collation tailorings for every one of the 90000+ Han characters, but for a subset of maybe 20000 (fewer would be nicer) – see the existing collation/zh CLDR data for examples – see also kIICore for a possible starting point, and maybe add or remove characters from there
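
To make the second and third points concrete, here is a minimal Python sketch of how a Jyutping ordering could be derived for a restricted character set. Everything in it is an assumption for illustration: the readings are a tiny hand-entered sample rather than data extracted from Unihan, `SAMPLE_SUBSET` is a hypothetical stand-in for a kIICore-style subset, and the sort key (syllable in plain ASCII order, then tone number, then code point) is only one possible rule, not a decided one.

```python
import re

# Illustrative sample only: hand-entered Jyutping readings,
# NOT extracted from the Unihan kCantonese property.
SAMPLE_READINGS = {
    "一": "jat1",
    "二": "ji6",
    "三": "saam1",
    "人": "jan4",
    "大": "daai6",
    "水": "seoi2",
}

# Hypothetical stand-in for a "commonly used" subset (e.g. one seeded from kIICore).
SAMPLE_SUBSET = {"一", "二", "三", "人", "大"}

# A Jyutping syllable is assumed to be lowercase letters followed by a tone digit 1-6.
_JYUTPING = re.compile(r"^([a-z]+)([1-6])$")


def jyutping_key(char, readings):
    """Return a sort key (syllable, tone, code point) for one character."""
    match = _JYUTPING.match(readings[char])
    if match is None:
        raise ValueError(f"unexpected reading for {char!r}: {readings[char]!r}")
    syllable, tone = match.groups()
    # The code point is a tie-breaker so that homophones sort deterministically.
    return (syllable, int(tone), ord(char))


def ordered_subset(readings, subset):
    """Sort the characters that have a reading and belong to the subset."""
    chars = [c for c in readings if c in subset]
    return sorted(chars, key=lambda c: jyutping_key(c, readings))


if __name__ == "__main__":
    print("".join(ordered_subset(SAMPLE_READINGS, SAMPLE_SUBSET)))
```

A real tool would also have to pick one reading for characters with several, and turn the resulting order into collation rules; neither step is attempted here.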

Mingfei Lau December 5, 2024 at 5:42 PM
Edited

It is difficult to estimate how commonly this sort order is used because no system currently supports sorting by Jyutping (which is exactly why we are making this proposal). We could use the DAU of Jyutping keyboards on different platforms (Gboard, iOS and macOS, Windows Jyutping IMEs, etc.) as a proxy; these add up to at least 500k daily users worldwide (a very rough estimate, and the sources of this number are private). Is this the number that we need to proceed?

yue-u-co-jyutping SGTM. Is there anything we can do to facilitate this proposal?

Markus Scherer December 1, 2024 at 3:22 AM

Interesting. This would be another large tailoring, and we would need to look into a reasonable subset of Han characters for normal use. And write tools code to derive the collation tailoring from the kCantonese data.

Is this sort order commonly used?

We don’t need a variant subtag for collation tailorings; we have a keyword for that. This would be yue-u-co-jyutping, similar to zh-u-co-stroke and de-u-co-phonebk.
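
As a rough illustration of the keyword form, the Python sketch below pulls the `co` value out of a locale identifier. It is a simplified reading of the `-u-` extension syntax, not a full BCP 47 parser, and the helper name `collation_type` is made up for this example. Note that `jyutping` is the value being proposed here, not one that is currently defined.

```python
def collation_type(tag):
    """Return the value of the `co` keyword in a locale tag, or None if absent.

    Simplified sketch: it does not validate the tag and ignores
    private-use (`-x-`) sequences.
    """
    subtags = tag.lower().split("-")
    if "u" not in subtags:
        return None
    i = subtags.index("u") + 1
    # Walk the Unicode extension until the next singleton (1-character subtag).
    while i < len(subtags) and len(subtags[i]) != 1:
        if subtags[i] == "co":
            # A keyword value is 3 to 8 characters; anything else means no value.
            if i + 1 < len(subtags) and 3 <= len(subtags[i + 1]) <= 8:
                return subtags[i + 1]
            return None
        i += 1
    return None


if __name__ == "__main__":
    for tag in ("yue-u-co-jyutping", "zh-u-co-stroke", "de-u-co-phonebk", "zh-Hant"):
        print(tag, "->", collation_type(tag))
```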

Mark Davis November 30, 2024 at 12:41 AM

That sounds like a useful addition, thanks.

Details

Created November 29, 2024 at 11:55 PM
Updated December 15, 2024 at 11:33 PM