Proposal to add `jyutping` as a new collation type
Description
Activity
graphemecluster December 15, 2024 at 11:33 PM
Indeed, specifying the `jyutping` variant subtag and `u-co-jyutping` together is meaningless (since `u-co-jyutping` sorts Han characters, not romanization); I mentioned it only as an FYI and for comparison.
In Hong Kong and neighbouring regions, it is a common practice to read aloud Written Chinese (as well as Literary Chinese) passages using Cantonese readings of individual characters, especially as a teaching methodology. Therefore, IMO, it might be advantageous to incorporate this collation type into the ancestor `zh` macrolanguage, enabling its use across all individual languages (particularly `yue` and `cmn`) under it.
Markus Scherer December 5, 2024 at 10:44 PM
sgtm – I think what we need:
kCantonese data in good shape / informative status
sorting rules for the romanized form, if it differs from the default Latin-script sort order – see https://cldr.unicode.org/index/cldr-spec/collation-guidelines
a subset of the Han characters that are commonly used in Cantonese; most implementations don’t carry pinyin/stroke/zhuyin/… collation tailorings for every one of the 90000+ Han characters, but for a subset of maybe 20000 (fewer would be nicer). See the existing collation/zh CLDR data for examples, and kIICore for a possible starting point, adding or removing characters from there as needed.
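To make the last two points concrete, here is a minimal sketch of how a Jyutping ordering over a restricted character set might be computed. It is not the CLDR tooling: the names `SAMPLE_READINGS`, `COMMON_SUBSET`, `jyutping_key` and `order_by_jyutping` are hypothetical, the subset is an arbitrary stand-in for something like kIICore, and the ordering rule (plain alphabetical syllable, then tone number 1 to 6, then code point) is only an assumption, since the thread leaves the romanization sort rules open.

```python
import re

# Illustrative character -> Jyutping reading pairs (one reading each).
# Real input would come from the Unihan kCantonese property.
SAMPLE_READINGS = {
    "一": "jat1", "二": "ji6", "三": "saam1", "人": "jan4", "十": "sap6",
    "山": "saan1", "詩": "si1", "事": "si6", "港": "gong2",
}

# Hypothetical "commonly used" subset; 港 is left out purely to show filtering.
COMMON_SUBSET = set("一二三人十山詩事")

def jyutping_key(reading):
    """Split e.g. 'saam1' into ('saam', 1) so syllable sorts before tone."""
    match = re.fullmatch(r"([a-z]+)([1-6])", reading)
    if match is None:
        raise ValueError(f"not a Jyutping syllable: {reading!r}")
    return match.group(1), int(match.group(2))

def order_by_jyutping(readings, subset):
    """Return the subset's characters in assumed Jyutping order."""
    chars = [c for c in readings if c in subset]
    # Tie-break on the character itself (code point order) for stability.
    return sorted(chars, key=lambda c: (*jyutping_key(readings[c]), c))

ordered = order_by_jyutping(SAMPLE_READINGS, COMMON_SUBSET)
print("".join(ordered))
```

A real tailoring would also have to decide how to treat characters with several readings, which this sketch sidesteps by giving each character exactly one.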
Mingfei Lau December 5, 2024 at 5:42 PM (edited)
It is difficult to estimate how commonly this sort order is used because currently no system supports sorting by Jyutping (which is exactly why we are making this proposal). We may use the DAU of Jyutping keyboards on different platforms (Gboard, iOS and macOS, the Windows Jyutping IME, etc.) as a proxy, which adds up to at least 500k daily users worldwide (a very rough estimate, and the sources of this number are private). Is this the number that we need to proceed?
yue-u-co-jyutping
SGTM. Is there anything we can do to facilitate this proposal?
Markus Scherer December 1, 2024 at 3:22 AM
Interesting. This would be another large tailoring, and we would need to look into a reasonable subset of Han characters for normal use. We would also need to write tools code to derive the collation tailoring from the kCantonese data.
Is this sort order commonly used?
We don’t need a variant subtag for collation tailorings; we have a keyword for that. This would be yue-u-co-jyutping, similar to zh-u-co-stroke and de-u-co-phonebk.
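As a rough illustration of the difference between a variant subtag and the collation keyword, the sketch below pulls the `co` value out of the `-u-` extension of a locale identifier. The helper name `collation_type` is hypothetical, the parsing is deliberately simplified (it ignores private-use `-x-` sequences and does no validation), and `yue-u-co-jyutping` is the identifier proposed in this ticket, not one that existing CLDR data supports.

```python
def collation_type(tag):
    """Return the value of the 'co' keyword in the -u- extension, or None."""
    subtags = tag.lower().split("-")
    if "u" not in subtags:
        return None
    extension = subtags[subtags.index("u") + 1:]
    # The extension ends at the next single-letter (singleton) subtag.
    for i, subtag in enumerate(extension):
        if len(subtag) == 1:
            extension = extension[:i]
            break
    for i, subtag in enumerate(extension):
        # Keys are two characters long; type values are three to eight.
        if subtag == "co" and i + 1 < len(extension) and len(extension[i + 1]) > 2:
            return extension[i + 1]
    return None

for tag in ("yue-u-co-jyutping", "zh-u-co-stroke", "de-u-co-phonebk", "yue-jyutping"):
    print(tag, "->", collation_type(tag))
```

In the last tag, `jyutping` is a variant subtag describing the romanization of the text itself, so no collation type is found.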
Mark Davis November 30, 2024 at 12:41 AM
That sounds like a useful addition, thanks.
I would like to propose the value `jyutping` (Cantonese romanization) as a collation type to be added to the list of collation types, as well as the addition of Cantonese pronunciations of Chinese (Han) characters to `collation/zh` and `collation/zh-Hant`.
Background
I have been informed by Dr Ken Lunde, the Chair of the CJK & Unihan Working Group and the Convenor of IRG, that there is a plan to raise the status of the `kCantonese` Unihan property from provisional to informative. While there is already a `pinyin` collation type that utilizes the informative `kMandarin` property, currently no dedicated value exists for Cantonese phonetic collation.
Furthermore, `jyutping` is already a registered BCP 47 variant subtag.
The addition of `jyutping` as a new collation type will streamline searching Chinese text by its Cantonese pronunciation, eliminating the need for users to be familiar with Mandarin pronunciation or to count the number of strokes in the characters. This will benefit software applications and databases that handle Chinese content, as well as Cantonese speakers in Hong Kong, Macau, Guangdong province and communities worldwide.
Proposal
Once the `kCantonese` property values have been reviewed by the team led by Prof HOU Xingquan, we will liaise with CLDR-TC to elevate the property status to informative. At that time, these data from the Unihan database should be incorporated into CLDR as the `jyutping` collation.
That is all.
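As a sketch of the first step of such an incorporation, the snippet below reads `kCantonese` values from text in the tab-separated layout of the Unihan data files (code point, property name, value). The function name `parse_kcantonese` and the embedded `SAMPLE` lines are illustrative assumptions, and this is not the CLDR tooling; a real run would read the Unihan readings file instead of an inline string.

```python
# A few lines in the Unihan layout: code point, property, value (tab-separated).
SAMPLE = (
    "# comment lines start with '#'\n"
    "U+4E00\tkCantonese\tjat1\n"
    "U+4E00\tkMandarin\tyī\n"
    "U+4E8C\tkCantonese\tji6\n"
    "U+4E09\tkCantonese\tsaam1\n"
)

def parse_kcantonese(text):
    """Map each character to its list of kCantonese readings."""
    readings = {}
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        codepoint, field, value = line.split("\t", 2)
        if field != "kCantonese":
            continue  # ignore other properties such as kMandarin
        char = chr(int(codepoint[2:], 16))  # "U+4E00" -> "一"
        # A value may hold several space-separated readings.
        readings[char] = value.split()
    return readings

print(parse_kcantonese(SAMPLE))
```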