By default, Han characters map to collation elements with computed primary weights. The set of those characters is hardcoded and must be updated whenever new Han characters are assigned in Unicode. The hardcoded set must also be kept synchronized between the FractionalUCA.txt generator code and ICU's runtime code. It would be much better to make this fully data-driven.
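To illustrate what "computed primary weights" means and why the character set matters, here is a minimal sketch of the implicit-weight derivation as described in the UCA specification (UTS #10) for UCA 6.x. It is not ICU's code, and ICU's fractional weights in FractionalUCA.txt differ from these 16-bit values. The function name and the `is_unified_ideograph` predicate are hypothetical; the predicate stands in for whatever data source supplies the set.

```python
# Sketch of UCA (UTS #10) implicit primary weights, assuming UCA 6.x rules.
# Not ICU code: names are hypothetical, and ICU uses fractional weights instead.

# Block=CJK_Unified_Ideographs and Block=CJK_Compatibility_Ideographs
CORE_HAN_BLOCKS = [(0x4E00, 0x9FFF), (0xF900, 0xFAFF)]


def implicit_primary(cp, is_unified_ideograph):
    """Return the pair of implicit primary weights (AAAA, BBBB) for cp.

    is_unified_ideograph is a caller-supplied predicate, so the Han set
    comes from data rather than from a hardcoded list.
    """
    in_core_block = any(lo <= cp <= hi for lo, hi in CORE_HAN_BLOCKS)
    if is_unified_ideograph(cp) and in_core_block:
        base = 0xFB40
    elif is_unified_ideograph(cp):
        base = 0xFB80  # CJK extensions sort after the core blocks
    else:
        base = 0xFBC0  # any other code point without an explicit mapping
    aaaa = base + (cp >> 15)
    bbbb = (cp & 0x7FFF) | 0x8000
    return aaaa, bbbb
```

If the predicate disagrees between the generator and the runtime, a newly assigned Han character falls into the wrong base and sorts in a different place on each side, which is the synchronization problem described above.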
In my collation prototype's builder, I use the `[:Unified_Ideograph:]` set, but that still depends on the UCD and UCA data being in sync, and it complicates bootstrapping for a version update.
I propose that we add data to FractionalUCA.txt that lists the `[:Unified_Ideograph:]` set. The list could be in collation order, or it could be in code point order, in which case the parser/builder would put CJK Extension A after Block=CJK_Unified_Ideographs and Block=CJK_Compatibility_Ideographs.
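For the code point order option, the reordering in the parser/builder could look roughly like the following sketch. It is only an illustration under stated assumptions: the function name and the range-list input format are hypothetical (no file syntax is implied), each range is assumed not to straddle a block boundary, and the sample ranges are the Unicode 6.3 values.

```python
# Hypothetical sketch: reorder Unified_Ideograph ranges from code point order
# into collation order. Not ICU code; input format and names are assumptions.

# Block=CJK_Unified_Ideographs and Block=CJK_Compatibility_Ideographs
CORE_HAN_BLOCKS = [(0x4E00, 0x9FFF), (0xF900, 0xFAFF)]


def han_collation_order(ranges):
    """Take inclusive (start, end) ranges and return them in collation order.

    Ranges in the two core blocks come first, then all other ranges
    (Extension A and later extensions), each group in code point order.
    Assumes no range straddles a block boundary.
    """
    core, extensions = [], []
    for start, end in ranges:
        in_core = any(lo <= start <= hi for lo, hi in CORE_HAN_BLOCKS)
        (core if in_core else extensions).append((start, end))
    return sorted(core) + sorted(extensions)


# Sample data (Unicode 6.3), listed in code point order.
sample = [
    (0x3400, 0x4DB5),    # CJK Extension A
    (0x4E00, 0x9FCC),    # CJK_Unified_Ideographs
    (0xFA0E, 0xFA0F),    # unified ideographs in CJK_Compatibility_Ideographs
    (0x20000, 0x2A6D6),  # CJK Extension B
]
ordered = han_collation_order(sample)
```

Listing the data in collation order in the file would make this step unnecessary, at the cost of a less obvious ordering for human readers.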
This would best be done together with, or after, .
The Unified_Ideograph data should precede the regular mappings so that the Han character weights are established before they are referenced in mappings involving decompositions to Han characters.
This is what this might look like, in collation order:
For the parser, it would be nice if there were terminating syntax other than a regular mapping. The simplest might be to print a single, long line like
This will go into CLDR 24, with data for UCA 6.3.