Create second collation for Norwegian that doesn't treat aa as å

Description

Hi!

The collation for Norwegian was updated to treat “aa” as “å”. This is fine and correct for texts with mostly Norwegian/Scandinavian names, but on Wikipedia (which uses ICU collation) it is causing some issues. In Scandinavian names, “aa” should be treated as “å”, but in non-Scandinavian names “aa” should sort as “aa” and not as “å”. Currently, names like “Haag”, “Aachen” and “Aaron Carter” are sorted as if they were “Håg”, “Åchen” and “Åron Carter”, which is not correct. Since A and Å are at opposite ends of the Norwegian alphabet, the difference is very noticeable.

My proposal is to leave the standard collation as it is (it is fine for most purposes), but to add a second collation that omits the aa=å rules.

Activity

Markus Scherer
March 21, 2025 at 3:16 PM

The way we currently override this is to say that “Aachen” should be sorted as “A’achen”.

Unicode recommends using a specific character to break the matching of multi-character sequences in collation: the combining grapheme joiner (CGJ, U+034F).

The proposed sort order seems a bit arbitrary to me: follow Norwegian conventions except for that digraph. When the language is Norwegian, using the normal sort order and inserting CGJs in non-Norwegian, non-Danish names seems appropriate, even if in your case you have more of those than native ones.
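As a rough illustration of the CGJ approach described above, the sketch below inserts U+034F between the two letters of every “aa” pair in a string. The function name `block_aa_contraction` is made up for this example, and the code only prepares the text; it assumes the string is later handed to a collator that honors CGJ as a contraction blocker and otherwise ignores it.

```python
import re

# U+034F COMBINING GRAPHEME JOINER, the invisible character discussed above.
CGJ = "\u034f"


def block_aa_contraction(text: str) -> str:
    """Insert a CGJ after every "a" that is directly followed by another "a".

    Hypothetical helper: it only rewrites the string. Whether "aa" then
    stops sorting as "å" depends on the collator that receives the result.
    """
    return re.sub(r"a(?=a)", lambda m: m.group(0) + CGJ, text, flags=re.IGNORECASE)


if __name__ == "__main__":
    for name in ["Aachen", "Haag", "Aaron Carter", "Oslo"]:
        # repr() makes the otherwise invisible CGJ show up as \u034f.
        print(repr(block_aa_contraction(name)))
```

Applying this blindly to every title would also break the digraph in genuinely Norwegian names, so in practice it would be applied only to names known to be non-Scandinavian.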

Is there a standard or reference that could help motivate adding this sort order to CLDR?

Annemarie Apple 🍎
April 4, 2024 at 5:04 AM

Bulk moving all issues to the next version which aren't in component type: brs, charts, docs, docs-spec

Annemarie Apple 🍎
October 4, 2023 at 4:41 AM

Bulk moving all tickets which are not in component (BRS, charts, docs, docs-spec, keyboards) with status Investigate to v45

Mark Davis
May 22, 2023 at 7:02 PM

This would appear to need more than just a second collation sequence to be practical. If one is sorting “Aaron” vs. “Aarfinn” (also written “Årfinn”), one would need dictionary-based data to determine what to do, not just a second collation sequence. One might want to “preprocess” the foreign words, inserting an invisible breaking character on a word-by-word basis.
The same reasoning would appear to apply to Danish, Swedish, and perhaps other languages.
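The word-by-word preprocessing suggested above might look something like the following sketch. The word list `FOREIGN_WORDS` and the function `mark_foreign_words` are hypothetical and stand in for whatever dictionary-based data would actually be available; the breaking character is assumed to be CGJ (U+034F), and the result is assumed to be passed to a collator that treats CGJ as a contraction blocker.

```python
import re

# Assumed breaking character: U+034F COMBINING GRAPHEME JOINER.
CGJ = "\u034f"

# Hypothetical dictionary of words in which "aa" is NOT a spelling of "å".
# A real list would have to come from curated data.
FOREIGN_WORDS = {"aachen", "aaron", "haag"}

# Runs of letters (Unicode-aware): word characters minus digits and underscore.
_WORD = re.compile(r"[^\W\d_]+")


def mark_foreign_words(text: str, foreign_words=FOREIGN_WORDS) -> str:
    """Insert a CGJ inside "aa" only in words found in the foreign-word list."""

    def fix(match: re.Match) -> str:
        word = match.group(0)
        if word.casefold() not in foreign_words:
            return word  # e.g. "Aarfinn" keeps its aa=å behaviour
        return re.sub(r"a(?=a)", lambda m: m.group(0) + CGJ, word, flags=re.IGNORECASE)

    return _WORD.sub(fix, text)


if __name__ == "__main__":
    for title in ["Aaron Carter", "Aarfinn", "Den Haag"]:
        print(repr(mark_foreign_words(title)))
```

With this list, “Aaron” gets the invisible character while “Aarfinn” is left alone, which is the distinction a second collation sequence by itself could not make.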

Details

Priority

medium

Assignee

Markus Scherer

Reporter

Jon Harald Søby

Fix versions

Components

Labels

punted44, punted45, punted46, punted47, sortorder, triaged

Created November 18, 2022 at 1:56 PM

Updated March 24, 2025 at 12:36 PM