USpoof uses NFKD, should be NFD

Description

According to confusables.txt version 2.1 (shipped with ICU 4.6), U+017F (ſ) should be treated as confusable with U+0066 (f).

Currently, USpoof normalizes all input to NFKD, not NFD, before applying the confusable mapping. UTS 39 specifies NFD, not NFKD:

http://unicode.org/reports/tr39/

To see whether two strings X and Y are confusable according to a given table (abbreviated as X ≅ Y), an implementation uses a transform of X called a skeleton(X) defined by:

  1. Converting X to NFD format, as described in [UAX15].
  2. Successively mapping each source character in X to the target string according to the specified data table.
  3. Reapplying NFD.
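
The three steps quoted above can be sketched in a few lines of Python. This is only an illustration of the algorithm as quoted, not USpoof's implementation: `SAMPLE_TABLE` is a hypothetical one-entry stand-in for confusables.txt, and the names `skeleton` and `confusable` are chosen here for readability.

```python
import unicodedata

# Hypothetical one-entry excerpt standing in for confusables.txt (assumption).
SAMPLE_TABLE = {"\u017f": "f"}  # U+017F LATIN SMALL LETTER LONG S -> "f"


def skeleton(text, table=SAMPLE_TABLE):
    """Compute skeleton(X) following the three quoted steps."""
    # Step 1: convert X to NFD.
    decomposed = unicodedata.normalize("NFD", text)
    # Step 2: map each source character to its target string, if it has one.
    mapped = "".join(table.get(ch, ch) for ch in decomposed)
    # Step 3: reapply NFD.
    return unicodedata.normalize("NFD", mapped)


def confusable(x, y, table=SAMPLE_TABLE):
    """X and Y are confusable under the table when their skeletons are equal."""
    return skeleton(x, table) == skeleton(y, table)
```

With this sample table, `confusable("\u017f", "f")` holds because NFD leaves U+017F intact, so the table entry still applies in step 2.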


Because USpoof normalizes to NFKD, U+017F is normalized to "s" before the confusable mapping is applied, and thus its skeleton differs from that of "f". See the attached test case, which reproduces the issue against ICU 4.6.
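
The normalization difference can be checked independently of ICU. The snippet below is a small sketch using Python's standard `unicodedata` module rather than the attached test case; the variable names are illustrative only.

```python
import unicodedata

LONG_S = "\u017f"  # U+017F LATIN SMALL LETTER LONG S

# NFD applies canonical decompositions only; U+017F has none, so it is unchanged.
nfd = unicodedata.normalize("NFD", LONG_S)

# NFKD also applies compatibility decompositions, which fold U+017F to "s".
nfkd = unicodedata.normalize("NFKD", LONG_S)

print(repr(nfd), repr(nfkd))
```

Once NFKD has replaced the character with "s", a confusables entry keyed on U+017F can no longer match.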

I made a patch to switch USpoof to NFD (attached), but it makes several intltest tests fail, since they assumed NFKD. If this patch looks like the right approach, should we just fix or remove the bad tests?

Activity

TracBot
June 30, 2018, 11:45 PM
Trac Comment 2 by —2011-02-24T01:30:09.088Z

I'll apply the patch and fix whatever tests break. UTS 39 changed from specifying NFKD to NFD between revision 3 and 4, and I overlooked it.

TracBot
June 30, 2018, 11:45 PM
Trac Comment 6 by —2016-10-05T23:13:36.787Z

Milestone 4.7.1 deleted

Fixed

Assignee

Andy Heninger

Reporter

TracBot

Components

Labels

None

Reviewer

None

Priority

assess

Time Needed

None

Fix versions