nameToASCII and nameToUnicode: limit the input length before embarking on a potentially slow operation?
Activity
Markus Scherer August 18, 2020 at 12:55 AM
I just found that in ICU C++, the encode function has a limit of MAX_CP_COUNT=200 code points.
Markus Scherer August 18, 2020 at 12:43 AM(edited)
The core of the problem is that the Punycode algorithm is n^2, and I cannot think of a way to make it faster.
The simplest solution is to pick a number out of a hat for a maximum length of strings to be encoded or decoded, and make the Punycode encode and decode functions throw an illegal-argument exception when the input is longer than that.
I just experimented with the Punycode encoding of various strings.
The minimum is 1 byte per character or code point.
For longer strings, the most I get is about 4 bytes per code point when most of them are supplementary -- or about 3.5 bytes per code point when they are sprinkled across only the BMP.
That is, for a UTF-16 string of length 1000 code units you get a Punycode string of 1000 to 3512 bytes (plus the 4-byte prefix "xn--" when used in an IDNA label).
A well-formed label is limited to 63 bytes, which means at most 59 bytes after "xn--". However, we don't have any limit so far, and people sometimes use libraries and protocols with non-standard inputs.
Something 1000-ish seems like a reasonable compromise to keep n^2 tame and users happy even with somewhat unusual inputs.
I propose to set a limit in Punycode.encode() of 1000 input code units, and in Punycode.decode() of 2000 input bytes.
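As a sketch of what such limits could look like in ICU4J (the class name, constants, and guard methods below are illustrative assumptions, not existing ICU API):

```java
// Sketch of the proposed input-length guards for Punycode encode/decode.
// ENCODE_INPUT_LIMIT, DECODE_INPUT_LIMIT, and the check methods are
// hypothetical names chosen for this example; they are not part of ICU.
final class PunycodeLimits {
    static final int ENCODE_INPUT_LIMIT = 1000;  // UTF-16 code units
    static final int DECODE_INPUT_LIMIT = 2000;  // ASCII bytes

    static void checkEncodeInput(CharSequence src) {
        if (src.length() > ENCODE_INPUT_LIMIT) {
            throw new IllegalArgumentException(
                    "Punycode encode input too long: " + src.length());
        }
    }

    static void checkDecodeInput(CharSequence src) {
        if (src.length() > DECODE_INPUT_LIMIT) {
            throw new IllegalArgumentException(
                    "Punycode decode input too long: " + src.length());
        }
    }
}
```

The idea is simply to fail fast with an illegal-argument exception before the O(n^2) work starts, rather than reporting a length error after it has already been done.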
Notes from experiments below. “100*” means 100 times the following code point or similar. Lengths are in UTF-16 code units or ASCII bytes, with numbers of code points in parentheses.
Markus Scherer August 14, 2020 at 10:12 PM
crbug 811960 comment 7 has two stack traces. The one quoted above with replaceLabel() near the bottom seems spurious. The more interesting one invokes u_strFromPunycode(). Same in comment 15 there.
This crbug has been closed as a duplicate of https://bugs.chromium.org/p/chromium/issues/detail?id=804462 which says “Passing this string to the GURL constructor causes u_strFromPunycode to make over 65000 insertions into an array, each one using memmove to shift ~38kb of data by 2 bytes. (Over 2GB moved in total)“
The workaround applied in Chromium checks for a domain name length of at most 5*253 before calling ICU’s UTS #46 code.
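The quadratic blow-up reported in that crbug is easy to reproduce in miniature: inserting n times near the front of a flat array shifts on the order of n^2/2 elements in total. A small self-contained illustration (not ICU code; the class and method names are invented for this example):

```java
import java.util.ArrayList;

// Counts how many elements get shifted when repeatedly inserting at
// index 0 of a flat array, mimicking the memmove pattern the crbug
// describes in u_strFromPunycode. For n front insertions, a total of
// n*(n-1)/2 elements move -- quadratic in n.
final class InsertionCost {
    static long shiftedElements(int n) {
        ArrayList<Character> out = new ArrayList<>();
        long moved = 0;
        for (int i = 0; i < n; i++) {
            moved += out.size();  // elements shifted right by this insert
            out.add(0, 'a');      // insert at the front: worst case
        }
        return moved;
    }

    public static void main(String[] args) {
        System.out.println(shiftedElements(1000)); // prints 499500
    }
}
```

Scaled up to the tens of thousands of insertions in the crbug, with each shifted element being a 2-byte code unit, this is how the "over 2GB moved in total" figure comes about.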
UnicodeBot June 30, 2018 at 11:23 PM
Trac Comment 9, 2018-05-16T18:40:35.910Z
Tighter limit after normalization?
Try to make our Punycode implementation n log n rather than n^2?
Coming from ChromeBug:811960 and ChromeBug:834838
When given a very long input string, nameToASCII and nameToUnicode can take a very long time. nameToASCII spends a lot of time in memmove (see the profiling result in the Chrome bug above).
nameToASCII and nameToUnicode set the error flag for a too-long domain name or domain label (> 253 bytes for a FQDN, > 63 for a label), but only after finishing the calculation.
That does not help here, because this issue happens during the calculation itself.
One (rather drastic) change that can be considered [1] is to return early (e.g. in UTS46::process()) when the input is too long, setting |status| to an invalid-argument error (or a new error code added specifically for this case); note that this is different from the IDNA error flags, which are reported separately.
[1] https://chromium-review.googlesource.com/c/chromium/deps/icu/+/1012919/1/source/common/uts46.cpp
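A rough sketch of such an early bail-out, modeled on the Chromium workaround that rejects domain names longer than 5*253 before calling ICU's UTS #46 code. This is written as a standalone guard rather than a patch to uts46.cpp, and the class name, constant, and method are illustrative assumptions, not ICU API:

```java
// Illustrative early length check for UTS #46 processing.
// MAX_DOMAIN_NAME_INPUT and rejectOverlongInput() are hypothetical
// names invented for this sketch; they do not exist in ICU.
final class Uts46Guard {
    // 253 bytes is the maximum length of a full domain name; the 5x
    // factor (taken from the Chromium workaround) leaves headroom for
    // non-ASCII input that shrinks when Punycode-encoded.
    static final int MAX_DOMAIN_NAME_INPUT = 5 * 253;

    /** Returns true if the input should be rejected before any processing. */
    static boolean rejectOverlongInput(CharSequence name) {
        return name.length() > MAX_DOMAIN_NAME_INPUT;
    }
}
```

In ICU's C++ code, the equivalent check would sit at the top of UTS46::process() and set |status| to an illegal-argument-style error code before returning, so the O(n^2) Punycode work is never reached.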