nameToASCII and nameToUnicode: limit the input length before embarking on a potentially slow operation?

Description

Coming from ChromeBug:811960 and ChromeBug:834838

When given a very long input string, nameToASCII and nameToUnicode can take a very long time; nameToASCII spends much of it in memmove (see the profiling results in the Chrome bugs above).

nameToASCII and nameToUnicode set the error flag for a too-long domain name or domain label (> 253 bytes for an FQDN, > 63 for a label) only after finishing the calculation.

However, that does not help here, because this issue arises during the calculation itself.

One (rather drastic) change to consider [1] is to return early (e.g. in UTS46::process()) when the input is too long, setting an invalid-argument error (or a new error code added specifically for this case) in |status| (this is distinct from the separate IDN error flags).

[1] https://chromium-review.googlesource.com/c/chromium/deps/icu/+/1012919/1/source/common/uts46.cpp
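A minimal sketch of such a guard, using Python's built-in IDNA codec as a stand-in for ICU's nameToASCII; the limit value and the function name are illustrative assumptions, not ICU's actual API:

```python
# Hypothetical early-return guard: reject oversized input before any
# potentially quadratic Punycode work starts. MAX_INPUT_UNITS and
# to_ascii are illustrative; ICU's real API reports errors via |status|.
MAX_INPUT_UNITS = 1000

def to_ascii(domain: str) -> str:
    if len(domain) > MAX_INPUT_UNITS:
        raise ValueError("input too long for IDNA processing")
    # Python's built-in IDNA codec stands in for ICU's UTS #46 processing.
    return domain.encode("idna").decode("ascii")
```

With a check like this in place, pathological multi-kilobyte inputs fail immediately instead of after a long stretch of memmove-heavy work.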


Activity

Markus Scherer 
August 31, 2020 at 11:54 PM

Markus Scherer 
August 18, 2020 at 12:55 AM

I just found that in ICU C++, the encode function has a limit of MAX_CP_COUNT=200 code points.

Markus Scherer 
August 18, 2020 at 12:43 AM
(edited)

The core of the problem is that the Punycode algorithm is n^2, and I cannot think of a way to make it faster.
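A toy model (in Python, with an illustrative interface) of where the n^2 comes from: the decoder inserts each decoded code point at a computed position in the output array, shifting the tail every time — in ICU that shift is the memmove seen in the profiles.

```python
# Each decoded code point lands at a position the decoder computes, and
# everything after that position shifts right by one: O(len(out)) per
# insertion, so O(n^2) element moves for n insertions overall.
def insert_all(decoded: list[tuple[int, int]]) -> list[int]:
    out: list[int] = []
    for pos, cp in decoded:
        out.insert(pos, cp)  # list.insert shifts the tail, like memmove
    return out
```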

The simplest solution is to pick a number out of a hat for a maximum length of strings to be encoded or decoded, and make the Punycode encode and decode functions throw an illegal-argument exception when the input is longer than that.

I just experimented with the Punycode encoding of various strings.

  • The minimum is 1 byte per character or code point.

  • The most I get for longer strings is about 4 bytes per code point when most of them are supplementary — or about 3.5 bytes/char for code points sprinkled across only the BMP.

  • That is, for a UTF-16 string of length 1000 code units you get a Punycode string of 1000 to 3512 bytes (plus the 4-byte prefix "xn--" when used in an IDNA label).
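Ratios like the above can be reproduced with Python's built-in punycode codec (a convenient stand-in for ICU's encoder); note that Python's len() counts code points, not UTF-16 code units.

```python
def expansion(s: str) -> float:
    """Punycode output bytes per input code point, via Python's codec."""
    return len(s.encode("punycode")) / len(s)
```

For a pure-ASCII string the ratio is just over 1 (only the trailing delimiter is added); strings of many distinct, scattered non-ASCII code points push the ratio higher.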

A well-formed label is limited to 63 bytes, which means at most 59 bytes after "xn--". However, we don't have any limit so far, and people sometimes use libraries and protocols with non-standard inputs.

Something 1000-ish seems like a reasonable compromise to keep n^2 tame and users happy even with somewhat unusual inputs.

I propose to set a limit in Punycode.encode() of 1000 input code units, and in Punycode.decode() of 2000 input bytes.
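A sketch of the proposed limits as Python wrappers around the punycode codec; the constants mirror the proposal, the function names are illustrative, and Python's len() counts code points rather than UTF-16 units.

```python
MAX_ENCODE_UNITS = 1000   # proposed Punycode.encode() input limit
MAX_DECODE_BYTES = 2000   # proposed Punycode.decode() input limit

def punycode_encode(s: str) -> bytes:
    if len(s) > MAX_ENCODE_UNITS:
        raise ValueError("Punycode encode input too long")
    return s.encode("punycode")  # codec stands in for ICU's encoder

def punycode_decode(b: bytes) -> str:
    if len(b) > MAX_DECODE_BYTES:
        raise ValueError("Punycode decode input too long")
    return b.decode("punycode")
```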

Notes from experiments below. “100*” means 100 times the following code point or similar. Lengths are in UTF-16 code units or ASCII bytes, with numbers of code points in parentheses.

Markus Scherer 
August 14, 2020 at 10:12 PM

crbug 811960 comment 7 has two stack traces. The one quoted above with replaceLabel() near the bottom seems spurious. The more interesting one invokes u_strFromPunycode(). Same in comment 15 there.

This crbug has been closed as a duplicate of https://bugs.chromium.org/p/chromium/issues/detail?id=804462 which says “Passing this string to the GURL constructor causes u_strFromPunycode to make over 65000 insertions into an array, each one using memmove to shift ~38kb of data by 2 bytes. (Over 2GB moved in total)“

The workaround applied in Chromium checks for a domain name length of at most 5*253 before calling ICU’s UTS #46 code.
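A sketch of that workaround in Python (the IDNA codec stands in for the UTS #46 call; the constant matches the Chromium check, while the function name is hypothetical):

```python
from typing import Optional

# Chromium-style pre-check: a host longer than 5 * 253 characters cannot
# yield a valid domain name, so reject it before doing any IDNA work.
MAX_HOST_LENGTH = 5 * 253

def host_to_ascii(host: str) -> Optional[str]:
    if len(host) > MAX_HOST_LENGTH:
        return None  # invalid host; no IDNA/Punycode work performed
    return host.encode("idna").decode("ascii")
```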

UnicodeBot 
June 30, 2018 at 11:23 PM

Trac Comment 9 by —2018-05-16T18:40:35.910Z

Tighter limit after normalization?

Try to make our Punycode implementation n log n rather than n-squared?

Fixed

Details

Created June 28, 2018 at 5:27 PM
Updated August 31, 2020 at 11:54 PM
Resolved August 31, 2020 at 11:54 PM