conversion: consistent illegal sequences
Description
relates to
Activity
Trac Comment 13 by @Markus Scherer—2016-10-05T23:15:37.259Z
Milestone 4.1.2 deleted
Trac Comment 12 by 515246540@a51423369a5d5a02—2012-06-14T09:04:26.823Z
Trac Comment 9 by @Peter Edberg—2008-10-20T18:23:24.000Z
Note, this also addressed many of the problems in ticket #6511.
Trac Comment 7 by @Markus Scherer—2008-10-16T03:38:05.000Z
The final policy implemented here is:
We include at least the first byte in the illegal sequence
which we report via the callback.If any of the non-initial bytes could be the start of a character,
or the start of an escape sequence etc.,
we stop the reported illegal sequence before the first one of those,
and restart the conversion with the potential starter byte.
The fix covers the most commonly used converters (the ones most commonly found on the web): Table-based MBCS/DBCS (including EBCDIC), ISO-2022-*, and HZ.
I tried to not change the behavior where it wasn't necessary. In particular, the (logically) simplest error handling would always report only the first byte, but that seems very unnatural and would be a big change in converter behavior. Instead, sometimes the "offending byte" is included in the illegal sequence like it used to be (if it can't start a new character) and sometimes not (if it can).
If the whole sequence is 3 or 4 bytes (EUC-JP/GB 18030), the converter may go back to stop before an even earlier byte. Sometimes it has to back out bytes from a previous buffer. I was happy to see that I could reuse the "replay" mechanism from the m:n conversion feature, and that it worked fine when used from converters that don't otherwise support that feature.
I added some extra code in ucnvmbcs.c to detect if what the state table says is a lead byte can truly start a character. This is mostly for our DBCS conversion tables where the state tables always produce two-byte sequences, but many lead bytes only ever start illegal sequences. I found it more natural, and it reduced the need to modify existing tests: Otherwise, just by looking at the state table the converter would see that FF is a valid lead byte when in fact it never starts a legal sequence in DBCS charsets.
Other converters:
SBCS converters are trivially unaffected.
The UTF-8 converter has always stopped before a non-trail-byte.
(Same for CESU-8.)We should not modify UTF-16 and UTF-32 (and their variants)
because they are not really byte-based encodings:
Users expect to be able to memcpy() them, at most with a byte swap,
so we need to continue to always deliver pairs or quads of bytes.I did not test or change or think much about any other converters:
UTF-7, IMAP-mailbox-name, SCSU, BOCU-1, ISCII, LMBCS
Trac Comment 6 by @Yoshito Umaoka—2008-10-14T20:35:00.000Z
Created a ticket for J porting. After the design is agreed and implemented in C, https://unicode-org.atlassian.net/browse/ICU-6583#icft=ICU-6583 is used for J porting.
ICU conversion has mixed behavior for how many bytes to include in illegal sequences (such that the fundamental encoding scheme is violated) that are reported via callbacks and ucnv_getInvalidChars() - which also determines where the conversion stops with the stop callback.
When converting from UTF-8 to Unicode, the "offending" byte (non-trail byte in trail byte position) is excluded; the illegal sequence stops before that byte, and if the callback resets the error code, then the converter restarts with that byte. For example, in <E3 80 22> the illegal sequence will be <E3 80> and the converter will either stop before/on the 22 or restart there. (ucnv_fromUnicode() behaves like this as well, for the UTF-16 input.)
When converting from table-based (.cnv-based) MBCS charsets to Unicode, the "offending" byte is included in the illegal sequence. For example, in Shift-JIS, <81 22 61> would result in an illegal sequence of <81 22> and the converter stops before/on the 61 or restarts there.
Converters should be consistent.
There seem to be three reasonable behaviors:
a. Include the offending byte in the illegal sequence.
b. Exclude the offending byte, with a multi-byte illegal sequence up to just before it.
c. Make all illegal sequences single bytes; that is, regardless of how many bytes were read, go back to after the first one if there is an error.
Transmission errors are rare with modern protocols. The most common error is assumed to be faulty software that truncates in the middle of a character at the end of some buffer, which might then be concatenated with other buffers. In other words, we assume that the most common cause of encoding errors is too few trail bytes in a multi-byte sequence.
Option b (exclude but go up to offending byte) works best for that because it restarts with the following single or lead byte. Option a (include offending byte) would swallow a single-byte character (usually an ASCII character, which might destroy XML/HTML syntax), while option c (restart after first byte of bad sequence) will sometimes (not for UTF-8 but for MBCSes where lead and trail byte value ranges overlap) restart on a trail byte that can be mistaken for a lead byte.
In addition, option c cannot be implemented in ICU: With streaming conversion, especially in the extreme case of only passing one byte at a time into the converter, the source pointer would have to be moved to before the current buffer in order to stop after the first byte of a 3-byte or longer sequence. For <E3 80 22>, each byte may be passed in in a separate call to ucnv_toUnicode(), and when reading the third byte (22), the pointer can be either set to after the 22 (option a) or before the 22 (option b) but not after the E3 because that would be index -1 of the current buffer.
In summary, it would be best to make all converters consistently implement option b, stopping just before an "offending" byte. Consistent behavior would then also allow a callback to choose to move the source pointer forward by one unit to emulate option a, or to do consistent error reporting.