Direct access to ICU charsets in Java?

Description

Currently, as far as I can tell, because the constructors of all the CharsetICU implementations are protected or private, the only way to get an actual instance of one of the ICU charsets is to use the static CharsetICU.charsetForName(String name) method. This works, but it appears to involve a fairly convoluted sequence of calls back and forth with CharsetProviderICU, including at least two calls into internal functions (are they deprecated?) that fetch the names and aliases of the charsets from those processed text files. As far as I can find, that data (the aliases, for instance) is then not used for anything if we are using the CharsetICU implementation anyway, not even by the extended java.nio.charset.Charset. Finally, the actual charset is instantiated with reflection, which, although the difference is small, takes longer than the new operator, no?

May I suggest that it's appropriate to have some public class analogous to the java.nio.charset.StandardCharsets, to allow direct, programmatic access to an instance of a given charset? Something like this (UnicodeCharsets.java):
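
The attached file is not reproduced here. Purely as an illustration of the shape such a holder class could take, here is a minimal sketch. It is not the reporter's UnicodeCharsets.java: the class name UnicodeCharsetsSketch, its fields, and the lookup helper are hypothetical, and it relies only on java.nio.charset lookups so that it runs without ICU4J on the classpath. In the proposal itself, each constant would presumably hold an ICU charset instance instead.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Optional;

/**
 * Hypothetical holder class, modeled on java.nio.charset.StandardCharsets.
 * Assumption: JDK lookups stand in for the ICU instances the proposal has in mind.
 */
final class UnicodeCharsetsSketch {
    private UnicodeCharsetsSketch() {}

    // Guaranteed to exist on every Java platform implementation.
    static final Charset UTF_8 = StandardCharsets.UTF_8;
    static final Charset UTF_16BE = StandardCharsets.UTF_16BE;
    static final Charset UTF_16LE = StandardCharsets.UTF_16LE;

    // Not guaranteed by Java SE, so these may be empty on some runtimes.
    static final Optional<Charset> UTF_32BE = lookup("UTF-32BE");
    static final Optional<Charset> UTF_32LE = lookup("UTF-32LE");
    static final Optional<Charset> UTF_7 = lookup("UTF-7");

    /** Resolves a charset once, at class initialization, without throwing for missing charsets. */
    private static Optional<Charset> lookup(String name) {
        return Charset.isSupported(name) ? Optional.of(Charset.forName(name)) : Optional.empty();
    }

    public static void main(String[] args) {
        // Callers use the constants directly, with no lookup or exception handling at the call site.
        byte[] bytes = "caf\u00e9".getBytes(UTF_8);
        System.out.println(new String(bytes, UTF_8));
        System.out.println("UTF-32BE available: " + UTF_32BE.isPresent());
    }
}
```

The optional fields reflect that a stock runtime may not ship those charsets; a version backed by ICU would not need that hedge for charsets ICU itself provides.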

Activity

Markus Scherer
March 26, 2020, 12:34 AM

UTF-32 is useful for processing because you get fixed-width random access to code points. As a storage format, it’s a colossal waste of space.
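
To make the size trade-off concrete, the sketch below counts the bytes each encoding form needs for the same text. The sample string and the class name are arbitrary choices for the example. The UTF-32 size is computed as four bytes per code point rather than through a UTF-32 charset, since Java SE does not guarantee one.

```java
import java.nio.charset.StandardCharsets;

final class EncodingSizeDemo {
    /** Bytes needed in UTF-32: every code point takes exactly four bytes (no BOM counted). */
    static int utf32Length(String s) {
        return 4 * s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        // ASCII text plus one supplementary code point (U+1F600, written as a surrogate pair).
        String text = "ICU charsets \uD83D\uDE00";
        System.out.println("UTF-8:  " + text.getBytes(StandardCharsets.UTF_8).length);
        System.out.println("UTF-16: " + text.getBytes(StandardCharsets.UTF_16BE).length);
        System.out.println("UTF-32: " + utf32Length(text));
    }
}
```

For mostly-ASCII text like this, UTF-32 comes out at roughly three to four times the UTF-8 size, which is the storage cost referred to above.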

CESU-8 is basically a formal name for someone’s legacy broken almost-UTF-8 encoding.

BOCU-1 is kind of clever (or so I claim…) but got nowhere because of the licensing issue.

SCSU is… nice, but it has also fallen out of favor since the HTML community rejected any charsets that use bytes 00..7F for anything other than US-ASCII. (Security issues when using heuristic charset detection, which is common. Same for BOCU-1, HZ, and others.)

Also, the point of BOCU-1 & SCSU was to provide a compact encoding without having to use a general-purpose compression (e.g., zip). We have seen that people who care about size would rather do the latter, and then it matters little what the input encoding is.

So… I recommend you use StandardCharsets.

ICU converters are mostly useful if you want to support more charsets, or you want the specific mapping tables that ICU has (or modify them etc.).

Joshua Chambers
March 26, 2020, 12:20 AM

The one exception that could theoretically be relevant is the one thrown by Charset.checkName(String s), called from the constructor of Charset:

 

 

Since we’re using literals to call the function, we can guarantee that we won’t trigger this exception, but could handle that possibility if we have to, eh?
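
The name check can be observed without ICU. The sketch below defines a throwaway Charset subclass (NamedCharsetSketch, a hypothetical name, delegating to UTF-8 only so that it compiles) and shows that the protected Charset constructor accepts a legal literal and rejects an illegal one with the unchecked IllegalCharsetNameException.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.IllegalCharsetNameException;
import java.nio.charset.StandardCharsets;

/** Hypothetical Charset subclass that delegates to UTF-8; only the name check matters here. */
final class NamedCharsetSketch extends Charset {
    NamedCharsetSketch(String canonicalName) {
        // The protected Charset constructor validates the canonical name and the aliases.
        super(canonicalName, new String[0]);
    }

    @Override public boolean contains(Charset cs) { return StandardCharsets.UTF_8.contains(cs); }
    @Override public CharsetDecoder newDecoder() { return StandardCharsets.UTF_8.newDecoder(); }
    @Override public CharsetEncoder newEncoder() { return StandardCharsets.UTF_8.newEncoder(); }

    /** Returns true if the Charset constructor rejects the given name. */
    static boolean isRejected(String name) {
        try {
            new NamedCharsetSketch(name);
            return false;
        } catch (IllegalCharsetNameException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println(isRejected("x-sketch-utf8"));    // a legal literal name
        System.out.println(isRejected("not a legal name")); // spaces are not allowed in charset names
    }
}
```

Because the exception is unchecked, a constructor called with a fixed, legal literal needs no try/catch at the call site.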

Joshua Chambers
March 25, 2020, 10:37 PM

“Instantiating charset with forName() requires exception handling.”

  • The UnsupportedCharsetException seems to relate only to looking up the name, which is used solely to find the Charset.

  • The IOException seems to relate to finding the aliases.

  • ClassNotFoundException, InvocationTargetException, NoSuchMethodException, IllegalAccessException, and InstantiationException all only have to be there because we are finding the constructor and instantiating it with reflection.

So unless I’m missing some exception handling here, all of these are needed only because of the way we’re instantiating the charsets, and they would be irrelevant if we just called the constructor directly. No?
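
For comparison on the stock Java side, the sketch below contrasts a name-based lookup, which can fail at run time, with a StandardCharsets constant, which cannot. It uses only java.nio.charset; the class and helper names are made up for the example, and the ICU lookup path is not exercised.

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;
import java.nio.charset.StandardCharsets;
import java.nio.charset.UnsupportedCharsetException;

final class LookupVersusConstant {
    /** Name-based lookup: both failure modes are unchecked, but either can occur at run time. */
    static Charset byName(String name, Charset fallback) {
        try {
            return Charset.forName(name);
        } catch (IllegalCharsetNameException | UnsupportedCharsetException e) {
            return fallback;
        }
    }

    public static void main(String[] args) {
        // A constant needs no lookup and no failure handling.
        Charset constant = StandardCharsets.UTF_8;

        // A lookup by name may succeed...
        Charset looked = byName("UTF-8", StandardCharsets.US_ASCII);
        // ...or fall back when the runtime has no charset with that name.
        Charset missing = byName("x-no-such-charset", StandardCharsets.US_ASCII);

        System.out.println(constant.equals(looked));
        System.out.println(missing);
    }
}
```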

“ICU support for UTF_8 / UTF_16 / UTF_16BE / UTF_16LE should be nothing different from Java. At this point, I’d recommend Java developers to use stock Java implementation.”

Mightn’t someone want to use the ICU versions explicitly? They already can, by calling CharsetICU.charsetForName(), which returns the ICU charset if it exists and the stock one if not. It’s just such a convoluted process, and it unnecessarily slows down the instantiation of a charset.

“Do you actually use these (historical) Unicode encodings?”

My intention in the project I’m working on is to support all the UTF encodings (UTF-8, UTF-7, UTF-16LE, UTF-16BE, UTF-32LE, UTF-32BE). I included the others in my code only because they are Unicode charsets, and that’s what I’d named the class after. If I hadn’t, it would have just been “ClassesIWantEasierAccessTo”.

Whichever classes should go in there, it seems to me that nothing is lost by creating a more convenient way to get at these classes, just as Java did. We could call it UTFCharsets and leave out those three?

Yoshito Umaoka
March 25, 2020, 8:34 PM

Joshua, I understand your proposal, but I’m not sure it really makes sense practically.

Instantiating a charset with forName() requires exception handling. I think Java introduced StandardCharsets for convenience. From that point of view, your suggestion makes sense.

The ICU charset provider includes these Unicode encodings. Java already provides UTF_8 / UTF_16 / UTF_16BE / UTF_16LE. It’s true that UTF_32BE/LE, UTF_7, BOCU_1, CESU_8, and SCSU are not supported by Java SE, but I wonder about the value of these encodings. UTF_7 / BOCU_1 / CESU_8 are mostly supported for historical reasons only. UTF_32BE/LE is not commonly used.

ICU support for UTF_8 / UTF_16 / UTF_16BE / UTF_16LE should be no different from Java’s. At this point, I’d recommend that Java developers use the stock Java implementation.

As for the other encodings, which a typical Java runtime does not support, I think their importance has become very low. I’m not sure that providing a convenience class for them really makes sense at this point.

Do you actually use these (historical) Unicode encodings?

Assignee: Yoshito Umaoka

Reporter: Joshua Chambers

Priority: minor