Consider defaulting to UTF-8 on Linux


I'm using Ubuntu Linux 8.04 (the latest release at this time). When I install every available locale, all of them are UTF-8 based except for the C and POSIX locales. Some other recent Linux distributions are also removing their non-UTF-8 locales.

Currently, int_getDefaultCodepage in putil.c can't call setlocale(LC_CTYPE, "") to force nl_langinfo to report the codepage actually in use. That call is not thread safe, and we can't force users to make it themselves or expect them to make it in a thread-safe manner. This is why only setlocale(LC_CTYPE, NULL) is used, which queries the current locale without changing it. Unfortunately, this prevents the correct codepage from being detected. So if zh_CN is used instead of zh_CN.utf8, ICU defaults to US-ASCII instead of UTF-8, which usually isn't helpful.
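
To make the difference concrete, here is a minimal sketch of the two ways of asking for the codeset. It is not the code in putil.c; the helper names (current_codeset, codeset_without_init, codeset_after_init) are made up for illustration, and it assumes a POSIX system that provides nl_langinfo.

```c
#include <assert.h>
#include <langinfo.h>
#include <locale.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: copy the codeset that nl_langinfo reports for the
   LC_CTYPE locale currently active in this process. */
static void current_codeset(char *buf, size_t cap) {
    const char *cs;
    assert(buf != NULL && cap > 0);
    cs = nl_langinfo(CODESET);
    snprintf(buf, cap, "%s", cs != NULL ? cs : "");
}

/* Query only: passing NULL to setlocale returns the current locale name and
   leaves the locale untouched. If the program never initialized its locale,
   this is still the "C" locale, no matter what LANG or LC_ALL say. */
static void codeset_without_init(char *buf, size_t cap) {
    (void)setlocale(LC_CTYPE, NULL);
    current_codeset(buf, cap);
}

/* Adopt the locale named by the environment, then ask for its codeset.
   This changes process-wide state, which is why a library cannot safely
   do it on behalf of a multithreaded application. */
static void codeset_after_init(char *buf, size_t cap) {
    (void)setlocale(LC_CTYPE, "");
    current_codeset(buf, cap);
}
```

In a program that has not called setlocale itself, codeset_without_init reports the codeset of the C locale (glibc names it ANSI_X3.4-1968, an alias for US-ASCII), while codeset_after_init reports whatever the environment's locale actually uses, if that locale is installed.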

This scenario is similar to what happens on Mac OS X, except that nl_langinfo returns "" on Mac OS X instead of US-ASCII. There is no perfect solution to this problem, but as more Linux distributions default to UTF-8 locales, it might make more sense to have ICU default to UTF-8 on Linux too. When that happens, the U_LINUX section of remapPlatformDependentCodepage might want to remap US-ASCII to UTF-8 for the non-C/POSIX locales.
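
A rough sketch of what such a remapping could look like follows. This is not the existing remapPlatformDependentCodepage code; the function names are hypothetical, locale_name is assumed to be the POSIX locale ID ICU already determined by other means, and treating glibc's ANSI_X3.4-1968 as another spelling of US-ASCII is also an assumption of this sketch.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper: nonzero when the locale ID is the C or POSIX locale.
   A missing or empty ID is treated the same way, so nothing is remapped. */
static int is_c_or_posix(const char *locale_name) {
    return locale_name == NULL
        || locale_name[0] == '\0'
        || strcmp(locale_name, "C") == 0
        || strcmp(locale_name, "POSIX") == 0;
}

/* Hypothetical remapping for Linux: if the detected codepage is plain ASCII
   but the locale is not C/POSIX, assume the detection was wrong and use
   UTF-8 instead. Every other codepage is passed through unchanged. */
static const char *remap_linux_codepage(const char *locale_name,
                                        const char *codepage) {
    assert(codepage != NULL);
    if (is_c_or_posix(locale_name)) {
        return codepage;
    }
    if (strcmp(codepage, "US-ASCII") == 0
        || strcmp(codepage, "ANSI_X3.4-1968") == 0) {
        return "UTF-8";
    }
    return codepage;
}
```

With this, zh_CN paired with US-ASCII becomes UTF-8, while the C and POSIX locales keep US-ASCII and an explicitly detected codepage such as EUC-JP is left alone.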





George Rhoten
