"ucnv_toUChars" function returns a different output length (destLength) in ICU 61.1 than in ICU 59.1

Description

I am using the "ucnv_toUChars" function to convert a std::string to a Unicode string with the "UTF-8" converter, as shown in the code below:

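#include <cstddef>
#include <iostream>
#include <string>
#include <unicode/ucnv.h>

// Print each UTF-16 code unit in hex; casting to char would truncate
// values such as U+FFFD to a single meaningless byte.
void print(const std::basic_string<char16_t>& str)
{
    for (size_t i = 0; i < str.size(); i++)
    {
        std::cout << " " << std::hex << static_cast<unsigned>(str[i]);
    }
    std::cout << std::dec << std::endl;
}

int main()
{
    // The two bytes C0 8A are not well-formed UTF-8.
    const std::string Bytes = {static_cast<char>(0xC0),
                               static_cast<char>(0x8A)};

    UErrorCode status = U_ZERO_ERROR;
    int32_t targetSize = 1024;
    char16_t target[1024];

    UConverter *conv = ucnv_open("UTF-8", &status);
    if (U_FAILURE(status))
    {
        std::cerr << "ucnv_open failed: " << u_errorName(status) << "\n";
        return 1;
    }

    // On success, the return value is the number of UTF-16 code units written.
    targetSize = ucnv_toUChars(conv, target, targetSize, Bytes.c_str(),
                               static_cast<int32_t>(Bytes.size()), &status);
    ucnv_close(conv);
    if (U_FAILURE(status))
    {
        std::cerr << "ucnv_toUChars failed: " << u_errorName(status) << "\n";
        return 1;
    }

    std::cout << "Target size: " << targetSize << "\n";

    // Pass the length explicitly instead of relying on NUL termination.
    print(std::basic_string<char16_t>(target, targetSize));

    return 0;
}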

With ICU 61.1, the output (targetSize) is 2; with ICU 59.1, it is 1.

Why does "ucnv_toUChars" behave differently in ICU 61.1 and ICU 59.1? Is this a known change in behavior?

Activity

Jeff Genovy
August 8, 2018, 5:48 PM

Note: There was a change in ICU 60 for how error sequences are handled.

"ICU now handles ill-formed UTF-8 byte sequences as specified in the W3C Encoding Standard. (#13311)"

http://site.icu-project.org/download/60

Deepak Nair
December 31, 2019, 10:45 AM

Hi Jeff,

Any updates on this ticket? As you mentioned above, the handling of error sequences changed from ICU 60 onwards. Is this the expected behaviour for ICU 60 and above? (I noticed the same behaviour with ICU 64.2.)

 

Markus Scherer
September 30, 2020, 1:37 AM

Sorry for the late reply. Yes, this is behaving correctly. C0 is not a lead byte for any valid UTF-8 sequence, so according to the newer recommendation it is an ill-formed subsequence on its own. It is then followed by another ill-formed subsequence consisting of the single byte 8A.
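
To make the counting concrete, here is a minimal sketch of a UTF-8 decoder that follows the W3C Encoding Standard's rules for ill-formed input, emitting one U+FFFD per ill-formed subsequence. It is not ICU's implementation, and the names decodeUtf8Sketch, appendCodePoint, and kReplacement are hypothetical helpers invented for this illustration. Under these rules C0 is rejected immediately as an invalid lead byte and 8A is then rejected as a stray trail byte, which gives the two code units reported for ICU 61.1.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical names for illustration only; none of this is ICU API.
constexpr char16_t kReplacement = 0xFFFD;

// Append one code point as UTF-16, using a surrogate pair above U+FFFF.
static void appendCodePoint(std::u16string& out, std::uint32_t cp) {
    assert(cp <= 0x10FFFF);
    if (cp < 0x10000) {
        out.push_back(static_cast<char16_t>(cp));
    } else {
        cp -= 0x10000;
        out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
        out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
    }
}

// Decode UTF-8 to UTF-16, replacing each ill-formed subsequence with U+FFFD
// in the manner of the W3C Encoding Standard's UTF-8 decoder.
std::u16string decodeUtf8Sketch(const std::string& bytes) {
    std::u16string out;
    std::uint32_t cp = 0;             // code point being assembled
    int needed = 0;                   // trail bytes still expected
    unsigned lower = 0x80, upper = 0xBF;  // allowed range for the next trail byte
    std::size_t i = 0;
    while (i < bytes.size()) {
        const unsigned b = static_cast<unsigned char>(bytes[i]);
        if (needed == 0) {
            if (b <= 0x7F) {
                out.push_back(static_cast<char16_t>(b));
            } else if (b >= 0xC2 && b <= 0xDF) {
                needed = 1; cp = b & 0x1F;
            } else if (b >= 0xE0 && b <= 0xEF) {
                if (b == 0xE0) lower = 0xA0;  // exclude overlong forms
                if (b == 0xED) upper = 0x9F;  // exclude surrogates
                needed = 2; cp = b & 0x0F;
            } else if (b >= 0xF0 && b <= 0xF4) {
                if (b == 0xF0) lower = 0x90;  // exclude overlong forms
                if (b == 0xF4) upper = 0x8F;  // exclude values above U+10FFFF
                needed = 3; cp = b & 0x07;
            } else {
                // 80..BF, C0, C1, F5..FF: each such byte is ill-formed on its own.
                out.push_back(kReplacement);
            }
            ++i;
            continue;
        }
        if (b < lower || b > upper) {
            // The pending sequence is cut short: emit one U+FFFD for it and
            // reprocess this byte as the start of a new sequence.
            out.push_back(kReplacement);
            cp = 0; needed = 0; lower = 0x80; upper = 0xBF;
            continue;
        }
        lower = 0x80; upper = 0xBF;
        cp = (cp << 6) | (b & 0x3F);
        ++i;
        if (--needed == 0) {
            appendCodePoint(out, cp);
            cp = 0;
        }
    }
    if (needed != 0) out.push_back(kReplacement);  // input ended mid-sequence
    return out;
}
```

Calling decodeUtf8Sketch("\xC0\x8A") in this sketch returns a string of two U+FFFD code units, while a truncated but otherwise valid prefix such as E2 82 yields only one.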

Assignee: Markus Scherer

Reporter: Archita Pruthi

Components: (none listed)

Labels: (none listed)

Reviewer: None

Priority: assess

Time Needed: None

Fix versions: None