New API to check confidence of particular encoding scheme
Description
Activity
Hi @Markus Scherer, UTF-8 was just an example. Depending on the user’s input and the encoding we’re using, we might need to check the confidence of another scheme as well, such as UTF-16. The API I’m proposing would cater to that need: it would return the confidence of only the requested encoding scheme rather than the confidence level of every encoding scheme.
For checking whether a UTF-8 string is well-formed, you can do this:
See https://unicode-org.github.io/icu-docs/apidoc/released/icu4c/ustring_8h.html
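Editorial note: the snippet that originally followed "you can do this:" is not preserved in this copy of the thread. As a stand-in, here is a minimal, dependency-free sketch of what a strict well-formedness check involves. It is not ICU code and not necessarily what was suggested; the function name `is_well_formed_utf8` is hypothetical. ICU's own route would go through functions declared in the linked `ustring.h`, such as `u_strFromUTF8`, which reports an error on ill-formed input.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not an ICU function): returns true if the first
 * `length` bytes of `s` form well-formed UTF-8, i.e. no overlong forms,
 * no surrogate code points, and nothing above U+10FFFF. */
static bool is_well_formed_utf8(const char *s, size_t length) {
    assert(s != NULL || length == 0);
    const uint8_t *p = (const uint8_t *)s;
    size_t i = 0;
    while (i < length) {
        uint8_t lead = p[i++];
        if (lead < 0x80) {
            continue;                      /* single-byte ASCII */
        }
        size_t trail;                      /* number of trail bytes expected */
        uint8_t lo = 0x80, hi = 0xBF;      /* allowed range of the first trail byte */
        if (lead >= 0xC2 && lead <= 0xDF) {
            trail = 1;
        } else if (lead >= 0xE0 && lead <= 0xEF) {
            trail = 2;
            if (lead == 0xE0) lo = 0xA0;   /* rejects overlong 3-byte forms */
            if (lead == 0xED) hi = 0x9F;   /* rejects surrogates U+D800..U+DFFF */
        } else if (lead >= 0xF0 && lead <= 0xF4) {
            trail = 3;
            if (lead == 0xF0) lo = 0x90;   /* rejects overlong 4-byte forms */
            if (lead == 0xF4) hi = 0x8F;   /* rejects code points above U+10FFFF */
        } else {
            return false;                  /* stray trail byte or invalid lead byte */
        }
        if (length - i < trail) {
            return false;                  /* sequence truncated at end of input */
        }
        if (p[i] < lo || p[i] > hi) {
            return false;
        }
        i++;
        for (size_t k = 1; k < trail; k++, i++) {
            if (p[i] < 0x80 || p[i] > 0xBF) {
                return false;              /* remaining trail bytes must be 10xxxxxx */
            }
        }
    }
    return true;
}
```

A call such as `is_well_formed_utf8(buf, len)` gives a yes/no answer in a single pass over the bytes. That is a different thing from a detector's confidence score, which is the distinction the question below is asking about.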
@Ria Jain what does “confidence” mean for whether a string is UTF-8? If you mean “well-formed UTF-8” with no exceptions, then there are ICU functions you can call. If you want to accept some number of ill-formed parts, then how do you calculate the confidence value?
Hi @Markus Scherer, your API computes the confidence level for all encoding schemes, which adds considerable time to our network call. We just want to check whether the string is UTF-8 or not.
I measured the time taken by your API and by the API I created to get the confidence for UTF-8 only. Here are my findings:
Existing public API: 103 microseconds
New API: 3 microseconds
Which is why we thought it would be useful to have an API that gets the confidence for only a particular encoding.
@Ria Jain please take a look at the question above

Right now, there is an API that checks the confidence level of every encoding scheme (ucsdet_detectAll), but sometimes we need to check the confidence level of only one particular encoding scheme. Using ucsdet_detectAll for that takes a long time, because it goes through the confidence level of each encoding scheme. It would be better to have an API that checks the confidence level of a particular encoding scheme.