New API to check confidence of particular encoding scheme

General

Other Data

Description

Right now there is an API, ucsdet_detectAll, that reports the confidence level for every encoding scheme, but sometimes we only need the confidence level for one particular encoding scheme. In that case, calling ucsdet_detectAll takes a long time because it works through the confidence level of each encoding scheme. It would be better to have an API that checks the confidence level of a single, specified encoding scheme.

Activity

Ria Jain

March 27, 2024 at 7:06 AM

Hi @Markus Scherer, UTF-8 was just an example. Depending on the user’s input and the kind of encoding we’re using, we might need to check the confidence of some other scheme as well, for example UTF-16. The API I’m proposing would cater to that need: it would return the confidence for the requested encoding scheme only, rather than the confidence level of every encoding scheme.

Markus Scherer

March 7, 2024 at 4:03 PM

For checking whether a UTF-8 string is well-formed, you can do this:

See https://unicode-org.github.io/icu-docs/apidoc/released/icu4c/ustring_8h.html
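
As a rough illustration of what a strict "well-formed UTF-8, no exceptions" check involves, here is a minimal standalone C sketch. It does not use ICU and is not the code from the comment above; the names `is_well_formed_utf8` and `in_range` are hypothetical and chosen only for this example. It follows the well-formed byte sequence ranges defined by the Unicode Standard, so it rejects overlong forms, surrogate code points, truncated sequences, and values above U+10FFFF.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: true if b lies in the inclusive range [lo, hi]. */
static bool in_range(uint8_t b, uint8_t lo, uint8_t hi) {
    return b >= lo && b <= hi;
}

/* Hypothetical function (not an ICU API): returns true only if the
 * first len bytes of s are entirely well-formed UTF-8. */
bool is_well_formed_utf8(const char *s, size_t len) {
    const uint8_t *p = (const uint8_t *)s;
    size_t i = 0;

    while (i < len) {
        uint8_t lead = p[i];
        size_t trail;                 /* number of trail bytes expected */
        uint8_t lo = 0x80, hi = 0xBF; /* allowed range for the first trail byte */

        if (lead <= 0x7F) {           /* ASCII */
            i++;
            continue;
        } else if (in_range(lead, 0xC2, 0xDF)) {
            trail = 1;
        } else if (lead == 0xE0) {    /* exclude overlong 3-byte forms */
            trail = 2; lo = 0xA0;
        } else if (lead == 0xED) {    /* exclude surrogates U+D800..U+DFFF */
            trail = 2; hi = 0x9F;
        } else if (in_range(lead, 0xE1, 0xEF)) {
            trail = 2;
        } else if (lead == 0xF0) {    /* exclude overlong 4-byte forms */
            trail = 3; lo = 0x90;
        } else if (lead == 0xF4) {    /* exclude values above U+10FFFF */
            trail = 3; hi = 0x8F;
        } else if (in_range(lead, 0xF1, 0xF3)) {
            trail = 3;
        } else {
            return false;             /* 0x80..0xC1 or 0xF5..0xFF: never a valid lead byte */
        }

        if (len - i <= trail) {
            return false;             /* sequence is cut off at the end of the input */
        }
        if (!in_range(p[i + 1], lo, hi)) {
            return false;
        }
        for (size_t k = 2; k <= trail; k++) {
            if (!in_range(p[i + k], 0x80, 0xBF)) {
                return false;
            }
        }
        i += trail + 1;
    }
    return true;
}
```

A check like this gives a yes/no answer in a single pass over the bytes, which is a different question from the statistical confidence value that charset detection reports.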

Markus Scherer

February 15, 2024 at 5:01 PM

@Ria Jain what does “confidence” mean for whether a string is UTF-8? If you mean “well-formed UTF-8” with no exceptions, then there are ICU functions you can call. If you want to accept some number of ill-formed parts, then how do you calculate the confidence value?

Ria Jain

January 12, 2024 at 7:14 AM

Hi @Markus Scherer, your API gets the confidence level for all encoding schemes, which adds a lot of time to our network call. We just want to check whether the string is UTF-8 or not.

I measured the time taken by your API and by the API I created to get the confidence for UTF-8 only. Below are my findings:

Existing public API: 103 microseconds

New API: 3 microseconds

Which is why we thought it would be useful to have an API that gets the confidence for a particular encoding only.

Markus Scherer

November 30, 2023 at 5:51 PM

@Ria Jain please take a look at the question above

Needs More Information

Details

Assignee

Unassigned

Reporter

Ria Jain

Priority

assess

Created September 8, 2023 at 11:09 AM

Updated March 27, 2024 at 7:06 AM

Resolved March 7, 2024 at 4:03 PM