Expose udata_getLength?


Currently there is no way to get the length of the actual data in a UDataMemory object. However, there is an internal function which does this (in internal udatamem.h, not public udata.h):

int32_t udata_getLength(const UDataMemory *pData)

Its implementation includes the following comments:

 * TODO Consider making this function public.
 * It would have to return the actual length in more cases.
 * For example, the length of the last item in a .dat package could be
 * computed from the size of the whole .dat package minus the offset of the
 * last item.
 * The size of a file that was directly memory-mapped could be determined
 * using some system API.
 *
 * In order to get perfect values for all data items, we may have to add a
 * length field to UDataInfo, but that complicates data generation
 * and may be overkill.
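The subtraction that the TODO comment describes can be sketched as follows. This is only an illustration, not ICU code: the function name `sketch_item_length`, the offsets array, and the assumption that items are stored back to back in offset order are all hypothetical. The comment itself only mentions the last item; extending the same idea to earlier items (next offset minus own offset) is an additional assumption.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper, not part of ICU.
 * offsets[i] is the byte offset of item i inside the package and
 * packageSize is the size of the whole package in bytes.
 * Assumes items are laid out contiguously in offset order.
 * Returns the item length in bytes, or -1 for invalid input.
 */
int32_t sketch_item_length(const int32_t *offsets, int32_t count,
                           int32_t packageSize, int32_t index) {
    int32_t end;
    if (offsets == NULL || index < 0 || index >= count) {
        return -1;
    }
    /* The last item ends where the package ends; others end at the next item. */
    end = (index == count - 1) ? packageSize : offsets[index + 1];
    if (end < offsets[index]) {
        return -1; /* inconsistent offsets */
    }
    return end - offsets[index];
}
```

Note that any padding between items would be counted as part of the preceding item, which is one reason the comment suggests a length field in UDataInfo might be needed for perfect values.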

Use case (but maybe not, see later):
1. C++ and C provide ways to create a [U]BreakIterator from compiled (binary) break rules:

  • RuleBasedBreakIterator::RuleBasedBreakIterator(const uint8_t *compiledRules, uint32_t ruleLength,...);

  • UBreakIterator* ubrk_openBinaryRules(const uint8_t *binaryRules, int32_t rulesLength,...);

2. Where does the ruleLength value come from?

  • The binary rules might be obtained from [ubrk_]getBinaryRules, which provides a length. Or...

  • The binary rules might be obtained by running genbrk to create some .brk file, using udata_open to create a UDataMemory representing the data, and then using udata_getMemory to get the pointer to the binary rules to pass in. But in this case, how does one get the rulesLength to pass in? '''This is why udata_getLength would presumably be useful.'''

'''But does the rulesLength parameter actually matter?'''

Internally, the RuleBasedBreakIterator assumes that the const uint8_t *binaryRules pointer passed in is actually a pointer to a RBBIDataHeader, and gets the actual rules length from the RBBIDataHeader fLength field. The rulesLength parameter passed is only checked for two things:
1. It is checked to make sure it is >= size of RBBIDataHeader, to ensure there is enough data for a valid RBBIDataHeader. Then...
2. It is checked to make sure it is >= the RBBIDataHeader fLength field.
So if const uint8_t *binaryRules does point to valid data, a caller could always pass INT32_MAX for rulesLength. '''Presumably we don't want to encourage or rely on this, however.'''
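The two checks above can be sketched in isolation. This is a minimal illustration, not the actual ICU implementation: `SketchRulesHeader` is a hypothetical, simplified stand-in for RBBIDataHeader (the real struct has more fields), and `sketch_rules_length_ok` is an invented name.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, simplified stand-in for RBBIDataHeader. */
typedef struct {
    uint32_t magic;   /* placeholder field, unused in this sketch */
    uint32_t length;  /* plays the role of fLength: total size in bytes */
} SketchRulesHeader;

/* Returns 1 if rulesLength passes both checks described above, else 0. */
int sketch_rules_length_ok(const uint8_t *binaryRules, int32_t rulesLength) {
    SketchRulesHeader header;
    /* Check 1: enough data for a valid header. */
    if (binaryRules == NULL || rulesLength < (int32_t)sizeof(SketchRulesHeader)) {
        return 0;
    }
    memcpy(&header, binaryRules, sizeof header);
    /* Check 2: the caller's length covers the length stored in the header. */
    if (header.length > (uint32_t)rulesLength) {
        return 0;
    }
    return 1;
}
```

Under these assumptions, passing INT32_MAX satisfies both checks whenever the header itself is valid, so the parameter only guards against a buffer that is too short.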




Peter Edberg


Feb 15, 2017, 6:35 AM









