UBiDi and UText Enhancements Roll-up

Description

This ticket is a roll-up of outstanding tickets related to supporting UText functionality in UBiDi.

Elsewhere in the ICU library, the UText abstraction allows any text storage and encoding provider to be used. UBiDi, however, does not currently support UText. One reason is that ubidi_writeReordered() and ubidi_writeReverse() write to caller-provided UChar arrays, while the UChar UText provider does not currently support write operations.

Changing ubidi_writeReordered() and ubidi_writeReverse() requires addressing the deficiencies of the existing UChar UText provider. At the same time, a general review of the current implementations for UChar arrays (UTF-16) and uint8_t arrays (UTF-8) can be completed.
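As a rough illustration of why a write path built around UTF-16 code units cannot be reused unchanged for UTF-8 storage, the sketch below reverses a UTF-8 buffer by code point, keeping each multi-byte sequence intact. It is not taken from ICU or from the proposed branch: the names u8_seq_len and u8_reverse_by_code_point are hypothetical, the input is assumed to be well-formed UTF-8, and none of the options handled by ubidi_writeReverse() (mirroring, combining marks, control removal) are modelled.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper (not an ICU function): number of bytes in the UTF-8
 * sequence that starts with lead byte b. Assumes well-formed input. */
static size_t u8_seq_len(uint8_t b) {
    if (b < 0x80) return 1;           /* ASCII */
    if ((b & 0xE0) == 0xC0) return 2; /* 110xxxxx */
    if ((b & 0xF0) == 0xE0) return 3; /* 1110xxxx */
    return 4;                         /* 11110xxx */
}

/* Hypothetical helper (not an ICU function): copies len bytes of UTF-8 from
 * src into dst with the code points in reverse order. The bytes inside each
 * code point keep their original order. dst must hold at least len bytes and
 * must not overlap src. Returns the number of bytes written. */
size_t u8_reverse_by_code_point(const uint8_t *src, size_t len, uint8_t *dst) {
    size_t i = 0;     /* read position, moving forward through src */
    size_t out = len; /* write position, moving backward through dst */
    while (i < len) {
        size_t n = u8_seq_len(src[i]);
        if (n > len - i) {
            n = len - i; /* clamp a truncated trailing sequence */
        }
        out -= n;
        memcpy(dst + out, src + i, n);
        i += n;
    }
    return len;
}
```

Reversing the same buffer byte by byte would split every multi-byte sequence, which is the kind of code-unit assumption that has to be removed before UTF-8 and UTF-32 providers can share one code path with UTF-16.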

The following new API functions are proposed for UBiDi to support UText:

void ubidi_setUContext(UBiDi *pBiDi,
                       UText *prologueUt,
                       UText *epilogueUt,
                       UErrorCode *pErrorCode);

void ubidi_setUPara(UBiDi *pBiDi, UText *pText,
                    UBiDiLevel paraLevel, UBiDiLevel *embeddingLevels,
                    UErrorCode *pErrorCode);

UBiDiDirection ubidi_getUBaseDirection(UText *ut);

const UText *ubidi_getUText(const UBiDi *pBiDi);

int32_t ubidi_writeUReordered(UBiDi *pBiDi,
                              UText *dstUt,
                              uint16_t options,
                              UErrorCode *pErrorCode);

int32_t ubidi_writeUReverse(UText *srcUt,
                            UText *dstUt,
                            uint16_t options,
                            UErrorCode *pErrorCode);

The following new API function is proposed to test whether a UText is valid:

UBool utext_isValid(const UText *ut);

The following new API functions are proposed to provide a consistent entry point for creating the internal array-based UText providers:

UText *utext_openConstU16(UText *ut,
                          const UChar *s, int64_t length, int64_t capacity,
                          UErrorCode *status);

UText *utext_openU16(UText *ut,
                     UChar *s, int64_t length, int64_t capacity,
                     UErrorCode *status);

UText *utext_openConstU8(UText *ut,
                         const uint8_t *s, int64_t length, int64_t capacity,
                         UErrorCode *status);

UText *utext_openU8(UText *ut,
                    uint8_t *s, int64_t length, int64_t capacity,
                    UErrorCode *status);

UText *utext_openConstU32(UText *ut,
                          const UChar32 *s, int64_t length, int64_t capacity,
                          UErrorCode *status);

UText *utext_openU32(UText *ut,
                     UChar32 *s, int64_t length, int64_t capacity,
                     UErrorCode *status);

The following new API function is proposed as a helper function for UText provider developers:

UText *utext_shallowClone(UText *dest, const UText *src, UErrorCode *status);

A discussion of the general approach is available here: https://werbicki.github.io/c++/2018/12/29/icu-ubidi-utext-enhancements.html

Activity

Paul Werbicki August 16, 2023 at 3:35 PM

Thank you for dropping the note - I had forgotten about this work and I still think it has value. Let me refresh this over the next month and open a pull request to be evaluated.

Steven R. Loomis August 9, 2023 at 8:28 PM

I’m seeing this years later, but have you considered opening this as a pull request into ICU now that it is on GitHub?

Paul Werbicki June 24, 2019 at 1:29 PM

This is quite overdue, but the latest version is available at https://github.com/werbicki/icu/tree/UBiDi-and-UText-Enhancements/. Test cases for UBiDi, UShape, and UTransformBidi now all run in UTF8/16/32. Major reorganizing was involved to get everything testing cleanly, but the original test cases are mostly intact. UBiDi had many areas where UTF16 was assumed, especially around UBIDI_REORDER_RUNS_ONLY mode, all of which have been addressed. Additional changes to the API are proposed and I will be placing the documentation online later today.

Please take a look and leave me your feedback. Thanks!

Paul Werbicki March 21, 2019 at 2:57 AM

The latest version is available at https://github.com/werbicki/icu/tree/UBiDi-and-UText-Enhancements/. This version compiles and runs the tests, although it currently causes failures in the ubiditransform test suite.

The additional tests for UTF8 and UTF32 uncovered an assumption made in ubidi.cpp regarding dirProps that makes it incompatible with UTF8. Further coding is required, which will unfortunately take some time.

I should have the work completed by early April.

Paul Werbicki March 20, 2019 at 1:46 PM

Hi Heba,

The work is complete for the following files:
source/common/ubiditransform.h
source/common/ushape.h
source/common/ubiditransform.cpp
source/common/ushape.cpp
test/cintltst/cbiditst.c

Currently everything in UBiDi works by code point, with no more UTF16 assumptions.

I am currently working on:
test/cintltst/cbiditransformtst.c
test/cintltst/cbididat.c
test/cintltst/cbiditst.c

which has exposed an issue in source/common/ubidiwrt.cpp that is taking extra time.

In order to complete the work I will also have to propose several new API functions, which I have not finalized.

Let me complete what I can today and update my branch for review.

Details

Time Needed

Days

Created January 18, 2019 at 4:35 PM
Updated August 16, 2023 at 3:35 PM