Support for estimating the monospace display width of a sequence of EGCs in a terminal or console

Description

Overview

This is a request for ICU to provide a function that, given a sequence of extended grapheme clusters, returns an estimate of the number of display columns that a typical computer terminal or console that uses a monospaced font would consume when displaying the text.

Estimated display widths in C++

C++20 introduced (via P0645) a new type-safe std::format facility as an alternative to the venerable printf function inherited from C. This facility was extended (via P1868, still in C++20) to improve the ability to format text for tabular display in a typical terminal/console configured to use a monospace font, where some characters are displayed such that they consume more than one display column. The original method used to estimate the display width of a character sequence was based on a specified set of code point ranges. A new method was adopted (via P2675, for C++23 but as a defect report for C++20) that computes the display width based on character properties (East_Asian_Width="W" and East_Asian_Width="F") and assigned blocks (Yijing Hexagram Symbols, Miscellaneous Symbols and Pictographs, and Supplemental Symbols and Pictographs). Analysis and testing revealed that this approach 1) better matches the algorithms used by common terminal emulators, and 2) better matches the observed behavior on multiple operating systems. The specification for this behavior in the working draft of the C++ standard can be found in [format.string.std]p13.
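As an illustration only, the P2675-style estimate described above can be sketched in a few lines of Python using the standard unicodedata module. The function names are hypothetical, and the cluster handling is a deliberate simplification: a code point of General_Category Mn or Me is assumed to continue the preceding cluster, which is much cruder than real extended grapheme cluster segmentation. Results also depend on the Unicode version bundled with the Python interpreter.

```python
import unicodedata

# Blocks treated as wide regardless of East_Asian_Width (from the text above).
WIDE_BLOCKS = (
    (0x4DC0, 0x4DFF),    # Yijing Hexagram Symbols
    (0x1F300, 0x1F5FF),  # Miscellaneous Symbols and Pictographs
    (0x1F900, 0x1F9FF),  # Supplemental Symbols and Pictographs
)


def code_point_width(ch: str) -> int:
    """Return 2 for code points classified as wide, otherwise 1."""
    if unicodedata.east_asian_width(ch) in ("W", "F"):
        return 2
    cp = ord(ch)
    if any(lo <= cp <= hi for lo, hi in WIDE_BLOCKS):
        return 2
    return 1


def estimated_width(text: str) -> int:
    """Sum the width of the first code point of each (approximate) cluster."""
    width = 0
    in_cluster = False
    for ch in text:
        # Assumption: Mn/Me code points extend the current cluster and
        # contribute no width of their own.
        if in_cluster and unicodedata.category(ch) in ("Mn", "Me"):
            continue
        width += code_point_width(ch)
        in_cluster = True
    return width
```

With this sketch, estimated_width("abc") yields 3 and estimated_width("日本語") yields 6. Every cluster receives a width of 1 or 2, which is the limitation discussed next.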

The C++ standard currently assigns a display width of either 1 or 2 to every EGC. This is incorrect for some EGCs. In particular, EGCs whose base character is one of the following behave differently in practice (see P2572 for related discussion of the use of these characters as fill characters):

  • U+0007 BELL

  • U+0008 BACKSPACE

  • U+007F DELETE

  • U+0009 CHARACTER TABULATION

  • U+000A LINE FEED (LF)

  • U+000B LINE TABULATION

  • U+000C FORM FEED (FF)

  • U+000D CARRIAGE RETURN (CR)

  • U+0085 NEXT LINE (NEL)

  • U+FDFD ARABIC LIGATURE BISMILLAH AR-RAHMAN AR-RAHEEM

  • Various other control characters
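A caller that wants to avoid reporting a misleading width for such characters could flag them before estimating. The sketch below is one hypothetical way to do so in Python; the function name is an assumption, and the rule (General_Category Cc, plus U+FDFD as a special case) is only a rough proxy for the list above, not a complete classification.

```python
import unicodedata


def width_depends_on_terminal(ch: str) -> bool:
    """Flag characters for which a fixed width of 1 or 2 is unreliable."""
    # All of the control characters listed above have General_Category Cc.
    if unicodedata.category(ch) == "Cc":
        return True
    # U+FDFD is an ordinary symbol (So) but is typically rendered far wider
    # than two columns, so it is singled out explicitly.
    return ch == "\uFDFD"
```

For example, width_depends_on_terminal("\t") is True while width_depends_on_terminal("a") is False.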

A fairly comprehensive review of the behavior of various terminals and consoles with respect to the characters that changed estimated width with the adoption of the P2675 proposal is available here (the broken image link there is intended to link to this image).

Display widths in common terminal and console applications

Evidence that the methodology adopted via P2675 is reflective of existing practice is available as described below. Please note that I have not extensively studied the referenced sources nor how they impact the behavior of the referenced application.

Microsoft Terminal and Microsoft Console

Source code for Microsoft’s traditional Console and new Terminal programs is available at . A PowerShell script, Generate-CodepointWidthsFromUCD.ps1, is used to generate code point ranges for character width classification. It consults UCD properties and an XML file, unicode_width_overrides.xml; the latter specifies blocks to be treated as wide regardless of UCD properties. The script is used to generate CodepointWidthDetector.cpp.
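The general shape of such a generator (property lookup, then overrides, then collapsing into ranges) can be sketched as follows. This is not the Microsoft script and makes no claim about its actual logic or output format; the function name and the overrides parameter are hypothetical, and Python's unicodedata module stands in for the UCD files.

```python
import unicodedata


def wide_ranges(first: int, last: int, overrides=()):
    """Collapse wide code points in [first, last] into inclusive ranges.

    `overrides` is a sequence of (low, high) pairs that are treated as wide
    regardless of their East_Asian_Width property.
    """
    ranges = []
    for cp in range(first, last + 1):
        wide = (
            unicodedata.east_asian_width(chr(cp)) in ("W", "F")
            or any(lo <= cp <= hi for lo, hi in overrides)
        )
        if not wide:
            continue
        if ranges and ranges[-1][1] == cp - 1:
            ranges[-1][1] = cp       # extend the current run
        else:
            ranges.append([cp, cp])  # start a new run
    return [tuple(r) for r in ranges]
```

Calling wide_ranges(0xFF21, 0xFF3A) covers the fullwidth Latin capital letters and returns a single range, while passing an override pair forces an otherwise narrow span to be reported as wide.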

GNOME Virtual TErminal (VTE)

Source code for GNOME’s Virtual TErminal (VTE) is available at . It contains a document, ambiguous.txt, that describes the general approach taken to classify character width.

KDE Konsole

Source code for KDE’s Konsole is available at . Width classification is performed by uni2characterwidth.cpp in consultation with UCD properties. An adjacent overrides.txt file indicates ranges of code points for which classification presumably differs from what would otherwise be determined by UCD properties.

iTerm2

iTerm2 is a terminal emulator for macOS. Source code for it is available at . Character width classification is performed by NSCharacterSet+iTerm.m with assistance from an eastasian.py Python script.

Conclusion

It is clear that the specification that C++ is currently using is an approximation of behavior across various platforms and that it does not produce the desired results in all cases. Likewise, it is clear that various terminal and console applications depend on correct character width classification in order to behave as intended. Each of the projects referenced above has gone to considerable effort to ensure expected behavior across a large portion of the Unicode character set, including emoji. An extensive review of the classification strategies employed by each seems likely to reveal differences in behavior, at least some of which are no doubt unintended.

The ecosystem would benefit from a character width classification interface exposed by ICU. Given ICU’s centrality and ubiquity, the availability of such an interface would 1) avoid redundant work that requires expert-level Unicode knowledge, 2) provide a de facto industry standard reference for monospace font designers and terminal/console developers, and 3) enable an iterative approach to refining, improving, and unifying display width expectations for monospace applications.

Activity

Markus Scherer July 13, 2023 at 4:37 PM

Discussion: An estimate will always be off; it would be best if the application could talk to the terminal’s layout engine and fonts.

On the other hand, if there is a competent spec that terminal emulators are going to implement, then we could implement something based on that spec.

Tom Honermann June 16, 2023 at 4:11 PM

Yes, L2/23-107 provides a much more comprehensive analysis of the problem space and I would expect any specification produced by the new WG to suffice to resolve this request once that specification is implemented in ICU.

You are correct that the best we can do today is to estimate display width based on expectations of how existing terminal applications behave and how fonts intended for use with terminal applications are designed. My goal in filing this issue was to further establish existing practice with the hope that doing so would help to reduce divergent behavior across terminal-based applications. This goal seems to me to match the goal of the new WG and seems likely to require specifying a new class of layout engine and font designed such that character display widths match expectations for display in a grid of fixed width columns; a class that overlaps substantially with monospace fonts (but is probably neither a subset nor a superset).

Rich Gillam June 15, 2023 at 10:56 PM

Is this request covered by the work described in https://www.unicode.org/L2/L2023/23107-terminal-suppt.pdf ? The last UTC meeting spawned a working group to discuss this proposal in more detail; I don’t know what the current status of it is, but the relevant section of the UTC minutes said this:

F.6 Complex Script Support in Text Terminals [Li, L2/23-107]

Discussion.

[175-C12] Consensus: UTC supports the formation of a limited duration working group under the PAG to work on text terminal support issues, chaired by Dustin Howett.

[175-A53] Action Item for Dustin Howett: Create a working group with at least the following individuals: Manish Goregaokar, Robin Leroy, Roozbeh Pournader, Mark Shoulson, Liang Hai, Steven Loomis, Jan Kučera, Ned Holbrook; and report to UTC #176 on plans for engagement with other stakeholders and on the development plan for a specification.

Markus Scherer June 15, 2023 at 4:30 PM

Any width estimate without access to the actual layout engine and fonts is always, well, an estimate. It is not clear what a single implementation in a common library should target – it seems like it will always be off for a given actual use case.

Details

Time Needed

Weeks

Created April 3, 2023 at 4:15 AM
Updated July 13, 2023 at 4:38 PM