Support for estimating the monospace display width of a sequence of EGCs in a terminal or console
Activity
Markus Scherer July 13, 2023 at 4:37 PM
Discussion: An estimate will always be off; it would be best if the application could talk to the terminal’s layout engine and fonts.
On the other hand, if there is a competent spec that terminal emulators are going to implement, then we could implement something based on that spec.
Tom Honermann June 16, 2023 at 4:11 PM
, yes, L2/23-107 provides a much more comprehensive analysis of the problem space and I would expect any specification produced by the new WG to suffice to resolve this request once that specification is implemented in ICU.
, you are correct that the best we can do today is to estimate display width based on expectations of how existing terminal applications behave and how fonts intended for use with terminal applications are designed. My goal in filing this issue was to further establish existing practice with the hope that doing so would help to reduce divergent behavior across terminal-based applications. This goal seems to me to match the goal of the new WG and seems likely to require specifying a new class of layout engine and font designed such that character display widths match expectations for display in a grid of fixed width columns; a class that overlaps substantially with monospace fonts (but is probably neither a subset nor a superset).
Rich Gillam June 15, 2023 at 10:56 PM
Is this request covered by the work described in https://www.unicode.org/L2/L2023/23107-terminal-suppt.pdf ? The last UTC meeting spawned a working group to discuss this proposal in more detail; I don’t know what the current status of it is, but the relevant section of the UTC minutes said this:
F.6 Complex Script Support in Text Terminals [Li, L2/23-107]
Discussion.
[175-C12] Consensus: UTC supports the formation of a limited duration working group under the PAG to work on text terminal support issues, chaired by Dustin Howett.
[175-A53] Action Item for Dustin Howett: Create a working group with at least the following individuals: Manish Goregaokar, Robin Leroy, Roozbeh Pournader, Mark Shoulson, Liang Hai, Steven Loomis, Jan Kučera, Ned Holbrook; and report to UTC #176 on plans for engagement with other stakeholders and on the development plan for a specification.
Markus Scherer June 15, 2023 at 4:30 PM
Any width estimate without access to the actual layout engine and fonts is always, well, an estimate. It is not clear what a single implementation in a common library should target – it seems like it will always be off for a given actual use case.
Details
Assignee: Unassigned
Reporter: Tom Honermann
Priority: assess
Time Needed: Weeks

Overview
This is a request for ICU to provide a function that, given a sequence of extended grapheme clusters, returns an estimate of the number of display columns that a typical computer terminal or console that uses a monospaced font would consume when displaying the text.
Estimated display widths in C++
C++20 introduced (via P0645) a new type-safe std::format facility as an alternative to the venerable printf function inherited from C. This facility was extended (via P1868, still in C++20) to better support formatting text for tabular display in a typical terminal or console that is configured to use a monospace font but displays some characters across more than one column. The original method used to estimate the display width of a character sequence was based on a specified set of code point ranges. A new method was adopted (via P2675, for C++23 but as a defect report for C++20) that computes the display width based on character properties (East_Asian_Width="W" and East_Asian_Width="F") and assigned code blocks (Yijing Hexagram Symbols, Miscellaneous Symbols and Pictographs, and Supplemental Symbols and Pictographs). Analysis and testing revealed that this approach 1) better matches the algorithms used by common terminal emulators, and 2) better matches the observed behavior on multiple operating systems. The specification for this behavior in the working draft of the C++ standard can be found in [format.string.std]p13.
The C++ standard currently assigns a display width of either 1 or 2 to every EGC. This is incorrect for some EGCs. In particular, EGCs whose base character is one of the following behave differently (see P2572 for related discussion of the use of these characters as fill characters).
U+0007 BELL
U+0008 BACKSPACE
U+007F DELETE
U+0009 CHARACTER TABULATION
U+000A LINE FEED (LF)
U+000B LINE TABULATION
U+000C FORM FEED (FF)
U+000D CARRIAGE RETURN (CR)
U+0085 NEXT LINE (NEL)
U+FDFD ARABIC LIGATURE BISMILLAH AR-RAHMAN AR-RAHEEM
Various other control characters
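To illustrate why a fixed width of 1 or 2 does not fit several of the control characters listed above, here is a minimal sketch of how a terminal-like cursor might move. The function names advance_column and final_column are hypothetical, and the sketch assumes 0-based columns, tab stops every 8 columns, no line wrapping, and that every other code point is one column wide. Real terminals differ in the details.

```cpp
#include <algorithm>
#include <cassert>
#include <string_view>

// Hypothetical helper: the cursor column after a terminal processes cp.
// Assumptions: 0-based columns, fixed tab stops, no line wrapping.
int advance_column(int column, char32_t cp, int tab_stop = 8) {
    switch (cp) {
        case U'\a': return column;                              // BELL: cursor does not move
        case U'\b': return std::max(column - 1, 0);             // BACKSPACE: one column left
        case U'\t': return (column / tab_stop + 1) * tab_stop;  // TAB: jump to next tab stop
        case U'\r': return 0;                                   // CR: back to the first column
        default:    return column + 1;                          // assumed narrow (simplification)
    }
}

// Column reached after processing every code point, starting at column 0.
int final_column(std::u32string_view text) {
    int column = 0;
    for (char32_t cp : text) {
        column = advance_column(column, cp);
    }
    return column;
}
```

Under these assumptions, final_column(U"ab\tc") leaves the cursor at column 9, whereas counting one column per character would predict 4. The effect of such characters depends on where the cursor already is, so no single per-EGC width can describe them.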
A fairly comprehensive review of the behavior of various terminals and consoles with respect to the characters that changed estimated width with the adoption of the P2675 proposal is available here (the broken image link there is intended to link to this image).
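The property-and-block rule adopted via P2675 can be sketched roughly as follows. This is not the standard library's implementation: the names Range, kWideRanges, code_point_width, and text_width are hypothetical, the table is a small illustrative subset of the code points with East_Asian_Width "W" or "F" plus the three named blocks, and each code point is treated as its own EGC.

```cpp
#include <cassert>
#include <string_view>

struct Range {
    char32_t first;
    char32_t last;
};

// Illustrative subset only; a complete table would be generated from the UCD.
const Range kWideRanges[] = {
    {0x1100, 0x115F},    // Hangul Jamo initial consonants (East_Asian_Width=W)
    {0x3041, 0x3096},    // Hiragana letters (East_Asian_Width=W)
    {0x4DC0, 0x4DFF},    // Yijing Hexagram Symbols (block)
    {0x4E00, 0x9FFF},    // CJK Unified Ideographs (East_Asian_Width=W)
    {0xAC00, 0xD7A3},    // Hangul Syllables (East_Asian_Width=W)
    {0xFF01, 0xFF60},    // Fullwidth forms (East_Asian_Width=F)
    {0x1F300, 0x1F5FF},  // Miscellaneous Symbols and Pictographs (block)
    {0x1F900, 0x1F9FF},  // Supplemental Symbols and Pictographs (block)
};

// Returns 2 if cp falls in one of the wide ranges, otherwise 1.
int code_point_width(char32_t cp) {
    for (const Range& r : kWideRanges) {
        if (cp >= r.first && cp <= r.last) {
            return 2;
        }
    }
    return 1;
}

// Sums the widths of all code points.
// Simplification: assumes no multi-code-point grapheme clusters in text.
int text_width(std::u32string_view text) {
    int width = 0;
    for (char32_t cp : text) {
        width += code_point_width(cp);
    }
    return width;
}
```

Note that a rule of this shape returns 1 for U+0009 and U+FDFD, since every code point receives either 1 or 2.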
Display widths in common terminal and console applications
Evidence that the methodology adopted via P2675 is reflective of existing practice is available as described below. Please note that I have not extensively studied the referenced sources nor how they impact the behavior of the referenced application.
Microsoft Terminal and Microsoft Console
Source code for Microsoft’s traditional Console and new Terminal programs is available at . A PowerShell script, Generate-CodepointWidthsFromUCD.ps1, is used to generate code point ranges for character width classification. It consults UCD properties and an XML file, unicode_width_overrides.xml; the latter specifies code blocks to be treated as wide regardless of UCD properties. The script is used to generate CodepointWidthDetector.cpp.
GNOME Virtual TErminal (VTE)
Source code for GNOME’s Virtual TErminal (VTE) is available at . It contains a document, ambiguous.txt, that describes the general approach taken to classify character width.
KDE Konsole
Source code for KDE’s Konsole is available at . Width classification is performed by uni2characterwidth.cpp in consultation with UCD properties. An adjacent overrides.txt file indicates ranges of code points for which classification presumably differs from what would otherwise be determined by UCD properties.
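The general pattern described for the Microsoft and KDE projects, a table derived from UCD properties with a project-specific override list layered on top, might look something like the sketch below. It is not taken from either code base. All names (Width, WidthRange, lookup, classify) and both tables are hypothetical, made-up example data rather than the contents of any project's files.

```cpp
#include <cassert>
#include <optional>
#include <vector>

enum class Width { Narrow = 1, Wide = 2 };

struct WidthRange {
    char32_t first;
    char32_t last;
    Width width;
};

// Hypothetical stand-in for a table generated from UCD properties.
const std::vector<WidthRange> kBase = {
    {0x4E00, 0x9FFF, Width::Wide},  // CJK Unified Ideographs
    {0xAC00, 0xD7A3, Width::Wide},  // Hangul Syllables
};

// Hypothetical stand-in for a hand-maintained override list.
const std::vector<WidthRange> kOverrides = {
    {0x1F300, 0x1F5FF, Width::Wide},  // treat a whole block as wide
};

// Returns the width recorded for cp in table, if any range covers it.
std::optional<Width> lookup(const std::vector<WidthRange>& table, char32_t cp) {
    for (const WidthRange& r : table) {
        if (cp >= r.first && cp <= r.last) {
            return r.width;
        }
    }
    return std::nullopt;
}

// Overrides take precedence; otherwise fall back to the base table,
// and finally to Narrow for anything neither table mentions.
Width classify(char32_t cp,
               const std::vector<WidthRange>& overrides,
               const std::vector<WidthRange>& base) {
    if (auto w = lookup(overrides, cp)) {
        return *w;
    }
    if (auto w = lookup(base, cp)) {
        return *w;
    }
    return Width::Narrow;
}
```

Keeping the overrides in a separate table means the UCD-derived data can be regenerated for a new Unicode version without losing the project's manual adjustments.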
iTerm2
iTerm2 is a terminal emulator for macOS. Source code for it is available at . Character width classification is performed by NSCharacterSet+iTerm.m with assistance from an eastasian.py Python script.
Conclusion
It is clear that the specification that C++ is currently using is an approximation of behavior across various platforms and that it does not produce the desired results in all cases. Likewise, it is clear that various terminal and console applications depend on correct character width classification in order to behave as intended. Each of the projects referenced above has gone to considerable effort to try to ensure expected behavior across a large portion of the Unicode character set, including emoji. An extensive review of the classification strategies employed by each seems likely to reveal differences in behavior, at least some of which are no doubt unintended.
The ecosystem would benefit from a character width classification interface exposed by ICU. Given ICU’s centrality and ubiquity, the availability of such an interface would 1) avoid redundant work that requires expert-level Unicode knowledge, 2) provide a de facto industry standard reference for monospace font designers and terminal/console developers, and 3) enable an iterative approach to refining, improving, and unifying display width expectations for monospace applications.