W3C TTWG re: CLDR supplemental data for subtitle and caption characters

Description

The W3C Timed Text Working Group (TTWG) http://www.w3.org/AudioVideo/TT/ develops specifications
for subtitle and caption delivery applications. In the process, it has
collected sets of characters (for selected locales) that have proven
useful in such applications. These sets, documented at
https://dvcs.w3.org/hg/ttml/raw-file/tip/ttml-ww-profiles/ttml-ww-profiles.html#recommended-unicode-code-points-per-language ,
are derived in part from the analysis of video content intended for
consumption in homes, created by broadcasters or other content providers.
The TTWG notes that Unicode CLDR does not include characters specifically
intended for subtitling/captioning applications, e.g. the QUARTER NOTE
(U+2669) character.

The TTWG therefore suggests that Unicode consider adding the following
"Supplemental Subtitle/Caption Character Data" to the CLDR
supplemental data http://www.unicode.org/reports/tr35/tr35-39/tr35-info.html .

https://dvcs.w3.org/hg/ttml/raw-file/bc0f3b1a9104/ttml-ww-profiles/cldr-supplemental-data/cldr-sub-cap-supplemental-data.xml

This data is organized in localized sets of characters
(<localizedSet>) that can include, by reference, common sets of
characters (<commonSet>) that are reused across multiple localized
sets.
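
As a rough illustration of the include-by-reference structure described above, the following Python sketch resolves a localized set into its full character set. The dictionary layout, the set names and the sample characters are assumptions made for illustration only; they are not taken from the proposed XML file.

```python
# Hypothetical data: names and contents are illustrative, not from the proposal.
COMMON_SETS = {
    "base": {"A", "B", "C", "\u2669"},  # U+2669 QUARTER NOTE
}

LOCALIZED_SETS = {
    # A localized set lists its own characters and the common sets it includes.
    "he": {"includes": ["base"], "chars": {"\u05d0", "\u05d1"}},
}


def resolve(locale):
    """Return the localized characters plus every referenced common set."""
    entry = LOCALIZED_SETS[locale]
    resolved = set(entry["chars"])
    for name in entry["includes"]:
        resolved |= COMMON_SETS[name]
    return resolved
```

Under these assumptions, resolve("he") contains both the Hebrew letters and the characters of the 'base' common set, so a shared character such as QUARTER NOTE only has to be maintained in one place.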

The localized sets would be maintained according to the following
(non-mutually-exclusive) rules:

  • each localized set should include all characters that can be found
    in subtitles/captions intended for presentation in the locale

  • each localized set should include the 'base' common set

  • when adding exemplar characters to the main, punctuation or number
    sets associated with a locale, the same characters should be added to
    the corresponding <localizedSet> unless inappropriate for
    subtitle/caption applications

The objective of this request is to encourage the creation of a common
set of characters for subtitling and captioning applications that can
be referenced by W3C and other organizations, enhancing the chances
that subtitles/captions are presented consistently across systems.

The TTWG is available to provide additional information and looks forward
to hearing from, and working with, the Unicode Consortium.

Kind regards,

Nigel Megitt (nigel.megitt@bbc.co.uk)
Co-chair, W3C Timed Text Working Group

This ticket expands and formalizes

xpath

None

locale

None

Activity

Addison Phillips
May 20, 2020, 4:10 PM

Pinging this issue, which has been unscheduled for a long time. Do you need anything from W3C’s WGs (I18N or TTML)?

TracBot
May 9, 2019, 11:50 PM
Trac Comment 11 by —2018-10-17T15:34:50.280Z

CLDR 34 BRS closing item, move all upcoming → UNSCH

TracBot
May 9, 2019, 11:50 PM
Trac Comment 10 by addison@a2a8283864386ca0—2018-04-02T19:06:13.346Z

I was tasked [1] by the W3C I18N WG with adding to this thread. We recently [2] discussed this issue with the IMSC folks, resulting in some changes to the terminology they are using (see [3]). My action item here is to ask that CLDR consider the requests in this ticket carefully. Captions do have some specific common character usage that is different from normal/regular usage (they cite the use of the musical note character, for example). W3C-I18N thinks that IMSC maintaining any sort of character list is counter-productive compared to using CLDR as a reference.

[1] https://www.w3.org/International/track/actions/699
[2] https://lists.w3.org/Archives/Public/www-international/2018JanMar/0126.html
[3] https://github.com/w3c/imsc/issues/236

TracBot
May 9, 2019, 11:50 PM
Trac Comment 9 by pal@2ea29fe006e555f7—2016-12-05T06:08:21.982Z

Input from the IMSC1 editor below:

1) Ok.

2)a) Yes.

2)b) Yes.

3) Tables 1 and 2 of IMSC1 were contributed by the DECE consortium (http://www.uvcentral.com/specs) and were derived from the analysis of subtitle content from actual home video titles, i.e. Blu-ray and DVD.

The recommended set for a given language (Table 2) always includes the Common Character Set of Table 1 (i.e. basic Latin characters and common symbols). Thus the Hebrew set includes all Latin characters of Table 1, in addition to the Hebrew-specific characters listed in Table 2.

The sets typically cast a broader rather than narrower net, presumably to avoid missing characters and to reflect subtitling practices, e.g. the "pl" set includes all of Latin Extended-A.

As requested, the file at [2] lists the characters that are included in Table 2 of IMSC1 [1] but not included in the union of (i) the main, auxiliary and punctuation exemplarCharacters and (ii) the symbols and defaultNumberingSystem characters. Ed.: the significant differences can be traced to the inclusion of the entire Latin Extended-A block and significant portions of the Cyrillic block for selected European sets.

[1] https://www.w3.org/TR/ttml-imsc1/#recommended-unicode-code-points-per-language
[2] http://www.sandflow.com/public/CLDR-report-20161204.txt
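
The comparison described above amounts to a set difference. The sketch below shows one way it could be computed; the function name and all sample data are invented for illustration and do not reproduce the contents of the report at [2] or of any CLDR locale.

```python
def missing_from_cldr(recommended, *cldr_sets):
    """Characters in the recommended set that no CLDR set covers."""
    covered = set().union(*cldr_sets)
    return sorted(recommended - covered)


# Made-up sample data standing in for one language's sets.
recommended = {"a", "\u0105", "\u0167", "\u2669"}
main = {"a", "\u0105"}
auxiliary = {"q"}
punctuation = {"."}
symbols_and_digits = {"0", "1"}

report = [
    f"U+{ord(ch):04X}"
    for ch in missing_from_cldr(
        recommended, main, auxiliary, punctuation, symbols_and_digits
    )
]
```

With this sample data, report lists U+0167 and U+2669, i.e. the two characters that appear in the recommended set but in none of the other sets.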

TracBot
May 9, 2019, 11:50 PM
Trac Comment 8 by —2016-11-30T14:30:58.118Z

A few comments.

  1) I agree that this does appear to be better suited as an additional exemplar set for the languages in question. We could add, for example,
    <exemplarCharacters type="caption">...</exemplarCharacters>.
  2) However, we need to get a better sense of the usage. The phrasing in "7.2 Recommended Character Sets" appears ambiguous: "A Document Instance should be authored using characters selected from the sets specified in B. Recommended Character Sets."
    a) I assume that it means "...//only// characters selected...".
    b) I also guess that there is an implicit directive for device suppliers, that they //should// support all the characters listed for each language their device supports. But that is just a guess. It would be useful to get a clarification of this clause.
  3) I would like to see a comparison of the characters in https://www.w3.org/TR/ttml-imsc1/#recommended-unicode-code-points-per-language to the main / aux characters for the languages in question. It may be that we can simply add the characters to aux to address some of the instances.
    a) However, it appears odd to require, say, that Hebrew support include U+0178 LATIN CAPITAL LETTER Y WITH DIAERESIS but not other Latin-script characters that are as likely to be found intermixed with Hebrew text. It would be useful background for us to understand some of the reasoning behind the selection of characters, or whether these were just hoovered up from existing captioning systems. The reason doesn't prevent us from adding the exemplar sets, but can help in our documentation to explain the reasoning behind the inclusion (e.g., for compatibility with systems X, Y, and Z).
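
To make the suggestion in item 1) more concrete, the following Python sketch reads a caption exemplar set from an LDML-like fragment. The type="caption" value is only the proposal made in this comment, not an existing CLDR exemplar type, and the fragment and its characters are made up. The parsing is deliberately naive: it splits on whitespace and does not handle UnicodeSet ranges or escapes.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment; type="caption" is a proposal, not part of CLDR.
LDML = """<ldml><characters>
  <exemplarCharacters>[a b c]</exemplarCharacters>
  <exemplarCharacters type="caption">[\u2669 \u266A]</exemplarCharacters>
</characters></ldml>"""


def caption_exemplars(ldml_text):
    """Return the characters of the proposed caption exemplar set, if any."""
    root = ET.fromstring(ldml_text)
    node = root.find(".//exemplarCharacters[@type='caption']")
    if node is None or not node.text:
        return set()
    # Naive parse: drop the enclosing brackets, split on whitespace.
    return set(node.text.strip().strip("[]").split())
```

A device supplier or validator could compare such a set against the characters its fonts support, which is the kind of usage that item 2)b) asks to have clarified.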

Priority

assess

Assignee

Shervin Afshar

Reporter

TracBot

Components