
RFE: Data Synchronization and Age Verification

Description

Deleted Component: data

= Data Synchronization and Age Verification - 2007 May 9th =

ICU Data is typically built by ICU4C. Beyond the major+minor (i.e. '37') number which is part of standard ICU versioning [ see: 'ICU binary compatibility' in the users guide ], ICU does not contain any versioning on the data files themselves.

With tools to customize ICU data (such as the datacustomizer tool, icupkg from the command line, etc.), it will become increasingly important to determine what version a given set of ICU data actually is. There has been discussion over the years of splitting out the data as a separate sub-project. We are getting closer to realizing this (most recently, via a detailed plan that Ram made), and that sub-project would need a version of its own.

Another complication is the frequent lack of synchronization between ICU for C and ICU for J with regard to the contents of maintenance releases. ICU4C version A.B.X and ICU4J version A.B.Y may not have the same data, or even the same data structure. This is especially problematic when the ICU4J version of the data does not correspond to any known version of C, making reproducibility of the J build questionable.

This note introduces (A) a new concept, that of a 'Data Version', (B) some policies about the Data Version, (C) proposed storage, shipping and handling of the Data Version, and finally (D) some API ideas dealing with the Data Version.

== A. Data Version ==

ICU's data should have a version number altogether separate from that of ICU itself. Currently, the ICU release version is embedded within C and J:

  J: icu4j/trunk/src/com/ibm/icu/util/VersionInfo.java: ICU_VERSION = getInstance(3, 7, 1, 0);
  C: icu/trunk/source/common/unicode/uversion.h: #define U_ICU_VERSION "3.7.1"

This document proposes a parallel, but separate, data version:

  J: icu4j/trunk/src/com/ibm/icu/util/VersionInfo.java: ICU_DATA_VERSION = getInstance(3, 7, 5, 0);
  C: icu/trunk/source/common/unicode/uversion.h: #define U_ICU_DATA_VERSION "3.7.5"

( note: here and following, developmental version 3.7 is used as an example, although it is mostly relevant to stable versions 3.6/3.8/etc. )
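As an illustration of how a string such as "3.7.5" relates to the four-field form 3.7.5.0, here is a minimal sketch in plain C. The `DataVersion` type and `parse_version` function are hypothetical names used only for this example; they are not part of the proposal. ICU4C's uversion.h already offers `u_versionFromString` for the same job, so real code would likely use that instead.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define VERSION_FIELDS 4

/* Hypothetical stand-in for ICU's UVersionInfo: four numeric fields. */
typedef uint8_t DataVersion[VERSION_FIELDS];

/*
 * Parse a dotted version string such as "3.7.5" into four fields.
 * Missing trailing fields are left at 0, so "3.7.5" becomes 3.7.5.0.
 * Returns 1 on success, 0 if the string is malformed.
 */
int parse_version(const char *text, DataVersion out)
{
    int i;

    assert(text != NULL && out != NULL);
    for (i = 0; i < VERSION_FIELDS; i++) {
        out[i] = 0;
    }

    for (i = 0; i < VERSION_FIELDS; i++) {
        char *end;
        unsigned long field;

        if (*text < '0' || *text > '9') {
            return 0;               /* every field must start with a digit */
        }
        field = strtoul(text, &end, 10);
        if (field > 255) {
            return 0;               /* each field must fit in one byte */
        }
        out[i] = (uint8_t)field;
        if (*end == '\0') {
            return 1;               /* end of string: done */
        }
        if (*end != '.') {
            return 0;               /* only '.' may separate fields */
        }
        text = end + 1;
    }
    return 0;                       /* more than four fields */
}
```

With this sketch, `parse_version("3.7.5", v)` fills `v` with 3, 7, 5, 0, matching the `getInstance(3, 7, 5, 0)` form used on the J side.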

== B. Policies about Data Version ==

  • The data version must have the same major+minor as ICU release.

  • In the code base, the head data version of both C and J for a particular maintenance release must be kept the same. For example, during the development timeframe, the /trunk/ data versions of both C and J should be the same (such as 3.7.5.0, above).

  • In case of a maintenance update, the /maint-3.7/ branch of both C and J must be kept in sync as to data version, regardless of whether the other project actually has a release or not. In other words, when the ICU4J 3.7.2 maintenance release comes out and wants new data, ICU4C's /maint-3.7/ branch must be updated to have a new data version 3.7.6, as must J's. This updated ICU4C maint-3.7 should be tagged as "data/data-3.7.6" so that it can be reproduced should ICU4J need to rebuild.

Note: It is assumed that newer versions of the data are better. Therefore, after an ICU4J-requested update to the data, ICU4C's next maintenance version will pick it up automatically if this scheme is followed.

Data version is incremented as follows:

  • Whenever a new major+minor version of ICU is released, or when the development trunk goes to a development number, the data version follows suit. So, when 3.7 work begins with ICU version 3.7.0.0, the data version is also set to 3.7.0.0. When ICU's version number is incremented to 3.8, the data version is set to 3.8.0.0.

  • Whenever a maintenance release comes out, the third number in the version (x.x.#.0) must be incremented by one, and the fourth number set to zero.

  • Whenever a patch release comes out, the fourth number in the version (x.x.x.#) must be incremented by one.

--------
Perhaps some side by side examples would help at this point.

  • 3.7
    /trunk ICU4J=3.7.0.0 C=3.7.0.0 D=3.7.0.0
    tag C -> /data/data-3.7.0.0

  • ICU4J requests some data to be updated for ICU4J 3.7.1
    /maint-37 ICU4J=3.7.1.0 C=3.7.0.0* D=3.7.1.0
    tag C -> /data/data-3.7.1.0

  • ICU4C is going to ship its own 3.7.1, with other updated data, but also incorporating (as policy) the previous data update.
    /maint-37 ICU4J=3.7.1.0* C=3.7.1.0 D=3.7.2.0
    tag C, etc

  • ICU4J realizes that 3.7.1 data had a grave data error ('tlh' locale has extra U+263B in date/time pattern) and must issue a patch.
    /branch-3701 ICU4J=3.7.1.1 C=3.7.0.0* D=3.7.1.1
    tag C -> /data/data-3.7.1.1

  • ICU4C is now going to release ICU4C 3.7.2, including this data patch and other new data.
    /maint-37 ICU4J=3.7.1.0* C=3.7.2.0 D=3.7.3.0

  • ICU4C releases 3.7.3, with no changes to the data
    /maint-37 ICU4J=3.7.1.0* C=3.7.3.0 D=3.7.3.0

(asterisk * denotes that the ICU version is out of step with a version which already shipped, i.e. data has been updated, but the ICU version number has not updated yet.)

== C. Proposed Storage ==

The proposal is to add two bundles, icuver.res and icustd.res. They could be in their own (non-locale) tree if desired.

  icuver {
      DataVersion { "3.7.3.0" }
      ICUVersion { "3.7.2.0" }
  }

  icustd {
      StandardICU{}
  }

icuver contains the data version, as described above, and the version of ICU which it was originally targeting for packaging ( which would be the ICU4C version, given discrepancies marked with an asterisk above. )

icustd is really just a sentinel value. ICU tools that package or repackage data would strip out the icustd file if they found it, except during an original ICU build. [ Issue: should/could the makefile detect the presence of reslocal.mk and other local files, and suppress icustd if they are present? What about standard items removed from the makefile? There is no easy way to detect modification without "signing" the files. ICU APIs would simply detect whether the file is present or not. ]
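The stripping rule for the sentinel could look roughly like the following sketch. It is not taken from icupkg or any existing ICU tool: the function name, the in-memory item list, and the `is_original_build` flag are all assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Sentinel item name, using the icustd.res file name from the proposal. */
#define SENTINEL_ITEM "icustd.res"

/*
 * Filter a package's item list in place and return the new item count.
 * is_original_build: nonzero for an unmodified ICU build, which keeps
 * the sentinel; any other packaging or repackaging run drops it.
 */
size_t filter_package_items(const char **items, size_t count, int is_original_build)
{
    size_t kept = 0;
    size_t i;

    assert(items != NULL || count == 0);
    if (is_original_build) {
        return count;               /* original build: leave everything */
    }
    for (i = 0; i < count; i++) {
        if (strcmp(items[i], SENTINEL_ITEM) != 0) {
            items[kept++] = items[i];
        }
    }
    return kept;
}
```

A repackaging run over { icuver.res, icustd.res, root.res } would therefore produce { icuver.res, root.res }, so later readers of the package can tell that it is no longer standard.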

( An unrelated proposal is to have the CLDR Version stored in res_index.res )

== D. API Ideas ==

ICU should function correctly with data newer than expected. Problems usually occur if data is older than expected, especially if the user wanted a new ICU to get a newer data set. ( We are not discussing ICU with different major+minor, those are expected to be rejected out of hand.) On the other hand, ICU should be able to fail gracefully with older data, and so it cannot be an automatic hard failure.
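The cases described above (different major+minor, older, same, newer) can be made concrete with a small classification sketch. The enum and function names are hypothetical and only illustrate the intended policy; how a caller reacts to each result is left open.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef enum {
    DATA_INCOMPATIBLE,  /* different major+minor: rejected out of hand */
    DATA_OLDER,         /* usable, but the caller should be able to find out */
    DATA_SAME,          /* exactly the data version ICU was built against */
    DATA_NEWER          /* expected to function correctly */
} DataAge;

/* Compare the data version found at run time with the one ICU expects. */
DataAge classify_data_version(const uint8_t expected[4], const uint8_t found[4])
{
    int i;

    assert(expected != NULL && found != NULL);
    if (found[0] != expected[0] || found[1] != expected[1]) {
        return DATA_INCOMPATIBLE;
    }
    /* Same major+minor: order by maintenance field, then patch field. */
    for (i = 2; i < 4; i++) {
        if (found[i] < expected[i]) {
            return DATA_OLDER;
        }
        if (found[i] > expected[i]) {
            return DATA_NEWER;
        }
    }
    return DATA_SAME;
}
```

Only `DATA_INCOMPATIBLE` would be a hard failure in this sketch. `DATA_OLDER` is reported rather than rejected, in line with the requirement that older data must not cause an automatic hard failure.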

The functionality can be split up any number of ways, but the basic ideas are covered by:

  UBool u_isDataOlder(UVersionInfo *dataVersionFillin, UBool *isModifiedFillin);

This function loads icuver and compares its DataVersion to the wired-in U_ICU_DATA_VERSION. If icuver shows something less than U_ICU_DATA_VERSION, the function returns true, else false. Additionally, the version found is returned in the first fill-in parameter (if non-NULL), and *isModifiedFillin is set to true if "icustd" is NOT found. Thus, if the data has been repackaged or modified, "icustd" (standard ICU) will be missing, and the function will alert the caller that the data is not standard.
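The behaviour described above can be sketched without a real ICU build by passing in what the loader would have found. Everything here is an assumption for illustration: `LoadedDataInfo`, `kWiredInDataVersion` and `sketch_isDataOlder` are hypothetical names, plain `int` and `uint8_t[4]` stand in for UBool and UVersionInfo, and the actual bundle loading is left out.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical: stands in for the compiled-in U_ICU_DATA_VERSION (3.7.5.0). */
static const uint8_t kWiredInDataVersion[4] = {3, 7, 5, 0};

/* Hypothetical: what a loader would report after looking for icuver and icustd. */
typedef struct {
    uint8_t dataVersion[4]; /* DataVersion read from icuver */
    int hasIcustd;          /* nonzero if the icustd sentinel bundle was found */
} LoadedDataInfo;

/*
 * Returns 1 if the loaded data is older than the wired-in data version,
 * else 0. Both fill-in parameters are optional and may be NULL.
 */
int sketch_isDataOlder(const LoadedDataInfo *loaded,
                       uint8_t dataVersionFillin[4],
                       int *isModifiedFillin)
{
    assert(loaded != NULL);
    if (dataVersionFillin != NULL) {
        memcpy(dataVersionFillin, loaded->dataVersion, 4);
    }
    if (isModifiedFillin != NULL) {
        /* A missing sentinel means the data was repackaged or modified. */
        *isModifiedFillin = !loaded->hasIcustd;
    }
    /* memcmp compares unsigned bytes left to right, which gives the same
       ordering as comparing the four version fields in turn. */
    return memcmp(loaded->dataVersion, kWiredInDataVersion, 4) < 0;
}
```

For data at 3.7.3.0 with no icustd, this sketch returns 1 and sets the modified flag. For data at 3.7.5.0 with icustd present, it returns 0 and clears the flag.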

Assignee: mow@icu-project.org
Reporter: Steven R. Loomis
Labels: None
Reviewer: None
Time Needed: Weeks
Start date: None
Priority: medium