Change in SimpleDateFormat parsing behavior for certain numeric tokens

Description

Hi all, As described in a previous issue, ICU-22179, a long-standing ICU behavior has changed when a strict SimpleDateFormat parses certain numeric tokens. The change seems to have been introduced in ICU 71. The following tokens are affected - when parsing, they now require the numeric field to have at least as many digits as the token has letters:

All the above-mentioned values used to be parsed successfully for the corresponding tokens before ICU 71, but some of them now fail to parse. Note that other numeric tokens, such as MM, HH, and yy, still follow the old rules and accept a single-digit value in ICU 71.
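To make the rule change concrete, here is a toy C++ model of the two behaviors. It is only a sketch and is not ICU's implementation: the function parseNumericField and its minDigits and maxDigits parameters are hypothetical names introduced for illustration. The old behavior corresponds to a minimum of one digit for a numeric field; the new behavior for the affected tokens corresponds to a minimum equal to the token length.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical helper (NOT an ICU API): parse one numeric field of `text`
// starting at `pos`, consuming at most `maxDigits` digits.
// On success, stores the number in `value`, advances `pos` past the digits,
// and returns true. Returns false, leaving `pos` and `value` untouched,
// when fewer than `minDigits` digits are present.
bool parseNumericField(const std::string& text, std::size_t& pos,
                       std::size_t minDigits, int& value,
                       std::size_t maxDigits = 9) {
    // maxDigits <= 9 keeps the accumulated value within the range of int.
    assert(minDigits >= 1 && minDigits <= maxDigits && maxDigits <= 9);

    std::size_t cur = pos;
    int parsed = 0;
    while (cur < text.size() && cur - pos < maxDigits &&
           std::isdigit(static_cast<unsigned char>(text[cur]))) {
        parsed = parsed * 10 + (text[cur] - '0');
        ++cur;
    }

    if (cur - pos < minDigits) {
        return false;  // field is shorter than the required minimum width
    }
    value = parsed;
    pos = cur;
    return true;
}
```

In this model, parsing the text "5/03" with minDigits = 1 accepts the leading 5, while minDigits = 2 (a two-letter token under the new rule) rejects it; "05/03" is accepted either way.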

This issue was raised in ICU-22179, and we understand the response there was that this is expected behavior and that "strict parsing should be strict". However, the old behavior has been in ICU for years, and we use ICU heavily for parsing and formatting dates and times. Over the years our users (and others, as mentioned in ICU-22179) have come to rely on it. So it would be great if there were a way to preserve the old behavior, perhaps through an explicit attribute, so that people who need it for compatibility reasons can still get it.

A couple of things that were suggested in ICU-22179 that do not help us address this issue:

  1. Setting the calendar/formatter to lenient does not work for us, as it makes the formatter lenient for a bunch of other cases where we prefer strict behavior.

  2. Another suggestion in ICU-22179 was to use one of the DateFormat.BooleanAttribute values; however, none of the existing attributes provides a way to control this specific behavior.

  3. Since this impacts existing code written by other users of our software, changing the format to d/M/u (and all the other impacted tokens) is not feasible.

The following cpp code successfully parses the string in ICU70 but fails starting ICU71:

Activity

Peter Edberg
March 30, 2023 at 8:54 PM

It is a duplicate, and it was fixed under that ticket, with some tests added that are specific to this ticket.

Peter Edberg
March 29, 2023 at 8:35 PM

This ticket may be a duplicate of https://unicode-org.atlassian.net/browse/ICU-22337 (or vice versa).

Siddharth Bhutiya
February 10, 2023 at 3:08 PM

Just wanted to mention this here, in case it helps with the investigation. I was going through the code, and it seems that the behavior changed after the work done in https://unicode-org.atlassian.net/browse/ICU-21802.

Markus Scherer
February 9, 2023 at 5:55 PM

Rich to look at whether this change was intentional (ticket and/or API change proposal).

Could this have to do with parsing immediately-adjacent numeric fields (without separator)? See the examples above.

We should have good test coverage for the detailed parsing behavior.

It does seem weird that it’s differently strict for parsing days vs. months.

We should document the intended behavior.

Siddharth Bhutiya
February 3, 2023 at 7:01 PM

One clarification: we are experiencing this in ICU4C. The other issue mentioned in the description, https://unicode-org.atlassian.net/browse/ICU-22179, was for ICU4J.

Fixed by Other Ticket

Details

Time Needed

Days

Created February 3, 2023 at 6:44 PM
Updated March 30, 2023 at 8:54 PM
Resolved March 30, 2023 at 8:54 PM