Asymmetric search capability

Description

Support asymmetric search: for example, typing e matches e, é, è; typing é matches é but probably not e and certainly not è. Requires design proposal and API enhancements.

Activity

TracBot
June 30, 2018, 11:31 PM
Trac Comment 3 by —2010-01-15T01:07:31.000Z

Sent proposal to icu-design on 2009-Dec-11, it was approved. The goal is to enable the following types of behavior:

A. Example: When searching using an English collator set to UCOL_SECONDARY or a finer strength (UCOL_TERTIARY, …):

1. A base letter ('e') in search pattern matches any secondarily-equivalent letter in the target searched text: e, è, é, ê, ë, …

AND POSSIBLY ALSO

2. A base letter in the target searched text matches any secondarily-equivalent letter in the search pattern (é in search pattern matches e in text).

BUT NOT the following (to do this, you would use a collator set to UCOL_PRIMARY):

3. An accented letter in the search pattern matches a letter in the target searched text with the same base but a different accent, i.e. é matches ë.

B. Example: When searching using an English collator set to UCOL_TERTIARY, behaviors A1 and A2 above plus:

1. Normal e in the search pattern matches any of the following in the target: fullwidth e, circled e, etc.

AND POSSIBLY

2. Fullwidth e in the search pattern matches normal e in the target

BUT NOT

3. Fullwidth e in the search pattern matches circled e in the target.

The idea is that the base/normal form of a character in the search pattern, or possibly in both the pattern and the target text, should be treated as a wildcard representing all of the other characters that are equivalent at a non-primary level. This can be implemented by optionally modifying how collation elements are compared in usearch_next, usearch_previous, etc., controlled by an attribute on UStringSearch. So my API proposal is:

Add the following for use with usearch_setAttribute / usearch_getAttribute:

  • a new value for USearchAttribute,

    USEARCH_ELEMENT_COMPARISON

  • corresponding new values for USearchAttributeValue,

    USEARCH_STANDARD_ELEMENT_COMPARISON (the current behavior)

    USEARCH_PATTERN_BASE_WEIGHT_IS_WILDCARD (this corresponds to #1 in the examples above)

    USEARCH_PATTERN_AND_TEXT_BASE_WEIGHT_IS_WILDCARD (this corresponds to #1 and #2 in the examples above)

(Here "base" does not mean the base character of a combining sequence, but rather the form that for a given secondary or tertiary collation level has the base weight, currently 0x05 in the ucol implementation; suggestions for alternate terminology are welcome)

Rough implementation sketch, ignoring details:

In usearch, collation elements to be compared are masked to zero out fields corresponding to weights below the strength of the collator. Also, accented letters such as 'é' in which (for the specified collator) the accent is not significant at primary level generate a sequence of collation elements in which the first is for the base letter and has non-zero primary weight, and the trailing elements have zero primary weight. So for option 1:

  • When comparing two elements with identical primary collation weights and differing weights at other levels, treat the elements as equal if, at each other level where the fields differ, the weight from the search pattern's element has the base value (0x05) for that field. For option 2, treat them as equal if the weight from either element has the base value.

  • When comparing two elements with different primary weights, if the element from the search pattern has primary weight 0 and the element from the target does not, then skip that element from the search pattern (treat as all 0) and instead compare the same target string element to the next element from the search pattern.
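
To make the first rule above concrete, here is a minimal, self-contained sketch in C. It is not the usearch implementation: the CE struct, the enum names, and the helper functions are hypothetical stand-ins, each element is assumed to be already masked to the collator strength, and the weights for e, é, ë, and f are made up for illustration. Only the base value 0x05 comes from the description above. The second rule (skipping pattern elements with zero primary weight) is not modeled here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Local stand-ins mirroring the three proposed attribute values.
   Illustrative names only; these are not the ICU header definitions. */
typedef enum {
    CMP_STANDARD,
    CMP_PATTERN_BASE_IS_WILDCARD,
    CMP_PATTERN_AND_TEXT_BASE_IS_WILDCARD
} ElementComparison;

/* Simplified collation element (an assumption for this sketch),
   taken to be already masked to the collator strength. */
typedef struct {
    uint16_t primary;
    uint8_t  secondary;
    uint8_t  tertiary;
} CE;

#define BASE_WEIGHT 0x05 /* base secondary/tertiary weight, per the comment above */

/* Compare one non-primary field of a pattern element and a text element. */
static bool field_matches(uint8_t pat, uint8_t txt, ElementComparison mode) {
    if (pat == txt) return true;
    if (mode == CMP_STANDARD) return false;
    if (pat == BASE_WEIGHT) return true; /* pattern base weight acts as wildcard */
    /* Text base weight acts as wildcard only in the pattern-and-text mode. */
    return mode == CMP_PATTERN_AND_TEXT_BASE_IS_WILDCARD && txt == BASE_WEIGHT;
}

/* Elements match when primaries are equal and every differing
   non-primary field is covered by a wildcard under the given mode. */
static bool ce_matches(CE pat, CE txt, ElementComparison mode) {
    if (pat.primary != txt.primary) return false;
    return field_matches(pat.secondary, txt.secondary, mode)
        && field_matches(pat.tertiary, txt.tertiary, mode);
}

/* Made-up weights: same primary for e, é, ë; a different primary for f. */
static const CE CE_E       = {0x29, BASE_WEIGHT, BASE_WEIGHT};
static const CE CE_E_ACUTE = {0x29, 0x8A, BASE_WEIGHT};
static const CE CE_E_DIAER = {0x29, 0x8C, BASE_WEIGHT};
static const CE CE_F       = {0x2A, BASE_WEIGHT, BASE_WEIGHT};

/* Checks behaviors A1, A2, and A3 from the examples. */
static void self_check(void) {
    /* A1: e in the pattern matches é in the text. */
    assert(ce_matches(CE_E, CE_E_ACUTE, CMP_PATTERN_BASE_IS_WILDCARD));
    /* A2 is off in pattern-only mode: é in the pattern does not match e. */
    assert(!ce_matches(CE_E_ACUTE, CE_E, CMP_PATTERN_BASE_IS_WILDCARD));
    /* A2 is on in pattern-and-text mode. */
    assert(ce_matches(CE_E_ACUTE, CE_E, CMP_PATTERN_AND_TEXT_BASE_IS_WILDCARD));
    /* A3: é never matches ë, since neither side has the base weight. */
    assert(!ce_matches(CE_E_ACUTE, CE_E_DIAER, CMP_PATTERN_AND_TEXT_BASE_IS_WILDCARD));
    /* Different primary weights never match. */
    assert(!ce_matches(CE_E, CE_F, CMP_PATTERN_AND_TEXT_BASE_IS_WILDCARD));
}
```

Calling self_check() from a test harness exercises the cases. Keeping the per-field test in one helper means the secondary and tertiary levels follow the same rule, which is how example B is described as extending example A.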

TracBot
June 30, 2018, 11:31 PM
Trac Comment 4 by —2010-01-15T01:09:12.000Z

So far, to meet the API freeze deadline, I have checked in the API-related changes (the new attribute and value constants, and the code that sets/gets them in a new field in USearch).

TracBot
June 30, 2018, 11:31 PM
Trac Comment 8 by —2016-10-05T23:14:33.489Z

Milestone 4.3.5 deleted

Fixed

Assignee

Peter Edberg

Reporter

Peter Edberg

Components

Labels

None

Reviewer

None

Priority

major

Time Needed

None

Fix versions