RFE: include Lm in word characters for

Description

"\w" can match KATAKANA, but it cannot match
"ー" (U+30FC; KATAKANA-HIRAGANA PROLONGED SOUND MARK).
"ー" is commonly used with KATAKANA and is effectively part of it.

For example, "ICHIRO" is written in KATAKANA as "イチロー",
but "\w+" matches only "イチロ" in that string.

"\w" is equivalent to "[_\p{Ll}\p{Lu}\p{Lt}\p{Lo}\p{Nd}]", so "ー", which belongs
to the "Lm" category, isn't matched by "\w".
This behavior is impractical for Japanese text.
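
The behavior described above can be reproduced outside the regex engine with a small sketch. It assumes the "\w" definition quoted in this report and emulates it with Python's unicodedata module; it does not call the engine itself. The helper names (is_word_char, leading_word) are hypothetical and exist only for this illustration.

```python
import unicodedata

# General categories this report lists for "\w" (the underscore is handled separately).
REPORTED_WORD_CATEGORIES = {"Ll", "Lu", "Lt", "Lo", "Nd"}


def is_word_char(ch, categories):
    """Return True if ch is "_" or its general category is in categories."""
    return ch == "_" or unicodedata.category(ch) in categories


def leading_word(text, categories):
    """Emulate "\\w+" anchored at the start: the longest prefix of word characters."""
    end = 0
    while end < len(text) and is_word_char(text[end], categories):
        end += 1
    return text[:end]


name = "イチロー"  # "ICHIRO" in KATAKANA

# Definition as reported: the prolonged sound mark (Lm) ends the match.
current = leading_word(name, REPORTED_WORD_CATEGORIES)

# Requested change: treat Lm as a word category as well.
proposed = leading_word(name, REPORTED_WORD_CATEGORIES | {"Lm"})

print(unicodedata.category("ー"))  # Lm
print(current)                     # イチロ
print(proposed)                    # イチロー
```

With "Lm" added to the set, the whole name is consumed, which is the behavior this RFE asks for.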

"\b" is also having similar problem.

"Lm" category charactor is following.
At least 3005-30FE and FF9E-FF9E that is used in Japanese should be matched by
"\w".

02B0;MODIFIER LETTER SMALL H
02B1;MODIFIER LETTER SMALL H WITH HOOK
02B2;MODIFIER LETTER SMALL J
02B3;MODIFIER LETTER SMALL R
02B4;MODIFIER LETTER SMALL TURNED R
02B5;MODIFIER LETTER SMALL TURNED R WITH HOOK
02B6;MODIFIER LETTER SMALL CAPITAL INVERTED R
02B7;MODIFIER LETTER SMALL W
02B8;MODIFIER LETTER SMALL Y
02B9;MODIFIER LETTER PRIME
02BA;MODIFIER LETTER DOUBLE PRIME
02BB;MODIFIER LETTER TURNED COMMA
02BC;MODIFIER LETTER APOSTROPHE
02BD;MODIFIER LETTER REVERSED COMMA
02BE;MODIFIER LETTER RIGHT HALF RING
02BF;MODIFIER LETTER LEFT HALF RING
02C0;MODIFIER LETTER GLOTTAL STOP
02C1;MODIFIER LETTER REVERSED GLOTTAL STOP
02C6;MODIFIER LETTER CIRCUMFLEX ACCENT
02C7;CARON
02C8;MODIFIER LETTER VERTICAL LINE
02C9;MODIFIER LETTER MACRON
02CA;MODIFIER LETTER ACUTE ACCENT
02CB;MODIFIER LETTER GRAVE ACCENT
02CC;MODIFIER LETTER LOW VERTICAL LINE
02CD;MODIFIER LETTER LOW MACRON
02CE;MODIFIER LETTER LOW GRAVE ACCENT
02CF;MODIFIER LETTER LOW ACUTE ACCENT
02D0;MODIFIER LETTER TRIANGULAR COLON
02D1;MODIFIER LETTER HALF TRIANGULAR COLON
02E0;MODIFIER LETTER SMALL GAMMA
02E1;MODIFIER LETTER SMALL L
02E2;MODIFIER LETTER SMALL S
02E3;MODIFIER LETTER SMALL X
02E4;MODIFIER LETTER SMALL REVERSED GLOTTAL STOP
02EE;MODIFIER LETTER DOUBLE APOSTROPHE
037A;GREEK YPOGEGRAMMENI
0559;ARMENIAN MODIFIER LETTER LEFT HALF RING
0640;ARABIC TATWEEL
06E5;ARABIC SMALL WAW
06E6;ARABIC SMALL YEH
0E46;THAI CHARACTER MAIYAMOK
0EC6;LAO KO LA
17D7;KHMER SIGN LEK TOO
1843;MONGOLIAN LETTER TODO LONG VOWEL SIGN
1D2C;MODIFIER LETTER CAPITAL A
1D2D;MODIFIER LETTER CAPITAL AE
1D2E;MODIFIER LETTER CAPITAL B
1D2F;MODIFIER LETTER CAPITAL BARRED B
1D30;MODIFIER LETTER CAPITAL D
1D31;MODIFIER LETTER CAPITAL E
1D32;MODIFIER LETTER CAPITAL REVERSED E
1D33;MODIFIER LETTER CAPITAL G
1D34;MODIFIER LETTER CAPITAL H
1D35;MODIFIER LETTER CAPITAL I
1D36;MODIFIER LETTER CAPITAL J
1D37;MODIFIER LETTER CAPITAL K
1D38;MODIFIER LETTER CAPITAL L
1D39;MODIFIER LETTER CAPITAL M
1D3A;MODIFIER LETTER CAPITAL N
1D3B;MODIFIER LETTER CAPITAL REVERSED N
1D3C;MODIFIER LETTER CAPITAL O
1D3D;MODIFIER LETTER CAPITAL OU
1D3E;MODIFIER LETTER CAPITAL P
1D3F;MODIFIER LETTER CAPITAL R
1D40;MODIFIER LETTER CAPITAL T
1D41;MODIFIER LETTER CAPITAL U
1D42;MODIFIER LETTER CAPITAL W
1D43;MODIFIER LETTER SMALL A
1D44;MODIFIER LETTER SMALL TURNED A
1D45;MODIFIER LETTER SMALL ALPHA
1D46;MODIFIER LETTER SMALL TURNED AE
1D47;MODIFIER LETTER SMALL B
1D48;MODIFIER LETTER SMALL D
1D49;MODIFIER LETTER SMALL E
1D4A;MODIFIER LETTER SMALL SCHWA
1D4B;MODIFIER LETTER SMALL OPEN E
1D4C;MODIFIER LETTER SMALL TURNED OPEN E
1D4D;MODIFIER LETTER SMALL G
1D4E;MODIFIER LETTER SMALL TURNED I
1D4F;MODIFIER LETTER SMALL K
1D50;MODIFIER LETTER SMALL M
1D51;MODIFIER LETTER SMALL ENG
1D52;MODIFIER LETTER SMALL O
1D53;MODIFIER LETTER SMALL OPEN O
1D54;MODIFIER LETTER SMALL TOP HALF O
1D55;MODIFIER LETTER SMALL BOTTOM HALF O
1D56;MODIFIER LETTER SMALL P
1D57;MODIFIER LETTER SMALL T
1D58;MODIFIER LETTER SMALL U
1D59;MODIFIER LETTER SMALL SIDEWAYS U
1D5A;MODIFIER LETTER SMALL TURNED M
1D5B;MODIFIER LETTER SMALL V
1D5C;MODIFIER LETTER SMALL AIN
1D5D;MODIFIER LETTER SMALL BETA
1D5E;MODIFIER LETTER SMALL GREEK GAMMA
1D5F;MODIFIER LETTER SMALL DELTA
1D60;MODIFIER LETTER SMALL GREEK PHI
1D61;MODIFIER LETTER SMALL CHI
3005;IDEOGRAPHIC ITERATION MARK
3031;VERTICAL KANA REPEAT MARK
3032;VERTICAL KANA REPEAT WITH VOICED SOUND MARK
3033;VERTICAL KANA REPEAT MARK UPPER HALF
3034;VERTICAL KANA REPEAT WITH VOICED SOUND MARK UPPER HALF
3035;VERTICAL KANA REPEAT MARK LOWER HALF
303B;VERTICAL IDEOGRAPHIC ITERATION MARK
309D;HIRAGANA ITERATION MARK
309E;HIRAGANA VOICED ITERATION MARK
30FC;KATAKANA-HIRAGANA PROLONGED SOUND MARK
30FD;KATAKANA ITERATION MARK
30FE;KATAKANA VOICED ITERATION MARK
FF70;HALFWIDTH KATAKANA-HIRAGANA PROLONGED SOUND MARK
FF9E;HALFWIDTH KATAKANA VOICED SOUND MARK
FF9F;HALFWIDTH KATAKANA SEMI-VOICED SOUND MARK
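
The Japanese entries in the list above can be cross-checked against a Unicode database. The sketch below uses the data bundled with Python's unicodedata module, which may be a newer Unicode version than the one this list was taken from; the code points are copied from the list, and the variable names are hypothetical.

```python
import unicodedata

# Code points from the list above that are used in Japanese text.
JAPANESE_LM_CODE_POINTS = [
    0x3005, 0x3031, 0x3032, 0x3033, 0x3034, 0x3035, 0x303B,
    0x309D, 0x309E, 0x30FC, 0x30FD, 0x30FE, 0xFF70, 0xFF9E, 0xFF9F,
]

# Look up the general category of each code point.
categories = {cp: unicodedata.category(chr(cp)) for cp in JAPANESE_LM_CODE_POINTS}

# Print in the same "code point;name" layout as the list, with the category appended.
for cp, cat in categories.items():
    print(f"{cp:04X};{unicodedata.name(chr(cp))};{cat}")

# Any entry that is not Lm would be collected here.
not_lm = [f"{cp:04X}" for cp, cat in categories.items() if cat != "Lm"]
```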

Assignee: Andy Heninger
Reporter: TracBot
Priority: major