Investigate anomalies in wikidata language groups
I’m sorry this ticket is confusing. I think all of the titles and metadata are correct.
https://unicode-org.atlassian.net/browse/CLDR-15020 is to fix bad language data.
https://unicode-org.atlassian.net/browse/CLDR-17418 is a subtask to fix tool issues found while investigating the issue. This has a merged PR and is done.
Annemarie, I think you’ve twice put CLDR-15020 into review incorrectly just because there’s a PR referencing it. The PR references it because it’s very much related. Now we have the ‘merged’ field so hopefully this will happen less often.
I see this ticket CLDR-15020 was reassigned to me - ok, set it for 48. Mark, you still have https://unicode-org.atlassian.net/browse/CLDR-17076
I will work on communicating with Wikidata.
Commit checker seems to think there’s no commit? Does this belong to a different fix version? I think maybe this is fixed by another ticket and/or we should move this out to 48 if work is still needed?
I can’t see all the commits anymore; Development just hangs. See attachment.
I’m confused about the status of this ticket. The PR and commit have different ticket numbers, and the commit references this ticket.
Most of a year later, there are still loops (running GenerateLanguageContainment):
The goal is to communicate these issues to Wikidata.
See GenerateLanguageContainment.java (also document further how to repair problems).
There are duplicate codes from Wikidata, and cycles in the grouping. See the two lists below.
drl has multiple entries Bandjigali (drl, Q1042592), Darling (drl, Q522394)
gdj has multiple entries Kunggara (gdj, Q1055018), Gurdjar (gdj, Q561993)
xny has multiple entries Nijadali (xny, Q1060072), Nyiyaparli (xny, Q1691942)
uz has multiple entries Q106676 (uz, Q106676), Uzbek (uz, Q926)
kzk has multiple entries Kazukuru (kzk, Q108906), Guliguli (kzk, Q312076), Dororo (kzk, Q5692)
lcq has multiple entries Q1096326 (lcq, Q1096326), Luhu (lcq, Q669989)
sev has multiple entries Cebaara (sev, Q1097512), Nyarafolo (sev, Q3630)
man has multiple entries Eastern Maninkaka (man, Q1100213), Mandinka (man, Q3367), Manding languages (man, Q3577)
suj has multiple entries Shubi (suj, Q1100536), Subi (suj, Q763129)
aqd has multiple entries Kolum So Dogon (aqd, Q1100634), Ampari Dogon (aqd, Q474805)
opa has multiple entries Q1101153 (opa, Q1101153), Okpamheri (opa, Q391333)
yam has multiple entries Q1101418 (yam, Q1101418), Yamba (yam, Q3690)
scv has multiple entries Sheni (scv, Q1101582), Ziriya (scv, Q391394)
kdz has multiple entries Kwaja (kdz, Q1112886), Ndaktup (kdz, Q3612)
duz has multiple entries Q1113781 (duz, Q1113781), Duli (duz, Q531340)
grb has multiple entries Northern Grebo (grb, Q1115704), Grebo (grb, Q3525)
ktz has multiple entries Juǀʼhoansi (ktz, Q119229), ǂKxʼauǁʼein (ktz, Q247874)
kg has multiple entries Q11929909 (kg, Q1192990), Kongo (kg, Q3370)
et has multiple entries Q1236154 (et, Q1236154), Estonian (et, Q907)
rki has multiple entries Q12628927 (rki, Q1262892), Arakanese (rki, Q345074)
amq has multiple entries Elpaputih (amq, Q1263056), Amahai (amq, Q332738)
wny has multiple entries Q12631364 (wny, Q1263136), Wanyi (wny, Q796820)
bal has multiple entries Southern Balochi (bal, Q1263400), Baluchi (bal, Q3304)
raq has multiple entries Q1263553 (raq, Q1263553), Saam (raq, Q739564)
ncq has multiple entries Northern Katang (ncq, Q1263817), Kataang (ncq, Q1295362)
om has multiple entries West Central Oromo (om, Q1263901), Oromo (om, Q3386)
apf has multiple entries Paranan (apf, Q1263929), Paranan Agta (apf, Q713543)
hle has multiple entries Q12641739 (hle, Q1264173), Hlersu (hle, Q587353)
ps has multiple entries Q1264219 (ps, Q1264219), Pashto (ps, Q5868)
srx has multiple entries Western Pahari (srx, Q1264557), Sirmauri (srx, Q753050)
mom has multiple entries Q12645985 (mom, Q1264598), Chorotega (mom, Q5654)
nol has multiple entries Wintuan languages (nol, Q129425), Nomlaki (nol, Q334322)
ers has multiple entries Ersu (ers, Q1295241), Lizu (ers, Q666065)
gvr has multiple entries Q1295247 (gvr, Q1295247), Gurung (gvr, Q239234)
zkd has multiple entries Kado (zkd, Q1295253), Kadu (zkd, Q5432445)
lrr has multiple entries Southern Lorung (lrr, Q1295274), Yamphe (lrr, Q1295331)
vaj has multiple entries Q1295278 (vaj, Q1295278), Q2555939 (vaj, Q2555939), Sekele (vaj, Q383297), Sekele (vaj, Q5652)
mry has multiple entries Q1295279 (mry, Q1295279), Q2555952 (mry, Q2555952), Mandaya (mry, Q674792)
cir has multiple entries Q1295283 (cir, Q1295283), Tîrî (cir, Q786228)
ngv has multiple entries Ngong (ngv, Q1295291), Nagumi (ngv, Q3584)
adx has multiple entries Q1295301 (adx, Q1295301), Amdo Tibetan (adx, Q5650)
gal has multiple entries Q1295317 (gal, Q1295317), Galoli (gal, Q3532)
taj has multiple entries Southwestern Tamang (taj, Q1295317), Eastern Tamang (taj, Q1295317)
weo has multiple entries Q1295329 (weo, Q1295329), Wemale (weo, Q798216)
acn has multiple entries Q1295330 (acn, Q1295330), Achang (acn, Q5658)
zom has multiple entries Q12953340 (zom, Q1295334), Zo (zom, Q3701)
dtp has multiple entries Q1295351 (dtp, Q1295351), Kota Marudu Tinagas (dtp, Q1864228), Coastal Kadazan (dtp, Q330719), Tempasuk Dusun (dtp, Q352915), Kadazandusun (dtp, Q531722)
tdf has multiple entries Kasseng (tdf, Q1295362), Talieng (tdf, Q3752510)
rmx has multiple entries Lamam (rmx, Q1295366), Romam (rmx, Q2269460)
mib has multiple entries Atatláhuca–San Miguel Mixtec (mib, Q1295372), Atatláhuca Mixtec (mib, Q3209304)
oyb has multiple entries Sok (oyb, Q1295388), Oi (oyb, Q1359374), Cheng (oyb, Q509127)
yi has multiple entries Eastern Yiddish (yi, Q1295398), Yiddish (yi, Q864)
quh has multiple entries South Bolivian Quechua (quh, Q130773), Chilean Quechua (quh, Q2555970)
ku-Latn has multiple entries Q1317548 (ku-Latn, Q1317548), Q6436299 (ku-Latn, Q6436299)
raj has multiple entries Rajasthani (raj, Q1319), Malvi (raj, Q3341)
rom has multiple entries Romani (rom, Q1320), Q266919 (rom, Q266919)
zap has multiple entries Zapotec (zap, Q1321), Isthmus Zapotec (zap, Q5672)
za has multiple entries Zhuang (za, Q1321), Yongbei Zhuang (za, Q805498)
den has multiple entries Slavey (den, Q1327), Denetaca (den, Q2855)
lah has multiple entries Lahnda languages (lah, Q133477), Western Punjabi (lah, Q138949)
bho has multiple entries Bihari (bho, Q13530), Bhojpuri (bho, Q3326)
ar has multiple entries Arabic (ar, Q1395), Q5646 (ar, Q5646)
fr-FR has multiple entries Q140021 (fr-FR, Q140021), Q308319 (fr-FR, Q308319)
ff has multiple entries Pulaar (ff, Q142020), Fula (ff, Q3345)
ms has multiple entries Malaysian (ms, Q1506), Malay (ms, Q923)
mg has multiple entries Plateau Malagasy (mg, Q1506930), Malagasy (mg, Q793)
enl has multiple entries Enlhet (enl, Q1546267), Lengua (enl, Q312106)
cas has multiple entries Mosetenan languages (cas, Q1554803), Chimane (cas, Q3595)
kr has multiple entries Central Kanuri (kr, Q1563721), Q3609 (kr, Q3609)
aou has multiple entries A'ou (aou, Q1610999), Q5640 (aou, Q5640)
wgb has multiple entries Q16112427 (wgb, Q1611242), Wagawaga (wgb, Q795948)
cmr has multiple entries Khumi Awa Chin (cmr, Q1611438), Mro (cmr, Q1688997)
azd has multiple entries Eastern Durango Nahuatl (azd, Q1611544), Mexicanero (azd, Q238636)
xnt has multiple entries Mohegan-Montauk-Narragansett (xnt, Q1611612), Narragansett (xnt, Q333611)
bua has multiple entries Russia Buriat (bua, Q1611662), Buryat (bua, Q3312)
gn has multiple entries Paraguayan Guaraní (gn, Q1747806), Guarani (gn, Q3587)
jgk has multiple entries Gwak (jgk, Q1752369), Jarawa (jgk, Q3542)
bwx has multiple entries Q1762236 (bwx, Q1762236), Q1762583 (bwx, Q1762583), Q292826 (bwx, Q292826), Q292826 (bwx, Q292826), Bu-Nao (bwx, Q5641)
gon has multiple entries Gondi (gon, Q177536), Northern Gondi (gon, Q1981160)
sq has multiple entries Q18093 (sq, Q18093), Albanian (sq, Q874)
hy has multiple entries Q18105 (hy, Q18105), Q878 (hy, Q878)
cax has multiple entries Chiquitano (cax, Q184499), Q635624 (cax, Q635624)
egy has multiple entries Late Egyptian (egy, Q185232), Q5086 (egy, Q5086)
fbl has multiple entries West Albay Bikol (fbl, Q1860380), Albay Bikol (fbl, Q470946)
isk has multiple entries Sanglechi-Ishkashimi (isk, Q1871123), Ishkashimi (isk, Q3341)
gxx has multiple entries Southern Wee (gxx, Q1992158), Guéré (gxx, Q3691)
hai has multiple entries Northern Haida (hai, Q2005448), Haida (hai, Q3330)
dih has multiple entries Diegueño (dih, Q2006491), Tipai (dih, Q302747)
bpp has multiple entries Kaure (bpp, Q2052653), Narau (bpp, Q696545)
ay has multiple entries Central Aymara (ay, Q2052661), Aymara (ay, Q462)
kpe has multiple entries Liberia Kpelle (kpe, Q2052722), Kpelle (kpe, Q3567)
bxu has multiple entries China Buriat (bxu, Q2052741), Q288449 (bxu, Q288449), Q348218 (bxu, Q348218)
mtm has multiple entries Q2066941 (mtm, Q2066941), Karagas (mtm, Q3375), Mator (mtm, Q3645)
ola has multiple entries Thudam (ola, Q2267482), Walungge (ola, Q2267484)
opt has multiple entries Opata (opt, Q230458), Q7887299 (opt, Q788729)
bfy has multiple entries Bagheli (bfy, Q235636), Q713206 (bfy, Q713206)
ulw has multiple entries Ulwa (ulw, Q240555), Sumo languages (ulw, Q97106)
map has multiple entries Northern Tsou languages (map, Q244999), Austronesian (map, Q4922)
kwv has multiple entries Q2555931 (kwv, Q2555931), Kaba (kwv, Q391536)
pub has multiple entries Q25559435 (pub, Q2555943), Purum (pub, Q640056)
dev has multiple entries Gabutamon (dev, Q2555946), Domung (dev, Q529137)
ngt has multiple entries Ngeq (ngt, Q2555954), Khlor (ngt, Q2792140)
dif has multiple entries Diyari (dif, Q2555956), Dhirari (dif, Q528503)
css has multiple entries Southern Ohlone (css, Q2555966), Mutsun (css, Q311186), Rumsen (css, Q345318)
ik has multiple entries Northwest Alaska Inupiatun (ik, Q2555971), Inupiat (ik, Q2718)
del has multiple entries Delaware languages (del, Q266576), Munsee (del, Q5654)
ikt has multiple entries Inuvialuktun (ikt, Q2799), Inuinnaqtun (ikt, Q2807)
ak has multiple entries Akan (ak, Q2802), Fante (ak, Q3557), Twi (ak, Q3685)
xsc has multiple entries Scythian-Sarmatian (xsc, Q2845119), Scythian (xsc, Q74983)
arc has multiple entries Aramaic (arc, Q2860), Imperial Aramaic (arc, Q707949)
iu has multiple entries Inuktitut (iu, Q2992), Eastern Canadian Inuktitut (iu, Q412651)
xdy has multiple entries Q302146 (xdy, Q302146), Bamayo (xdy, Q351489)
gwd has multiple entries Gawwada (gwd, Q303213), Q312796 (gwd, Q312796)
gba has multiple entries Gbaya languages (gba, Q309998), Northwest Gbaya (gba, Q3659)
bcg has multiple entries Pukur (bcg, Q3117266), Baga Mboteni (bcg, Q3487)
hmn has multiple entries Northern Qiandong Miao (hmn, Q313883), Hmongic languages (hmn, Q904077)
huw has multiple entries Hukumina (huw, Q314298), Q712882 (huw, Q712882)
kak has multiple entries Tinoc Kallahan (kak, Q319221), Kalanguya (kak, Q319222)
sw has multiple entries Swahili (sw, Q319753), Swahili (sw, Q783)
kxr has multiple entries Koro (kxr, Q319899), Papitalai (kxr, Q652865)
kru has multiple entries Nepali Kurux (kru, Q320062), Kurukh (kru, Q3349)
ksh has multiple entries Ripuarian (ksh, Q3214), Q462 (ksh, Q462)
dar has multiple entries Dargwa (dar, Q3233), Dargin (dar, Q522263)
doi has multiple entries Dogri (doi, Q3273), Dogri–Kangri languages (doi, Q558334)
luy has multiple entries Bukusu (luy, Q3293), Luhya (luy, Q3589)
dz has multiple entries Dzongkha (dz, Q3308), Adap (dz, Q351240)
sgh has multiple entries Bartangi (sgh, Q3325), Shughni (sgh, Q3405)
bik has multiple entries Central Bikol (bik, Q3328), Bikol (bik, Q3545)
fil has multiple entries Filipino (fil, Q3329), Tagalog (fil, Q3405)
msn has multiple entries Mwesen (msn, Q333111), Vurës (msn, Q356385)
cr has multiple entries Cree (cr, Q3339), Woods Cree (cr, Q5630)
bfa has multiple entries Q333936 (bfa, Q333936), Q334658 (bfa, Q334658), Q341223 (bfa, Q341223), Bari (bfa, Q3504)
iba has multiple entries Iban (iba, Q3342), Balau (iba, Q485013)
mid has multiple entries Mandaic (mid, Q3350), Neo-Mandaic (mid, Q699174)
xal has multiple entries Kalmyk Oirat (xal, Q3363), Q356517 (xal, Q356517), Oirat (xal, Q5695)
phr has multiple entries Pahari-Potwari (phr, Q3373), Mirpur Punjabi (phr, Q687448)
oj has multiple entries Ojibwe (oj, Q3387), Eastern Ojibwa (oj, Q533034)
sc has multiple entries Sardinian (sc, Q3397), Q77797 (sc, Q77797)
prt has multiple entries Phray (prt, Q340147), Phai (prt, Q718018)
tmh has multiple entries Q3406 (tmh, Q3406), Tahoua (tmh, Q5639)
kv has multiple entries Komi-Zyryan (kv, Q3411), Komi (kv, Q3612)
yuf has multiple entries Yavapai (yuf, Q3420), Havasupai–Hualapai (yuf, Q356528)
kok has multiple entries Konkani (kok, Q3423), Maharashtrian Konkani (kok, Q673338)
ema has multiple entries Uokha (ema, Q344121), Ivbiosakon (ema, Q3542)
beb has multiple entries Bebele (beb, Q3497), Q3511 (beb, Q3511)
bmf has multiple entries Bom (bmf, Q3508), Krim (bmf, Q3571)
fa has multiple entries Iranian Persian (fa, Q351363), Persian (fa, Q916)
qu has multiple entries Cusco Quechua (qu, Q3514), Quechua (qu, Q521)
bjp has multiple entries Tangga (bjp, Q351511), Fanamaket (bjp, Q5670426)
az has multiple entries North Azerbaijani (az, Q351531), Azerbaijani (az, Q929)
ku has multiple entries Kurmanji (ku, Q3616), Kurdish (ku, Q3636)
ras has multiple entries Tegali (ras, Q3652), Q3696 (ras, Q3696)
din has multiple entries Southwestern Dinka (din, Q3654), Dinka (din, Q5646)
jrb has multiple entries Q3773 (jrb, Q3773), Judeo-Moroccan (jrb, Q5659)
chm has multiple entries Meadow Mari (chm, Q390661), Mari (chm, Q97368)
uth has multiple entries Hun-Saare (uth, Q391397), ut-Hun (uth, Q6331366)
tvd has multiple entries Q391443 (tvd, Q391443), Vadi (tvd, Q391493)
nbr has multiple entries Ningye (nbr, Q391534), Gbantu (nbr, Q552931)
waw has multiple entries Q391565 (waw, Q391565), Waiwai (waw, Q5663)
wnn has multiple entries Maykulan (wnn, Q391569), Wunumara (wnn, Q4700473)
snz has multiple entries Asas (snz, Q480363), Sinsauru (snz, Q752503)
dgl has multiple entries Kenuzi-Dongola (dgl, Q529599), Dongolawi (dgl, Q5521891)
mwr has multiple entries Q5631 (mwr, Q5631), Dhundari (mwr, Q63335)
bir has multiple entries Bikaru (bir, Q5634), Bisorio (bir, Q884474)
aog has multiple entries Angoram (aog, Q5636), Maramba (aog, Q675474)
pij has multiple entries Coyaima (pij, Q5645), Natagaimas (pij, Q696793), Pijao (pij, Q719351)
aas has multiple entries Aramanik (aas, Q5654), Asa (aas, Q5662)
mn has multiple entries Darkhad (mn, Q5655), Khalkha Mongolian (mn, Q639980), Mongolian (mn, Q924)
kln has multiple entries Kalenjin (kln, Q63722), Sabaot (kln, Q739589)
dmw has multiple entries Karranga (dmw, Q637334), Mudbura (dmw, Q693157)
kk-Latn has multiple entries Q6436299 (kk-Latn, Q6436299), Q9068128 (kk-Latn, Q9068128)
yue has multiple entries Yue Chinese (yue, Q703395), Cantonese (yue, Q918)
tyj has multiple entries Tai Do (tyj, Q767574), Tai Mène (tyj, Q767579)
tpo has multiple entries Tai Hang Tong (tpo, Q767575), Tai Pao (tpo, Q767579)
zh has multiple entries Chinese (zh, Q785), Mandarin Chinese (zh, Q919)
ukg has multiple entries Ukuriguma (ukg, Q787862), Ukwa (ukg, Q787863)
yrm has multiple entries Yir-Yoront (yrm, Q805381), Yirrk-Thangalkl (yrm, Q805382)
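For anyone reproducing the list above outside the tool, here is a minimal sketch of the check. It is not the logic in GenerateLanguageContainment.java; the function names and the (code, label, QID) tuple layout are assumptions made for illustration. It groups entries by code and reports any code attached to more than one distinct item, in the same "has multiple entries" format.

```python
from collections import defaultdict


def find_duplicate_codes(entries):
    """Return {code: [(label, qid), ...]} for codes attached to more than one item.

    `entries` is an iterable of (code, label, qid) tuples (hypothetical layout).
    Exact repeats of the same (label, qid) pair are collapsed by the set.
    """
    by_code = defaultdict(set)
    for code, label, qid in entries:
        by_code[code].add((label, qid))
    return {
        code: sorted(items, key=lambda item: item[1])  # order by QID string
        for code, items in by_code.items()
        if len(items) > 1
    }


def format_duplicate(code, items):
    """Render one report line in the style of the list in this ticket."""
    listed = ", ".join(f"{label} ({code}, {qid})" for label, qid in items)
    return f"{code} has multiple entries {listed}"


# Sample rows copied from this ticket; dng has a single item, so it is not reported.
sample = [
    ("drl", "Bandjigali", "Q1042592"),
    ("drl", "Darling", "Q522394"),
    ("gdj", "Kunggara", "Q1055018"),
    ("gdj", "Gurdjar", "Q561993"),
    ("dng", "Dungan", "Q3305"),
]

duplicates = find_duplicate_codes(sample)
for code in sorted(duplicates):
    print(format_duplicate(code, duplicates[code]))
```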
The duplicate codes can be listed with:
Cycle in [dng, zhx] from
Dungan (dng) (dng, Q3305), Q304877 (Q304877), Chinese (zh) (zh, Q919), Chinese (zh) (zh, Q785), Sinitic languages (Q3385), Sino-Tibetan [Other] (sit) (sit, Q4596), Dené–Caucasian (Q50796), Q1737690 (Q1737690)
Dungan (dng) (dng, Q3305), Q304877 (Q304877), Chinese (zh) (zh, Q919), Chinese (zh) (zh, Q785), Sinitic languages (Q3385), Sino-Tibetan [Other] (sit) (sit, Q4596), Sino-Austronesian (Q752474), Q2016217 (Q2016217)
Dungan (dng) (dng, Q3305), Q304877 (Q304877), Chinese (zh) (zh, Q919), Chinese (zh) (zh, Q785), Sinitic languages (Q3385), Sino-Tibetan [Other] (sit) (sit, Q4596), Q97690709 (Q9769070)
Dungan (dng) (dng, Q3305), Q304877 (Q304877), Chinese (zh) (zh, Q919), Chinese (zh) (zh, Q785), Sinitic languages (Q3385), Sino-Tibetan [Other] (sit) (sit, Q4596), Q9879175 (Q9879175)
Dungan (dng) (dng, Q3305), Q304877 (Q304877), Chinese (zh) (zh, Q919), Chinese (zh) (zh, Q785), Sino-Tibetan [Other] (sit) (sit, Q4596), Dené–Caucasian (Q50796), Q1737690 (Q1737690)
Dungan (dng) (dng, Q3305), Q304877 (Q304877), Chinese (zh) (zh, Q919), Chinese (zh) (zh, Q785), Sino-Tibetan [Other] (sit) (sit, Q4596), Sino-Austronesian (Q752474), Q2016217 (Q2016217)
Dungan (dng) (dng, Q3305), Q304877 (Q304877), Chinese (zh) (zh, Q919), Chinese (zh) (zh, Q785), Sino-Tibetan [Other] (sit) (sit, Q4596), Q97690709 (Q9769070)
Dungan (dng) (dng, Q3305), Q304877 (Q304877), Chinese (zh) (zh, Q919), Chinese (zh) (zh, Q785), Sino-Tibetan [Other] (sit) (sit, Q4596), Q9879175 (Q9879175)
Cycle in [map] from
Austronesian [Other] (map) (map, Q244999), Tsouic languages (Q71665), Austronesian [Other] (map) (map, Q4922), Austric (Q78378), Q2016217 (Q2016217)
Austronesian [Other] (map) (map, Q244999), Tsouic languages (Q71665), Austronesian [Other] (map) (map, Q4922), Austro-Tai (Q78381), Q2016217 (Q2016217)
Austronesian [Other] (map) (map, Q244999), Tsouic languages (Q71665), Austronesian [Other] (map) (map, Q4922), Sino-Austronesian (Q752474), Q2016217 (Q2016217)
Austronesian [Other] (map) (map, Q244999), Tsouic languages (Q71665), Formosan languages (fox) (fox, Q71527), Austronesian [Other] (map) (map, Q4922), Austric (Q78378), Q2016217 (Q2016217)
Austronesian [Other] (map) (map, Q244999), Tsouic languages (Q71665), Formosan languages (fox) (fox, Q71527), Austronesian [Other] (map) (map, Q4922), Austro-Tai (Q78381), Q2016217 (Q2016217)
Austronesian [Other] (map) (map, Q244999), Tsouic languages (Q71665), Formosan languages (fox) (fox, Q71527), Austronesian [Other] (map) (map, Q4922), Sino-Austronesian (Q752474), Q2016217 (Q2016217)
Cycle in [map, fox] from
Austronesian [Other] (map) (map, Q244999), Tsouic languages (Q71665), Austronesian [Other] (map) (map, Q4922), Austric (Q78378), Q2016217 (Q2016217)
Austronesian [Other] (map) (map, Q244999), Tsouic languages (Q71665), Austronesian [Other] (map) (map, Q4922), Austro-Tai (Q78381), Q2016217 (Q2016217)
Austronesian [Other] (map) (map, Q244999), Tsouic languages (Q71665), Austronesian [Other] (map) (map, Q4922), Sino-Austronesian (Q752474), Q2016217 (Q2016217)
Austronesian [Other] (map) (map, Q244999), Tsouic languages (Q71665), Formosan languages (fox) (fox, Q71527), Austronesian [Other] (map) (map, Q4922), Austric (Q78378), Q2016217 (Q2016217)
Austronesian [Other] (map) (map, Q244999), Tsouic languages (Q71665), Formosan languages (fox) (fox, Q71527), Austronesian [Other] (map) (map, Q4922), Austro-Tai (Q78381), Q2016217 (Q2016217)
Austronesian [Other] (map) (map, Q244999), Tsouic languages (Q71665), Formosan languages (fox) (fox, Q71527), Austronesian [Other] (map) (map, Q4922), Sino-Austronesian (Q752474), Q2016217 (Q2016217)
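One plausible reading of the cycle reports above (an assumption, not confirmed by the tool's source) is that two Wikidata items sharing a code make the code-level graph loop: once items are collapsed to their codes, the chain map (Q244999), fox (Q71527), map (Q4922) becomes map, fox, map. The sketch below shows a generic depth-first search for such loops in a child-to-parents map. The function name and the edge data are hypothetical and only illustrate the idea; this is not how GenerateLanguageContainment.java is implemented.

```python
def find_cycles(parents):
    """Find cycles in a {child_code: set_of_parent_codes} map using DFS.

    Reports one cycle per back edge, as a list of codes whose first code is
    repeated at the end. It does not enumerate every elementary cycle.
    """
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the current path, finished
    state = {}
    cycles = []

    def visit(node, path):
        state[node] = GREY
        path.append(node)
        for parent in sorted(parents.get(node, ())):
            parent_state = state.get(parent, WHITE)
            if parent_state == GREY:
                # The parent is already on the current path: a loop.
                cycles.append(path[path.index(parent):] + [parent])
            elif parent_state == WHITE:
                visit(parent, path)
        path.pop()
        state[node] = BLACK

    for node in sorted(parents):
        if state.get(node, WHITE) == WHITE:
            visit(node, [])
    return cycles


# Hypothetical code-level edges, obtained by collapsing the item chains
# listed in this ticket to their language codes.
sample = {
    "map": {"fox", "map"},  # map (Q244999) reaches fox and map (Q4922)
    "fox": {"map"},
    "dng": {"zh"},
    "zh": {"sit"},
}

cycles = find_cycles(sample)
for cycle in cycles:
    print(" -> ".join(cycle))
```

A check like this only locates the loops. Deciding which of the duplicate items should keep the code still needs a fix on the Wikidata side.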