Improved French and Italian tokenization

Description

I couldn't easily find a Trac ticket for this issue, so I'm submitting this one.

It would be helpful if the elided prepositions and articles attached to French and Italian words (for example l', d', qu') were detached during tokenization.

For French, one way to detach the articles is the following regular expression, with a break introduced between the regex groups.

([^'\u2019]|qu)(['\u2019]+)([^'\u2019]+)
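
As an illustration only, the French expression can be tried out in Python. The sketch below assumes the pattern is applied to a single whitespace-delimited token and must match the whole token; the ticket does not say how the expression would be anchored. The function name `split_french_token` is hypothetical, and returning the three groups as separate pieces is one reading of "the break is introduced between regex groups".

```python
import re

# Pattern copied from the ticket. Matching it against a whole token
# (fullmatch) is an assumption of this sketch, not something the ticket states.
FRENCH_ELISION = re.compile("([^'\u2019]|qu)(['\u2019]+)([^'\u2019]+)")


def split_french_token(token):
    """Split `token` between the regex groups, or return it unchanged."""
    match = FRENCH_ELISION.fullmatch(token)
    if match is None:
        # No single-letter or "qu" prefix followed by an apostrophe: leave as is.
        return [token]
    # Three pieces: elided prefix, apostrophe(s), remainder of the word.
    return list(match.groups())


if __name__ == "__main__":
    for word in ["l'homme", "qu'il", "l\u2019eau", "aujourd'hui"]:
        print(word, "->", split_french_token(word))
```

Under the whole-token assumption, a word such as "aujourd'hui" is left intact because its prefix is neither a single character nor "qu". With an unanchored search the same pattern would match inside it.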

For Italian, the following seems to work.

([^'\u2019]+)(['\u2019]+)([^'\u2019]*)
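
The Italian expression can be exercised the same way. This is a minimal Python sketch that assumes whole-token matching, which the ticket does not specify. The function name `split_italian_token` is hypothetical. Because the third group may be empty (as in "po'"), empty pieces are dropped here, which is a choice made for this sketch.

```python
import re

# Pattern copied from the ticket. Applying it to a whole token (fullmatch)
# is an assumption of this sketch.
ITALIAN_ELISION = re.compile("([^'\u2019]+)(['\u2019]+)([^'\u2019]*)")


def split_italian_token(token):
    """Split `token` between the regex groups, or return it unchanged."""
    match = ITALIAN_ELISION.fullmatch(token)
    if match is None:
        # No apostrophe, or more than one apostrophe run: leave as is.
        return [token]
    # The trailing group can be empty (e.g. "po'"), so drop empty pieces.
    return [piece for piece in match.groups() if piece]


if __name__ == "__main__":
    for word in ["dell'arte", "l\u2019amico", "po'", "casa"]:
        print(word, "->", split_italian_token(word))
```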

Environment:
Status:
Assignee: Andy Heninger
Reporter: George Rhoten
Labels:
Time Needed: Days
tracCreated: Oct 23, 2013, 11:04 PM
tracOwner: andy
tracProject: all
tracReporter: grhoten
tracStatus: accepted
Components:
Priority: medium