Python regex: tokenizing English contractions

Date: 2015-01-20 20:13:50

Tags: python regex pattern-matching nlp

I am trying to parse strings so as to separate out all word components, even those that have been contracted. For example, the tokenization of "shouldn't" would be ["should", "n't"].

The nltk module does not seem to be up to the task, however:

"I wouldn't've done that."

tokenizes as:

['I', "wouldn't", "'ve", 'done', 'that', '.']

where the desired tokenization of "wouldn't've" was: ['would', "n't", "'ve"]

After examining common English contractions, I am trying to write a regex to do the job, but I am having difficulty figuring out how to match "'ve" only once. For example, the following tokens can all terminate a contraction:

n't, 've, 'd, 'll, 's, 'm, 're

But the token "'ve" can also follow other contractions, such as:

'd've, n't've, and (conceivably) 'll've

At the moment, I am trying to wrangle this regex:

\b[a-zA-Z]+(?:('d|'ll|n't)('ve)?)|('s|'m|'re|'ve)\b

However, this pattern also matches the badly formed:

"wouldn't've've"

It seems the problem is that the third apostrophe qualifies as a word boundary, so that the final "'ve" token matches the whole regex.

I have been unable to think of a way to differentiate a word boundary from an apostrophe and, failing that, I am open to suggestions for alternative strategies.

Also, I am curious whether there is any way to include the word-boundary special character in a character class. According to the Python documentation, \b inside a character class matches a backspace, and there does not seem to be a way around this.
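A quick check (my own sketch, not part of the question) confirms what the documentation says about \b inside a character class:

```python
import re

# Outside a character class, \b asserts a zero-width word boundary.
assert re.search(r"\bve\b", "I 've done") is not None

# Inside a character class, \b matches the literal backspace character (U+0008),
# so [\b] cannot serve as a "word boundary or apostrophe" alternative.
assert re.search(r"[\b]", "a\bb") is not None    # "\b" in the string is a backspace
assert re.search(r"[\b']", "can't") is not None  # matches the apostrophe, not a boundary
```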

Edit:

Here is the output:

>>> pattern = re.compile(r"\b[a-zA-Z]+(?:('d|'ll|n't)('ve)?)|('s|'m|'re|'ve)\b")
>>> matches = pattern.findall("She'll wish she hadn't've done that.")
>>> print matches
[("'ll", '', ''), ("n't", "'ve", ''), ('', '', "'ve")]

I cannot make sense of the third match. In particular, I just realized that if the third apostrophe matches the leading \b, then I don't know what would be matching the character class [a-zA-Z]+.

5 answers:

Answer 0: (score: 2)

(?<!['"\w])(['"])?([a-zA-Z]+(?:('d|'ll|n't)('ve)?|('s|'m|'re|'ve)))(?(1)\1|(?!\1))(?!['"\w])
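For reference (my own test, not part of the answer), the pattern runs under Python's re module as-is, since re supports the conditional-group syntax (?(1)...):

```python
import re

# Answer 0's pattern, verbatim. The conditional (?(1)\1|(?!\1)) requires that
# a contraction opened with a quote character be closed by the same character.
pattern = re.compile(
    r"""(?<!['"\w])(['"])?([a-zA-Z]+(?:('d|'ll|n't)('ve)?|('s|'m|'re|'ve)))"""
    r"""(?(1)\1|(?!\1))(?!['"\w])"""
)

m = pattern.search("I wouldn't've done that.")
print(m.group(2))  # wouldn't've -- the whole contraction
print(m.group(3))  # n't         -- the first suffix
print(m.group(4))  # 've         -- the chained 've
```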

Edit: \2 is the match, \3 is the first group, \4 the second group, and \5 the third group.

Answer 1: (score: 2)

You can use the following complete regex:

import re
patterns_list = [r'\s', r'(n\'t)', r'(\'m)', r'(\'ll)', r'(\'ve)', r'(\'s)', r'(\'re)', r'(\'d)']
pattern=re.compile('|'.join(patterns_list))
s="I wouldn't've done that."

print [i for i in pattern.split(s) if i]

Result:

['I', 'would', "n't", "'ve", 'done', 'that.']

Answer 2: (score: 1)

You can use this regex to tokenize the text:

(?:(?!.')\w)+|\w?'\w+|[^\s\w]

Usage:

>>> re.findall(r"(?:(?!.')\w)+|\w?'\w+|[^\s\w]", "I wouldn't've done that.")
['I', 'would', "n't", "'ve", 'done', 'that', '.']

Answer 3: (score: 1)

>>> import nltk
>>> nltk.word_tokenize("I wouldn't've done that.")
['I', "wouldn't", "'ve", 'done', 'that', '.']

This way:

>>> from itertools import chain
>>> [nltk.word_tokenize(i) for i in nltk.word_tokenize("I wouldn't've done that.")]
[['I'], ['would', "n't"], ["'ve"], ['done'], ['that'], ['.']]
>>> list(chain(*[nltk.word_tokenize(i) for i in nltk.word_tokenize("I wouldn't've done that.")]))
['I', 'would', "n't", "'ve", 'done', 'that', '.']

Answer 4: (score: 0)

Here's a simple approach:

text = ' ' + text.lower() + ' '
text = text.replace(" won't ", ' will not ').replace("n't ", ' not ') \
    .replace("'s ", ' is ').replace("'m ", ' am ') \
    .replace("'ll ", ' will ').replace("'d ", ' would ') \
    .replace("'re ", ' are ').replace("'ve ", ' have ')
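A runnable version of the snippet above (the input sentence is my own example). Note that it expands contractions into full words rather than splitting them off as tokens, and a chained form like "n't've" is only partially expanded, because its "n't" is not followed by a space:

```python
text = "I wouldn't've done that."

# Pad with spaces so each replacement can anchor on a trailing space;
# "won't" is handled first because it expands irregularly to "will not".
text = ' ' + text.lower() + ' '
text = text.replace(" won't ", ' will not ').replace("n't ", ' not ') \
    .replace("'s ", ' is ').replace("'m ", ' am ') \
    .replace("'ll ", ' will ').replace("'d ", ' would ') \
    .replace("'re ", ' are ').replace("'ve ", ' have ')

print(text.split())  # ['i', "wouldn't", 'have', 'done', 'that.']
```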