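I am trying to parse strings so that all word components are separated out, even those that have been contracted. For example, the tokenization of "shouldn't" would be ["should", "n't"].
The nltk module does not seem to be up to the task, however, since:
"I wouldn't've done that."
tokenizes as:
['I', "wouldn't", "'ve", 'done', 'that', '.']
whereas the desired tokenization of "wouldn't've" is: ['would', "n't", "'ve"]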
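After examining common English contractions, I am trying to write a regex to do the job, but I am having a hard time figuring out how to match "'ve" only once. For example, the following tokens can all terminate a contraction:
n't, 've, 'd, 'll, 's, 'm, 're
But the token "'ve" can also follow other contractions, such as:
'd've, n't've, and (conceivably) 'll've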
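At the moment, I am trying to wrangle this regex:
\b[a-zA-Z]+(?:('d|'ll|n't)('ve)?)|('s|'m|'re|'ve)\b
However, this pattern also matches the badly formed:
"wouldn't've've"
It seems the problem is that the third apostrophe qualifies as a word boundary, so the final "'ve" token matches the whole regex.
I have been unable to think of a way to differentiate a word boundary from an apostrophe and, failing that, I am open to suggestions for alternative strategies.
Also, I am curious whether there is any way to include the word-boundary special character in a character class. According to the Python documentation, \b in a character class matches a backspace, and there doesn't seem to be a way around this.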
Edit:
Here is the output:
>>> import re
>>> pattern = re.compile(r"\b[a-zA-Z]+(?:('d|'ll|n't)('ve)?)|('s|'m|'re|'ve)\b")
>>> matches = pattern.findall("She'll wish she hadn't've've done that.")
>>> print(matches)
[("'ll", '', ''), ("n't", "'ve", ''), ('', '', "'ve")]
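I can't figure out the third match. In particular, I just realized that if the third apostrophe were matching the leading \b, then I don't know what would be matching the character class [a-zA-Z]+.
Answer 0 (score: 2)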
(?<!['"\w])(['"])?([a-zA-Z]+(?:('d|'ll|n't)('ve)?|('s|'m|'re|'ve)))(?(1)\1|(?!\1))(?!['"\w])
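Edit: \2 is the whole match, \3 is the first group, \4 is the second group, and \5 is the third group.
Answer 1 (score: 2)
You can use the following complete regex: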
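To see how this pattern treats the cases from the question, here is a minimal sketch. The pattern is copied from the answer; the helper name `find_contractions` and the test sentences are only illustrative. Because the alternation sits inside the group, a bare 've cannot match without a preceding word, and the lookarounds reject a match that touches another apostrophe, quote, or word character.

```python
import re

# Pattern copied verbatim from the answer above.
CONTRACTION = re.compile(
    r"""(?<!['"\w])(['"])?([a-zA-Z]+(?:('d|'ll|n't)('ve)?|('s|'m|'re|'ve)))(?(1)\1|(?!\1))(?!['"\w])"""
)

def find_contractions(text):
    """Hypothetical helper: return (\\2, \\3, \\4, \\5) for every contraction found."""
    return [(m.group(2), m.group(3), m.group(4), m.group(5))
            for m in CONTRACTION.finditer(text)]

# A well-formed stacked contraction is found, with both suffixes captured.
good = find_contractions("I wouldn't've done that.")
print(good)  # [("wouldn't've", "n't", "'ve", None)]

# The badly formed "wouldn't've've" is rejected as a whole.
bad = find_contractions("I wouldn't've've done that.")
print(bad)   # []
```

Note that this only finds contractions; plain words such as "done" are not matched, so a full tokenizer would still need a second pass over the remaining text.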
import re
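# Split on whitespace (discarded) or on a contraction suffix (kept, because it is captured).
# 'm is captured here as well; without the group it would be dropped from the output.
patterns_list = [r"\s", r"(n't)", r"('m)", r"('ll)", r"('ve)", r"('s)", r"('re)", r"('d)"]
pattern = re.compile('|'.join(patterns_list))
s = "I wouldn't've done that."
# Groups that did not take part in a split come back as None, so drop falsy items.
print([i for i in pattern.split(s) if i])
Result: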
['I', 'would', "n't", "'ve", 'done', 'that.']
Answer 2 (score: 1)
You can use this regex to tokenize the text:
(?:(?!.')\w)+|\w?'\w+|[^\s\w]
Usage:
>>> re.findall(r"(?:(?!.')\w)+|\w?'\w+|[^\s\w]", "I wouldn't've done that.")
['I', 'would', "n't", "'ve", 'done', 'that', '.']
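One thing worth checking with this pattern: `\w?'\w+` takes whatever letter precedes the apostrophe, which suits n't but also pulls the "e" out of "She'll". The sketch below is a possible variant, not part of the answer above, that lets only n't borrow a letter; the names `ORIGINAL`, `VARIANT`, and `tokenize` are hypothetical.

```python
import re

# Pattern from the answer above, kept for comparison.
ORIGINAL = re.compile(r"(?:(?!.')\w)+|\w?'\w+|[^\s\w]")

# Assumed variant: n't is matched explicitly, other suffixes start at the
# apostrophe, and a word stops just before a trailing n't.
VARIANT = re.compile(r"n't|'\w+|(?:(?!n't)\w)+|[^\s\w]")

def tokenize(text, pattern=VARIANT):
    """Return the list of tokens found by the given pattern."""
    return pattern.findall(text)

print(tokenize("She'll", ORIGINAL))  # ['Sh', "e'll"]
print(tokenize("She'll"))            # ['She', "'ll"]

tokens = tokenize("She'll wish she hadn't've done that.")
print(tokens)
# ['She', "'ll", 'wish', 'she', 'had', "n't", "'ve", 'done', 'that', '.']
```

Like the original, the variant accepts any `'\w+` as a suffix, so it does not validate that the contraction is well formed.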
Answer 3 (score: 1)
>>> import nltk
>>> nltk.word_tokenize("I wouldn't've done that.")
['I', "wouldn't", "'ve", 'done', 'that', '.']
So, tokenizing each token a second time:
>>> from itertools import chain
>>> [nltk.word_tokenize(i) for i in nltk.word_tokenize("I wouldn't've done that.")]
[['I'], ['would', "n't"], ["'ve"], ['done'], ['that'], ['.']]
>>> list(chain(*[nltk.word_tokenize(i) for i in nltk.word_tokenize("I wouldn't've done that.")]))
['I', 'would', "n't", "'ve", 'done', 'that', '.']
Answer 4 (score: 0)
Here is a simple one:
text = ' ' + text.lower() + ' '
text = text.replace(" won't ", ' will not ').replace("n't ", ' not ') \
.replace("'s ", ' is ').replace("'m ", ' am ') \
.replace("'ll ", ' will ').replace("'d ", ' would ') \
.replace("'re ", ' are ').replace("'ve ", ' have ')
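Because every suffix above is matched with a trailing space, a stacked contraction such as "wouldn't've" is only partly expanded when "n't " is tried before "'ve ". A minimal sketch of the same idea with 've handled first is shown below; the function name `expand_contractions` and the `REPLACEMENTS` table are hypothetical, and the expansions themselves are the ones used above.

```python
# Same replacements as above, but "'ve " comes before the suffixes it can follow,
# so "wouldn't've " first becomes "wouldn't have " and then "would not have ".
REPLACEMENTS = [
    (" won't ", " will not "),
    ("'ve ", " have "),
    ("n't ", " not "),
    ("'s ", " is "),
    ("'m ", " am "),
    ("'ll ", " will "),
    ("'d ", " would "),
    ("'re ", " are "),
]

def expand_contractions(text):
    """Lower-case the text and expand contractions that are followed by a space."""
    text = ' ' + text.lower() + ' '  # pad so matches at either end still work
    for suffix, expansion in REPLACEMENTS:
        text = text.replace(suffix, expansion)
    return text.strip()

print(expand_contractions("I wouldn't've done that."))  # i would not have done that.
print(expand_contractions("She won't go"))              # she will not go
```

This keeps the limits of the original: a contraction directly followed by punctuation is left alone, and "'s" and "'d" are always read as "is" and "would".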