I am trying to implement a regular-expression tokenizer with nltk in Python, but the result is this:
>>> import nltk
>>> text = 'That U.S.A. poster-print costs $12.40...'
>>> pattern = r'''(?x) # set flag to allow verbose regexps
... ([A-Z]\.)+ # abbreviations, e.g. U.S.A.
... | \w+(-\w+)* # words with optional internal hyphens
... | \$?\d+(\.\d+)?%? # currency and percentages, e.g. $12.40, 82%
... | \.\.\. # ellipsis
... | [][.,;"'?():-_`] # these are separate tokens; includes ], [
... '''
>>> nltk.regexp_tokenize(text, pattern)
[('', '', ''), ('', '', ''), ('', '-print', ''), ('', '', ''), ('', '', '')]
But the result I want is:
['That', 'U.S.A.', 'poster-print', 'costs', '$12.40', '...']
Why? Where is the mistake?
Answer 0 (score: 10)
You should turn all of the capturing groups into non-capturing ones:
([A-Z]\.)+ -> (?:[A-Z]\.)+
\w+(-\w+)* -> \w+(?:-\w+)*
\$?\d+(\.\d+)?%? -> \$?\d+(?:\.\d+)?%?
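The problem is that regexp_tokenize seems to be using re.findall, which returns the contents of the capturing groups rather than the whole match: when multiple capturing groups are defined in the pattern, it returns a list of tuples of captures. See this nltk.tokenize package reference: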
pattern (str) - The pattern used to build this tokenizer. (This pattern must not contain capturing parentheses; use non-capturing parentheses, e.g. (?:...), instead)
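To see the re.findall behavior in isolation, here is a minimal sketch that uses only the standard re module on the text from the question. The small patterns and the variable names are illustrative assumptions, not the internals of nltk:

```python
import re

text = 'That U.S.A. poster-print costs $12.40...'

# No capturing groups: findall returns the full text of every match.
no_groups = re.findall(r'\w+(?:-\w+)*', text)

# One capturing group: findall returns only what that group captured
# (an empty string when the group did not take part in the match).
one_group = re.findall(r'\w+(-\w+)*', text)

# Two capturing groups: findall returns one tuple of captures per match.
two_groups = re.findall(r'(\$?\d+)(\.\d+)?', text)

print(no_groups)   # ['That', 'U', 'S', 'A', 'poster-print', 'costs', '12', '40']
print(one_group)   # ['', '', '', '', '-print', '', '', '']
print(two_groups)  # [('$12', '.40')]
```

The empty strings and the lone '-print' are the same kind of residue that shows up in the output in the question.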
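Also, I am not sure you meant to use :-_, which is a range (from : to _) that matches, among other characters, all uppercase ASCII letters. To match a literal hyphen, put the - at the end of the character class.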
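The difference between the two character classes can be checked with a short sketch using only the standard re module. The sample strings and variable names are made up for illustration:

```python
import re

# Character class from the question: ':-_' is read as a range from ':' to '_'.
original_class = r'''[][.,;"'?():-_`]'''
# Character class from this answer: '-' comes last, so it is a literal hyphen.
fixed_class = r'''[][.,;"'?():_`-]'''

# A hyphen is not part of the ':' to '_' range, so the original class misses it.
hyphen_original = re.findall(original_class, 'a - b')
hyphen_fixed = re.findall(fixed_class, 'a - b')

# An uppercase letter falls inside the ':' to '_' range.
upper_original = re.findall(original_class, 'Q')
upper_fixed = re.findall(fixed_class, 'Q')

print(hyphen_original, hyphen_fixed)  # [] ['-']
print(upper_original, upper_fixed)    # ['Q'] []
```

In the full pattern the \w+ alternative comes first, so uppercase letters are normally consumed before the character class is tried. The visible effect of the original class is mainly that a standalone hyphen is never produced as a token.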
So, use
pattern = r'''(?x) # set flag to allow verbose regexps
(?:[A-Z]\.)+ # abbreviations, e.g. U.S.A.
| \w+(?:-\w+)* # words with optional internal hyphens
| \$?\d+(?:\.\d+)?%? # currency and percentages, e.g. $12.40, 82%
| \.\.\. # ellipsis
| [][.,;"'?():_`-] # these are separate tokens; includes ], [
'''
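As a quick check that does not require nltk, the corrected pattern can be run through re.findall directly. This is a sketch that assumes regexp_tokenize behaves like re.findall for this pattern, as suggested above; with nltk installed, nltk.regexp_tokenize(text, pattern) would be the call to try instead:

```python
import re

text = 'That U.S.A. poster-print costs $12.40...'

pattern = r'''(?x)          # set flag to allow verbose regexps
    (?:[A-Z]\.)+            # abbreviations, e.g. U.S.A.
  | \w+(?:-\w+)*            # words with optional internal hyphens
  | \$?\d+(?:\.\d+)?%?      # currency and percentages, e.g. $12.40, 82%
  | \.\.\.                  # ellipsis
  | [][.,;"'?():_`-]        # these are separate tokens; includes ], [
'''

# With no capturing groups left, findall returns the whole matches.
tokens = re.findall(pattern, text)
print(tokens)
# ['That', 'U.S.A.', 'poster-print', 'costs', '$12.40', '...']
```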