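I am trying to tokenize the sample text from the NLTK book (using Python 2.7), but the output does not match what I expect. Is there something I am missing?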
import nltk

text = 'That U.S.A. poster-print costs $12.40...'
pattern = r'''(?x) # set flag to allow verbose regexps
([A-Z]\.)+ # abbreviations, e.g. U.S.A.
| \w+(-\w+)* # words with optional internal hyphens
| \$?\d+(\.\d+)?%? # currency and percentages, e.g. $12.40, 82%
| \.\.\. # ellipsis
| [][.,;"'?():-_`] # these are separate tokens; includes ], [
'''
nltk.regexp_tokenize(text, pattern)
Output:
[('', '', ''),
('A.', '', ''),
('', '-print', ''),
('', '', ''),
('', '', '.40'),
('', '', '')]
Expected:
['That', 'U.S.A.', 'poster-print', 'costs', '$12.40', '...']
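A likely cause, assuming the installed NLTK version passes the pattern straight to `re.findall` (newer releases appear to do this): when a pattern contains capturing groups, `re.findall` returns the groups rather than the whole match, which would explain the tuples above. The sketch below uses only the standard `re` module to show the effect and a possible fix, turning each `(...)` into a non-capturing `(?:...)`. Moving the hyphen to the end of the character class is my own adjustment, since `:-_` is otherwise read as a range from `:` to `_`.

```python
import re

text = 'That U.S.A. poster-print costs $12.40...'

# With a capturing group, findall returns the group, not the full match.
captured = re.findall(r'\w+(-\w+)*', 'poster-print')
# With a non-capturing group, findall returns the full match.
whole = re.findall(r'\w+(?:-\w+)*', 'poster-print')

# Same pattern as in the question, with every group made non-capturing.
pattern = r'''(?x)          # set flag to allow verbose regexps
    (?:[A-Z]\.)+            # abbreviations, e.g. U.S.A.
  | \w+(?:-\w+)*            # words with optional internal hyphens
  | \$?\d+(?:\.\d+)?%?      # currency and percentages, e.g. $12.40, 82%
  | \.\.\.                  # ellipsis
  | [][.,;"'?():_`-]        # separate tokens; hyphen moved last (assumption)
'''

tokens = re.findall(pattern, text)
```

If NLTK is installed, passing this `pattern` to `nltk.regexp_tokenize(text, pattern)` should give the same list, though I have only checked the behaviour against `re.findall` here.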