
I am trying to tokenize the sample text from the NLTK book (using Python 2.7), but the output is not what I expected. Is there something I am missing?

import nltk

text = 'That U.S.A. poster-print costs $12.40...'

pattern = r'''(?x)     # set flag to allow verbose regexps
   ([A-Z]\.)+          # abbreviations, e.g. U.S.A.
   | \w+(-\w+)*        # words with optional internal hyphens
   | \$?\d+(\.\d+)?%?  # currency and percentages, e.g. $12.40, 82%
   | \.\.\.            # ellipsis
   | [][.,;"'?():-_`]  # these are separate tokens; includes ], [
   '''

nltk.regexp_tokenize(text, pattern)


Output: 
 [('', '', ''),
 ('A.', '', ''),
 ('', '-print', ''),
 ('', '', ''),
 ('', '', '.40'),
 ('', '', '')]

Expected:
['That', 'U.S.A.', 'poster-print', 'costs', '$12.40', '...']
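
The tuples above have the shape that Python's `re.findall` returns when a pattern contains capturing groups: one tuple of group contents per match instead of the whole match. The following is a minimal sketch, assuming `nltk.regexp_tokenize` hands the pattern to `re.findall` in the installed NLTK version (the post does not confirm this). It reproduces the behaviour with the standard `re` module alone and shows that switching each `(...)` to a non-capturing `(?:...)` yields the expected tokens. The names `capturing`, `fixed`, `tuples` and `tokens` are illustrative, and moving `-` to the end of the punctuation class is an extra change so that `:-_` is not read as a character range.

```python
import re

text = 'That U.S.A. poster-print costs $12.40...'

# Pattern as posted: three capturing groups, so findall returns 3-tuples.
capturing = r'''(?x)     # set flag to allow verbose regexps
   ([A-Z]\.)+          # abbreviations, e.g. U.S.A.
   | \w+(-\w+)*        # words with optional internal hyphens
   | \$?\d+(\.\d+)?%?  # currency and percentages, e.g. $12.40, 82%
   | \.\.\.            # ellipsis
   | [][.,;"'?():-_`]  # these are separate tokens; includes ], [
   '''

# Same alternatives with non-capturing groups, so findall returns whole matches.
fixed = r'''(?x)           # set flag to allow verbose regexps
   (?:[A-Z]\.)+          # abbreviations, e.g. U.S.A.
   | \w+(?:-\w+)*        # words with optional internal hyphens
   | \$?\d+(?:\.\d+)?%?  # currency and percentages, e.g. $12.40, 82%
   | \.\.\.              # ellipsis
   | [][.,;"'?():_`-]    # separate tokens; '-' moved last to avoid the ':-_' range
   '''

tuples = re.findall(capturing, text)  # group contents only
tokens = re.findall(fixed, text)      # full matched substrings

print(tuples)
print(tokens)
```

If the assumption about `re.findall` holds for the installed version, passing `fixed` to `nltk.regexp_tokenize(text, fixed)` should give the same list as `tokens`.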

Python: NLTK regular-expression tokenizer produces empty output

0 answers: