NLTK sent tokenizer error on CentOS and Windows

Date: 2018-03-11 05:54:34

Tags: python-2.7 nltk

Following this tutorial: http://textminingonline.com/dive-into-nltk-part-ii-sentence-tokenize-and-word-tokenize, I am trying to tokenize a text file and get these errors:

This is on CentOS 6.9:

Traceback (most recent call last):
  File "sentProc000.py", line 5, in <module>
    sent_tokenize_list = sent_tokenize(text)
  File "/usr/lib/python2.6/site-packages/nltk/tokenize/__init__.py", line 88, in sent_tokenize
    return tokenizer.tokenize(text)
  File "/usr/lib/python2.6/site-packages/nltk/tokenize/punkt.py", line 1226, in tokenize
    return list(self.sentences_from_text(text, realign_boundaries))
  File "/usr/lib/python2.6/site-packages/nltk/tokenize/punkt.py", line 1274, in sentences_from_text
    return [text[s:e] for s, e in self.span_tokenize(text, realign_boundaries)]
  File "/usr/lib/python2.6/site-packages/nltk/tokenize/punkt.py", line 1265, in span_tokenize
    return [(sl.start, sl.stop) for sl in slices]
  File "/usr/lib/python2.6/site-packages/nltk/tokenize/punkt.py", line 1304, in _realign_boundaries
    for sl1, sl2 in _pair_iter(slices):
  File "/usr/lib/python2.6/site-packages/nltk/tokenize/punkt.py", line 311, in _pair_iter
    for el in it:
  File "/usr/lib/python2.6/site-packages/nltk/tokenize/punkt.py", line 1280, in _slices_from_text
    if self.text_contains_sentbreak(context):
  File "/usr/lib/python2.6/site-packages/nltk/tokenize/punkt.py", line 1325, in text_contains_sentbreak
    for t in self._annotate_tokens(self._tokenize_words(text)):
  File "/usr/lib/python2.6/site-packages/nltk/tokenize/punkt.py", line 1460, in _annotate_second_pass
    for t1, t2 in _pair_iter(tokens):
  File "/usr/lib/python2.6/site-packages/nltk/tokenize/punkt.py", line 310, in _pair_iter
    prev = next(it)
  File "/usr/lib/python2.6/site-packages/nltk/tokenize/punkt.py", line 577, in _annotate_first_pass
    for aug_tok in tokens:
  File "/usr/lib/python2.6/site-packages/nltk/tokenize/punkt.py", line 542, in _tokenize_words
    for line in plaintext.split('\n'):
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 7: ordinal not in range(128)

And this on Windows 10:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python27\lib\site-packages\nltk\tokenize\__init__.py", line 94, in sent_tokenize
    return tokenizer.tokenize(text)
  File "C:\Python27\lib\site-packages\nltk\tokenize\punkt.py", line 1237, in tokenize
    return list(self.sentences_from_text(text, realign_boundaries))
  File "C:\Python27\lib\site-packages\nltk\tokenize\punkt.py", line 1285, in sentences_from_text
    return [text[s:e] for s, e in self.span_tokenize(text, realign_boundaries)]
  File "C:\Python27\lib\site-packages\nltk\tokenize\punkt.py", line 1276, in span_tokenize
    return [(sl.start, sl.stop) for sl in slices]
  File "C:\Python27\lib\site-packages\nltk\tokenize\punkt.py", line 1316, in _realign_boundaries
    for sl1, sl2 in _pair_iter(slices):
  File "C:\Python27\lib\site-packages\nltk\tokenize\punkt.py", line 311, in _pair_iter
    for el in it:
  File "C:\Python27\lib\site-packages\nltk\tokenize\punkt.py", line 1291, in _slices_from_text
    if self.text_contains_sentbreak(context):
  File "C:\Python27\lib\site-packages\nltk\tokenize\punkt.py", line 1337, in text_contains_sentbreak
    for t in self._annotate_tokens(self._tokenize_words(text)):
  File "C:\Python27\lib\site-packages\nltk\tokenize\punkt.py", line 1472, in _annotate_second_pass
    for t1, t2 in _pair_iter(tokens):
  File "C:\Python27\lib\site-packages\nltk\tokenize\punkt.py", line 310, in _pair_iter
    prev = next(it)
  File "C:\Python27\lib\site-packages\nltk\tokenize\punkt.py", line 577, in _annotate_first_pass
    for aug_tok in tokens:
  File "C:\Python27\lib\site-packages\nltk\tokenize\punkt.py", line 542, in _tokenize_words
    for line in plaintext.split('\n'):
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 7: ordinal not in range(128)

I tried this both manually with Python command-line commands, as described in the tutorial, and by running a .py script based on the same tutorial:

import nltk
from nltk.tokenize import sent_tokenize

with open('1', 'r') as content_file:
    text = content_file.read()

sent_tokenize_list = sent_tokenize(text)

thefile = open('result.txt', 'w')

for item in sent_tokenize_list:
  thefile.write("%s\n" % item)

UPDATE: I managed to track down where the problem is. It is the ’ character, which in UTF-8 shows up as a single-quote character. How can I fix this without having to edit all such characters manually?
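For illustration, a short Python 2 snippet (the byte string below is a hypothetical example, not taken from the actual file) showing why the default ASCII codec fails on that character:

# U+2019 (right single quotation mark) is three bytes in UTF-8; the leading
# byte 0xe2 is exactly what the ASCII codec rejects in the tracebacks above.
raw = 'It\xe2\x80\x99s fine'     # UTF-8 bytes as read from a file in default mode
ok = raw.decode('utf-8')         # u'It\u2019s fine' -- decodes cleanly
bad = raw.decode('ascii')        # raises UnicodeDecodeError: 'ascii' codec
                                 # can't decode byte 0xe2 ...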

1 Answer:

Answer 0 (score: 0)

The problem was indeed caused by decoding/encoding.

The working script reads the input file as UTF-8, so the text is decoded to unicode before it is tokenized, and the resulting sentences are encoded back to UTF-8 when written out.

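A minimal sketch of such a script (assuming the input file '1' is UTF-8 encoded; io.open is used so the same code runs on Python 2.6 and 2.7):

import io
from nltk.tokenize import sent_tokenize

# Read the file as UTF-8 so multi-byte characters such as the right single
# quotation mark (UTF-8 bytes 0xe2 0x80 0x99) are decoded to unicode instead
# of hitting the default ASCII codec inside the tokenizer.
with io.open('1', 'r', encoding='utf-8') as content_file:
    text = content_file.read()

sent_tokenize_list = sent_tokenize(text)

# Write the sentences back out as UTF-8; io.open in text mode expects unicode.
with io.open('result.txt', 'w', encoding='utf-8') as result_file:
    for item in sent_tokenize_list:
        result_file.write(u"%s\n" % item)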