I am following this tutorial: http://textminingonline.com/dive-into-nltk-part-ii-sentence-tokenize-and-word-tokenize and trying to tokenize a text file, but I get these errors.
This is on Python 2.6.9:
Traceback (most recent call last):
File "sentProc000.py", line 5, in <module>
sent_tokenize_list = sent_tokenize(text)
File "/usr/lib/python2.6/site-packages/nltk/tokenize/__init__.py", line 88, in sent_tokenize
return tokenizer.tokenize(text)
File "/usr/lib/python2.6/site-packages/nltk/tokenize/punkt.py", line 1226, in tokenize
return list(self.sentences_from_text(text, realign_boundaries))
File "/usr/lib/python2.6/site-packages/nltk/tokenize/punkt.py", line 1274, in sentences_from_text
return [text[s:e] for s, e in self.span_tokenize(text, realign_boundaries)]
File "/usr/lib/python2.6/site-packages/nltk/tokenize/punkt.py", line 1265, in span_tokenize
return [(sl.start, sl.stop) for sl in slices]
File "/usr/lib/python2.6/site-packages/nltk/tokenize/punkt.py", line 1304, in _realign_boundaries
for sl1, sl2 in _pair_iter(slices):
File "/usr/lib/python2.6/site-packages/nltk/tokenize/punkt.py", line 311, in _pair_iter
for el in it:
File "/usr/lib/python2.6/site-packages/nltk/tokenize/punkt.py", line 1280, in _slices_from_text
if self.text_contains_sentbreak(context):
File "/usr/lib/python2.6/site-packages/nltk/tokenize/punkt.py", line 1325, in text_contains_sentbreak
for t in self._annotate_tokens(self._tokenize_words(text)):
File "/usr/lib/python2.6/site-packages/nltk/tokenize/punkt.py", line 1460, in _annotate_second_pass
for t1, t2 in _pair_iter(tokens):
File "/usr/lib/python2.6/site-packages/nltk/tokenize/punkt.py", line 310, in _pair_iter
prev = next(it)
File "/usr/lib/python2.6/site-packages/nltk/tokenize/punkt.py", line 577, in _annotate_first_pass
for aug_tok in tokens:
File "/usr/lib/python2.6/site-packages/nltk/tokenize/punkt.py", line 542, in _tokenize_words
for line in plaintext.split('\n'):
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 7: ordinal not in range(128)
and this one on Windows 10:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Python27\lib\site-packages\nltk\tokenize\__init__.py", line 94, in sent_tokenize
return tokenizer.tokenize(text)
File "C:\Python27\lib\site-packages\nltk\tokenize\punkt.py", line 1237, in tokenize
return list(self.sentences_from_text(text, realign_boundaries))
File "C:\Python27\lib\site-packages\nltk\tokenize\punkt.py", line 1285, in sentences_from_text
return [text[s:e] for s, e in self.span_tokenize(text, realign_boundaries)]
File "C:\Python27\lib\site-packages\nltk\tokenize\punkt.py", line 1276, in span_tokenize
return [(sl.start, sl.stop) for sl in slices]
File "C:\Python27\lib\site-packages\nltk\tokenize\punkt.py", line 1316, in _realign_boundaries
for sl1, sl2 in _pair_iter(slices):
File "C:\Python27\lib\site-packages\nltk\tokenize\punkt.py", line 311, in _pair_iter
for el in it:
File "C:\Python27\lib\site-packages\nltk\tokenize\punkt.py", line 1291, in _slices_from_text
if self.text_contains_sentbreak(context):
File "C:\Python27\lib\site-packages\nltk\tokenize\punkt.py", line 1337, in text_contains_sentbreak
for t in self._annotate_tokens(self._tokenize_words(text)):
File "C:\Python27\lib\site-packages\nltk\tokenize\punkt.py", line 1472, in _annotate_second_pass
for t1, t2 in _pair_iter(tokens):
File "C:\Python27\lib\site-packages\nltk\tokenize\punkt.py", line 310, in _pair_iter
prev = next(it)
File "C:\Python27\lib\site-packages\nltk\tokenize\punkt.py", line 577, in _annotate_first_pass
for aug_tok in tokens:
File "C:\Python27\lib\site-packages\nltk\tokenize\punkt.py", line 542, in _tokenize_words
for line in plaintext.split('\n'):
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 7: ordinal not in range(128)
I tried this both manually with Python command-line commands, as described in the tutorial, and by running a .py script based on the same tutorial:
import nltk
from nltk.tokenize import sent_tokenize

with open('1', 'r') as content_file:
    text = content_file.read()

sent_tokenize_list = sent_tokenize(text)

thefile = open('result.txt', 'w')
for item in sent_tokenize_list:
    thefile.write("%s\n" % item)
UPDATE: I managed to figure out where the problem is. It is the ’ character in UTF-8, which gets displayed as a single-quote character. How can I fix this without having to manually edit out every such character?
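That diagnosis matches the traceback: ’ is U+2019 (RIGHT SINGLE QUOTATION MARK), and its UTF-8 encoding is the three bytes 0xe2 0x80 0x99, so 0xe2 is exactly the byte the ASCII codec rejects. A quick check in a Python 2 shell confirms it:

>>> u'\u2019'.encode('utf-8')
'\xe2\x80\x99'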
Answer 0 (score: 0)
The problem is indeed caused by decoding/encoding.
A working version of the script, sketched here under the assumption that it runs on Python 2 and that the input file '1' is UTF-8-encoded, decodes while reading and encodes while writing via codecs.open:
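# -*- coding: utf-8 -*-
import codecs
from nltk.tokenize import sent_tokenize

# Decode the file from UTF-8 while reading, so characters such as the
# right single quote (U+2019) arrive as unicode instead of raw bytes
# that punkt would later try (and fail) to decode as ASCII.
with codecs.open('1', 'r', encoding='utf-8') as content_file:
    text = content_file.read()

sent_tokenize_list = sent_tokenize(text)

# Encode each sentence back to UTF-8 when writing it out.
with codecs.open('result.txt', 'w', encoding='utf-8') as thefile:
    for item in sent_tokenize_list:
        thefile.write(u"%s\n" % item)

With the text decoded up front, sent_tokenize receives a unicode string, so the implicit ASCII decode inside punkt.py's _tokenize_words never happens.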