I am re-posting my earlier question, this time with the code I tried. I am writing a Python NLTK tagger program.
My input file contains multiple lines of Konkani (an Indian language) text. I think I need to handle the encoding of the input file. Please help.
My code is below. The input file contains a few sentences.
inputfile - ताजो स्वास आनी चकचकीत दांत तुमचें व्यक्तीमत्व परजळायतात.
दांत आशिल्ल्यान तुमचो आत्मविश्वासय वाडटा.
आमच्या हड्ड्यां आनी दांतां मदीं बॅक्टेरिया आसतात.
Code -
import nltk
file=open('kkn.txt')
t=file.read();
s=nltk.pos_tag(nltk.word_tokenize(t))
print(s)
The output shows an error -
Traceback (most recent call last):
File "G:/NLTK/inputKonkaniSentence.py", line 4, in <module>
t=file.read();
File "C:\Python34\lib\encodings\cp1252.py", line 23, in decode
return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x8d in position 21: character maps to <undefined>
Answer 0 (score: 0)
This is happening because the file you are trying to read is not encoded in CP1252, which is what Python is using by default. You will have to work out which encoding the file actually uses, and then specify that encoding when opening the file. For example:
file = open(filename, encoding="utf8")
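The suggestion above can be reproduced end to end with a minimal sketch (the filename `sample.txt` and the sample string are hypothetical, chosen to mirror the question's Devanagari input): writing the text as UTF-8, then reading it back with CP1252 fails exactly as in the traceback, while reading it back with an explicit `encoding="utf8"` succeeds.

```python
# Devanagari sample text, as in the question's input file.
text = "ताजो स्वास आनी चकचकीत दांत"

# Save it as UTF-8 (hypothetical filename for illustration).
with open('sample.txt', 'w', encoding='utf-8') as f:
    f.write(text)

# Reading with the Windows default codec (cp1252) fails, because the
# UTF-8 byte sequence contains bytes that cp1252 does not define
# (e.g. 0x8d, the byte named in the question's UnicodeDecodeError).
try:
    with open('sample.txt', encoding='cp1252') as f:
        f.read()
except UnicodeDecodeError as e:
    print('cp1252 failed:', e)

# Reading with the correct encoding round-trips the text intact.
with open('sample.txt', encoding='utf-8') as f:
    restored = f.read()
print(restored == text)
```

The same `encoding` argument works for writing; it only tells Python how to map between bytes on disk and `str` in memory, so the correct value is a property of the file, not of the program.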
Answer 1 (score: 0)
When the code is run as suggested -
import nltk
import re
import time
file = open('kkn.txt', encoding="utf-8")
file.read();
print (file)
n=nltk.pos_tag(nltk.word_tokenize(file))
print(n)
file.close()
Output -
<_io.TextIOWrapper name='kkn.txt' mode='r' encoding='utf-8'>
Traceback (most recent call last):
  File "G:\NLTK\try.py", line 10, in <module>
    n=nltk.pos_tag(nltk.word_tokenize(file))
  File "C:\Python34\lib\site-packages\nltk\tokenize\__init__.py", line 101, in word_tokenize
    return [token for sent in sent_tokenize(text, language)
  File "C:\Python34\lib\site-packages\nltk\tokenize\__init__.py", line 86, in sent_tokenize
    return tokenizer.tokenize(text)
  File "C:\Python34\lib\site-packages\nltk\tokenize\punkt.py", line 1226, in tokenize
    return list(self.sentences_from_text(text, realign_boundaries))
  File "C:\Python34\lib\site-packages\nltk\tokenize\punkt.py", line 1274, in sentences_from_text
    return [text[s:e] for s, e in self.span_tokenize(text, realign_boundaries)]
  File "C:\Python34\lib\site-packages\nltk\tokenize\punkt.py", line 1265, in span_tokenize
    return [(sl.start, sl.stop) for sl in slices]
  File "C:\Python34\lib\site-packages\nltk\tokenize\punkt.py", line 1265, in <listcomp>
    return [(sl.start, sl.stop) for sl in slices]
  File "C:\Python34\lib\site-packages\nltk\tokenize\punkt.py", line 1304, in _realign_boundaries
    for sl1, sl2 in _pair_iter(slices):
  File "C:\Python34\lib\site-packages\nltk\tokenize\punkt.py", line 310, in _pair_iter
    prev = next(it)
  File "C:\Python34\lib\site-packages\nltk\tokenize\punkt.py", line 1278, in _slices_from_text
    for match in self._lang_vars.period_context_re().finditer(text):
TypeError: expected string or buffer
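This second error is not about encoding: the code above calls `file.read()` but discards its result, then passes the file object itself to `nltk.word_tokenize`, which expects a string, hence the TypeError. A minimal sketch of the fix follows. To stay runnable without NLTK's punkt data it uses a plain whitespace split as a stand-in tokenizer; with NLTK installed, the split would be replaced by `nltk.word_tokenize(text)` and the result passed to `nltk.pos_tag`. The file contents here are hypothetical, mirroring the question's input.

```python
# Recreate the input file (sample sentence from the question).
with open('kkn.txt', 'w', encoding='utf-8') as f:
    f.write('ताजो स्वास आनी चकचकीत दांत तुमचें व्यक्तीमत्व परजळायतात.\n')

# Read the contents into a STRING first; the failing code discarded
# the result of file.read() and then passed the file object itself
# to word_tokenize, which raised "expected string or buffer".
with open('kkn.txt', encoding='utf-8') as f:
    text = f.read()

# Stand-in whitespace tokenizer; with NLTK this would be:
#   tokens = nltk.word_tokenize(text)
#   tagged = nltk.pos_tag(tokens)
tokens = text.split()
print(tokens)
```

The general rule: tokenizers operate on `str` values, so always read the file into a string (and close the file, ideally via `with`) before handing the text to NLTK.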