Unicode tagging of an input file in Python NLTK

Date: 2015-05-31 10:51:46

Tags: python nltk python-3.4

I am reposting an earlier question, this time with the code I have tried. I am developing a Python NLTK tagging program.

My input file is Konkani (an Indian language) text containing multiple lines. I think I need to specify an encoding for the input file. Please help.

My code is below; the input is a file containing a few sentences.

inputfile - ताजो स्वास आनी चकचकीत दांत तुमचें व्यक्तीमत्व परजळायतात.
दांत आशिल्ल्यान तुमचो आत्मविश्वासय वाडटा.
आमच्या हड्ड्यां आनी दांतां मदीं बॅक्टेरिया आसतात.

Code:

import nltk

file=open('kkn.txt')
t=file.read();
s=nltk.pos_tag(nltk.word_tokenize(t))

print(s)

The output shows an error:

>>> 
Traceback (most recent call last):
  File "G:/NLTK/inputKonkaniSentence.py", line 4, in <module>
    t=file.read();
  File "C:\Python34\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x8d in position 21: character maps to <undefined>
>>> 

2 Answers:

Answer 0 (score: 0)

This is happening because the file you are trying to read is not encoded in CP1252, which is the default on your system. You will have to work out which encoding the file actually uses, and then specify that encoding when opening it. For example:

file = open(filename, encoding="utf8")
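To illustrate the point (a minimal, self-contained sketch; the file name kkn.txt and the Devanagari sample are taken from the question, but the temporary directory is only for the demonstration), the UnicodeDecodeError above can be reproduced and then resolved by naming the encoding explicitly:

```python
import os
import tempfile

# Devanagari sample from the question. Its UTF-8 form contains the byte
# 0x8d (the second byte of the virama U+094D), which is undefined in cp1252.
text = "ताजो स्वास"
path = os.path.join(tempfile.mkdtemp(), "kkn.txt")

with open(path, "w", encoding="utf-8") as f:
    f.write(text)

# Reading the UTF-8 bytes as cp1252 fails, just like the traceback above.
try:
    with open(path, encoding="cp1252") as f:
        f.read()
except UnicodeDecodeError:
    print("cp1252 cannot decode this file")

# Declaring the correct encoding reads the text back intact.
with open(path, encoding="utf-8") as f:
    assert f.read() == text
```

On Windows the default text encoding is often cp1252, which is why the traceback points into encodings\cp1252.py even though the code never mentions that codec.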

Answer 1 (score: 0)

Executing the code as suggested:

import nltk
import re
import time

file = open('kkn.txt', encoding="utf-8")
file.read();
print (file)

n=nltk.pos_tag(nltk.word_tokenize(file))
print(n)

file.close()

Output:

<_io.TextIOWrapper name='kkn.txt' mode='r' encoding='utf-8'>
Traceback (most recent call last):
  File "G:\NLTK\try.py", line 10, in <module>
    n=nltk.pos_tag(nltk.word_tokenize(file))
  File "C:\Python34\lib\site-packages\nltk\tokenize\__init__.py", line 101, in word_tokenize
    return [token for sent in sent_tokenize(text, language)
  File "C:\Python34\lib\site-packages\nltk\tokenize\__init__.py", line 86, in sent_tokenize
    return tokenizer.tokenize(text)
  File "C:\Python34\lib\site-packages\nltk\tokenize\punkt.py", line 1226, in tokenize
    return list(self.sentences_from_text(text, realign_boundaries))
  File "C:\Python34\lib\site-packages\nltk\tokenize\punkt.py", line 1274, in sentences_from_text
    return [text[s:e] for s, e in self.span_tokenize(text, realign_boundaries)]
  File "C:\Python34\lib\site-packages\nltk\tokenize\punkt.py", line 1265, in span_tokenize
    return [(sl.start, sl.stop) for sl in slices]
  File "C:\Python34\lib\site-packages\nltk\tokenize\punkt.py", line 1265, in <listcomp>
    return [(sl.start, sl.stop) for sl in slices]
  File "C:\Python34\lib\site-packages\nltk\tokenize\punkt.py", line 1304, in _realign_boundaries
    for sl1, sl2 in _pair_iter(slices):
  File "C:\Python34\lib\site-packages\nltk\tokenize\punkt.py", line 310, in _pair_iter
    prev = next(it)
  File "C:\Python34\lib\site-packages\nltk\tokenize\punkt.py", line 1278, in _slices_from_text
    for match in self._lang_vars.period_context_re().finditer(text):
TypeError: expected string or buffer
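For reference, this TypeError arises because the file object itself, rather than the string returned by file.read(), is passed to nltk.word_tokenize (the read() result on the second line of the snippet is discarded). A minimal standard-library sketch of the distinction (kkn.txt stands in for the question's input file; the commented NLTK calls assume nltk and its tokenizer/tagger data are installed):

```python
import io
import os
import tempfile

# Create a stand-in for the question's kkn.txt input file.
path = os.path.join(tempfile.mkdtemp(), "kkn.txt")
with open(path, "w", encoding="utf-8") as f:
    f.write("ताजो स्वास आनी चकचकीत दांत तुमचें व्यक्तीमत्व परजळायतात.")

handle = open(path, encoding="utf-8")
print(type(handle))   # the _io.TextIOWrapper that was passed to word_tokenize
text = handle.read()  # the tokenizer needs this str, not the file object
handle.close()
print(type(text))

# Corrected calls, assuming nltk and its punkt/tagger data are available:
# tokens = nltk.word_tokenize(text)
# tagged = nltk.pos_tag(tokens)
```

This matches the output above: print(file) shows <_io.TextIOWrapper ...> because the open file object, not its text, was handed to the tokenizer.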