UnicodeDecodeError: 'utf8' codec can't decode byte 0xa0 in position 82: invalid start byte

Date: 2016-03-09 11:04:59

Tags: python utf-8 nltk

Code:

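import re
import pandas as pd

# data, sub_fullstop, token and pdf_ner are presumably defined earlier in the script (not shown)
main_df = pd.DataFrame()
for i in range(len(data)):
    print i
    df = pd.DataFrame()
    # Python 2: decode the raw byte string as Latin-1 to get a unicode object
    input_text = data['Input'][i].decode('latin-1', 'replace')
    # replace literal backslash-n sequences with a space
    input_text = re.sub(r'\\n',' ',input_text)
    pro_text = sub_fullstop(input_text)
    tok = token.tokenize(pro_text)
    tag = pdf_ner.tag(tok)  # this call raises the UnicodeDecodeError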

Error:

Traceback (most recent call last):
  File "<ipython-input-77-1a4b40a73374>", line 1, in <module>
    runfile('C:/Users/PurimetlaK.old/Documents/rohan/GPP_PDF_data preparation.py', wdir='C:/Users/PurimetlaK.old/Documents/rohan')
  File "C:\Anaconda\lib\site-packages\spyderlib\widgets\externalshell\sitecustomize.py", line 580, in runfile
    execfile(filename, namespace)
  File "C:/Users/PurimetlaK.old/Documents/rohan/GPP_PDF_data preparation.py", line 53, in <module>
    tag = pdf_ner.tag(tok)
  File "C:\Anaconda\lib\site-packages\nltk\tag\stanford.py", line 59, in tag
    return self.tag_sents([tokens])[0]
  File "C:\Anaconda\lib\site-packages\nltk\tag\stanford.py", line 82, in tag_sents
    stanpos_output = stanpos_output.decode(encoding)
  File "C:\Anaconda\lib\encodings\utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xa0 in position 82: invalid start byte
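
For context on the message itself: 0xa0 cannot begin a UTF-8 sequence, while in Latin-1 and cp1252 the same byte is a non-breaking space. The traceback shows the failure inside NLTK when the tagger's output is decoded with `encoding`, so one plausible reading (an assumption, not confirmed by the post) is that the tagger emits bytes in a different encoding than the one it is decoded with. The sketch below only reproduces the symptom on a made-up byte string; `raw` is hypothetical and nothing here calls NLTK.

```python
# Hypothetical sample bytes: 0xa0 is a non-breaking space in Latin-1/cp1252,
# but it is not a valid start byte in UTF-8.
raw = b"price:\xa0100"

try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError as exc:
    utf8_ok = False
    bad_position = exc.start  # index of the offending byte

# Decoding with the encoding the bytes were actually written in succeeds.
as_latin1 = raw.decode("latin-1")

# errors="replace" avoids the exception but substitutes U+FFFD for the bad byte.
as_replaced = raw.decode("utf-8", "replace")
```

If the mismatch is indeed on the tagger side, the thing to check would be which encoding the tagger object was constructed with, rather than how `input_text` is decoded in the loop.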

0 answers:

No answers