Code:
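import re
import pandas as pd

# data, token, pdf_ner and sub_fullstop are defined earlier in the script (not shown).
main_df = pd.DataFrame()
for i in range(len(data)):
    print i
    df = pd.DataFrame()
    # Decode the raw bytes as Latin-1, replacing anything that cannot be decoded.
    input_text = data['Input'][i].decode('latin-1', 'replace')
    # Replace literal backslash-n sequences with a space.
    input_text = re.sub(r'\\n',' ',input_text)
    pro_text = sub_fullstop(input_text)
    tok = token.tokenize(pro_text)
    tag = pdf_ner.tag(tok)  # the traceback points at this call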
Error:
Traceback (most recent call last):
  File "<ipython-input-77-1a4b40a73374>", line 1, in <module>
    runfile('C:/Users/PurimetlaK.old/Documents/rohan/GPP_PDF_data preparation.py', wdir='C:/Users/PurimetlaK.old/Documents/rohan')
  File "C:\Anaconda\lib\site-packages\spyderlib\widgets\externalshell\sitecustomize.py", line 580, in runfile
    execfile(filename, namespace)
  File "C:/Users/PurimetlaK.old/Documents/rohan/GPP_PDF_data preparation.py", line 53, in <module>
    tag = pdf_ner.tag(tok)
  File "C:\Anaconda\lib\site-packages\nltk\tag\stanford.py", line 59, in tag
    return self.tag_sents([tokens])[0]
  File "C:\Anaconda\lib\site-packages\nltk\tag\stanford.py", line 82, in tag_sents
    stanpos_output = stanpos_output.decode(encoding)
  File "C:\Anaconda\lib\encodings\utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xa0 in position 82: invalid start byte
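
For reference, the byte named in the error can be reproduced in isolation. In Latin-1, 0xa0 is the NO-BREAK SPACE character, and on its own it is not a valid UTF-8 sequence, which matches the "invalid start byte" message. The sketch below is only an illustration: RAW is a made-up sample, and try_decode and normalize_spaces are hypothetical helper names, not part of the script above or of NLTK. Whether normalizing the input text before tokenizing actually keeps that byte out of the tagger's output is an assumption that has not been checked against this setup.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function

# Hypothetical sample bytes; 0xa0 is NO-BREAK SPACE in Latin-1.
RAW = b"price\xa0list"


def try_decode(raw, encoding):
    """Return the decoded text, or None if raw is not valid in that encoding."""
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError:
        return None


def normalize_spaces(text):
    """Replace NO-BREAK SPACE (U+00A0) with an ordinary space."""
    return text.replace(u"\xa0", u" ")


# A lone 0xa0 byte is a UTF-8 continuation byte, so UTF-8 decoding fails.
utf8_result = try_decode(RAW, "utf-8")
# Latin-1 maps every byte to a code point, so this always succeeds.
latin1_result = try_decode(RAW, "latin-1")
cleaned = normalize_spaces(latin1_result)

print(repr(utf8_result), repr(cleaned))
```

In the loop above, a helper like normalize_spaces could be applied to input_text before sub_fullstop is called. Separately, the traceback shows that the decode inside tag_sents uses the tagger's own encoding setting, so the encoding argument passed when pdf_ner was constructed may also be worth checking.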