Python tokenization UnicodeDecodeError

Time: 2016-05-18 03:48:39

Tags: python nlp

I am trying to tokenize some documents, but I get this error:

  

UnicodeDecodeError: 'ascii' codec can't decode byte 0xef in position 6: ordinal not in range(128)

import nltk
import pandas as pd

df = pd.DataFrame(pd.read_csv('status2.csv'))
documents = df['status']

result = [nltk.word_tokenize(sent) for sent in documents]

I thought this was a Unicode problem, so I added:

documents = unicode(documents, 'utf-8')

That produces a different error:

  

TypeError: coercing to Unicode: need string or buffer, Series found

print documents

1      Brandon Cachia ,All I know is that,you're so n...
2      Melissa Zejtunija:HAM AND CHEESE BIEX INI??? *...
3                         .........Where is my mind?????
4      Having a philosophical discussion with Trudy D...

1 answer:

Answer 0 (score: 2)

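unicode operates on a string or bytes, but documents is a pandas Series, so it has to be applied to each element rather than to the Series as a whole.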

Perhaps decode each sentence inside the list comprehension:

result = [nltk.word_tokenize(unicode(sent, 'utf-8')) for sent in documents]
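The same per-element idea can be sketched without pandas or NLTK. This is only an illustration written for Python 3, where bytes.decode plays the role of Python 2's unicode(value, 'utf-8'). The helper name to_text, the sample data, and the use of str.split() in place of nltk.word_tokenize are assumptions made for the example, not part of the original code.

```python
def to_text(value, encoding="utf-8"):
    """Return value as text, decoding it only if it is a byte string."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    # Anything that is not bytes (already-decoded text, missing values)
    # is passed through unchanged.
    return value


# Hypothetical stand-in for the 'status' column: raw UTF-8 byte strings.
documents = [b"Where is my mind?", "caf\u00e9 time".encode("utf-8")]

# Decode element by element; the container itself is never decoded.
decoded = [to_text(sent) for sent in documents]

# str.split() is a simple stand-in for nltk.word_tokenize here.
result = [sent.split() for sent in decoded]
print(result)
```

The isinstance check is a design choice for this sketch: values that are not byte strings are returned as-is instead of raising the kind of TypeError shown in the question.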