How to convert a pandas DataFrame into a string or bytes-like object usable with NLTK

Asked: 2018-10-04 10:59:14

Tags: python pandas nltk

One column of my pandas DataFrame contains text, and I want to put the rows together as a single body of text for further NLTK processing.

    book    lines
0   dracula The Project Gutenberg EBook of Dracula, by Br...
1   dracula \n
2   dracula This eBook is for the use of anyone anywhere a...
3   dracula almost no restrictions whatsoever. You may co...
4   dracula re-use it under the terms of the Project Guten...

Here is my code:

list_of_words = [i.lower() for i in wordpunct_tokenize(data[0]['lines']) if i.lower() not in stop_words and i.isalpha()]

It raises this error:

Traceback (most recent call last):

File "<ipython-input-267-3bb703816dc6>", line 1, in <module>
list_of_words = [i.lower() for i in wordpunct_tokenize(data[0]['Injury_desc']) if i.lower() not in stop_words and i.isalpha()]

File "C:\Users\LIUX\AppData\Local\Continuum\anaconda3\lib\site-packages\nltk\tokenize\regexp.py", line 131, in tokenize
return self._regexp.findall(text)

TypeError: expected string or bytes-like object

1 Answer:

Answer 0 (score: 1)

The error occurs because you are passing a DataFrame to the wordpunct_tokenize function, which only accepts a string or bytes-like object.

You need to iterate over the rows and pass them to wordpunct_tokenize one line at a time.

list_of_words = []
for line in data['lines']:
    list_of_words.extend([i.lower() for i in wordpunct_tokenize(line) if i.lower() not in stop_words and i.isalpha()])
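The loop above can be run end to end as a minimal self-contained sketch. The sample DataFrame and the toy stopword set below are made up for illustration; in the original question, the real data comes from the Dracula ebook and `stop_words` presumably comes from NLTK's stopword corpus:

```python
import pandas as pd
from nltk.tokenize import wordpunct_tokenize

# Hypothetical sample data standing in for the book DataFrame
data = pd.DataFrame({
    "book": ["dracula", "dracula"],
    "lines": [
        "The Project Gutenberg EBook of Dracula",
        "This eBook is for the use of anyone",
    ],
})
stop_words = {"the", "of", "is", "for", "this"}  # toy stopword set

list_of_words = []
for line in data["lines"]:
    # wordpunct_tokenize accepts a single string, so pass one row at a time
    list_of_words.extend(
        i.lower()
        for i in wordpunct_tokenize(line)
        if i.lower() not in stop_words and i.isalpha()
    )

print(list_of_words)
```

Each row is a plain string, so `wordpunct_tokenize` no longer sees a DataFrame and the `TypeError` goes away.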

Hope this helps.
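Since the question's title asks for "a single piece of text", an alternative is to join the whole column into one string first and tokenize once. This is a sketch under the same assumed DataFrame layout and toy stopword set as above:

```python
import pandas as pd
from nltk.tokenize import wordpunct_tokenize

# Hypothetical sample data standing in for the book DataFrame
data = pd.DataFrame({
    "book": ["dracula", "dracula"],
    "lines": [
        "The Project Gutenberg EBook of Dracula",
        "This eBook is for the use of anyone",
    ],
})
stop_words = {"the", "of", "is", "for", "this"}  # toy stopword set

# Join every row of the column into a single string, then tokenize once
full_text = " ".join(data["lines"].astype(str))
list_of_words = [
    i.lower()
    for i in wordpunct_tokenize(full_text)
    if i.lower() not in stop_words and i.isalpha()
]
```

`astype(str)` guards against non-string rows (such as NaN) before joining; the result is the same flat word list as the row-by-row loop.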