How to convert a pandas DataFrame into a string or bytes-like object usable with NLTK

Asked: 2018-10-04 10:59:14

Tags: python pandas nltk

One column of my pandas DataFrame contains text, and I want to put the rows together as a single body of text for further NLTK processing.

    book    lines
0   dracula The Project Gutenberg EBook of Dracula, by Br...
1   dracula \n
2   dracula This eBook is for the use of anyone anywhere a...
3   dracula almost no restrictions whatsoever. You may co...
4   dracula re-use it under the terms of the Project Guten...

Here is my code:

list_of_words = [i.lower() for i in wordpunct_tokenize(data[0]['lines']) if i.lower() not in stop_words and i.isalpha()]

It raises this error:

Traceback (most recent call last):

File "<ipython-input-267-3bb703816dc6>", line 1, in <module>
list_of_words = [i.lower() for i in wordpunct_tokenize(data[0]['Injury_desc']) if i.lower() not in stop_words and i.isalpha()]

File "C:\Users\LIUX\AppData\Local\Continuum\anaconda3\lib\site-packages\nltk\tokenize\regexp.py", line 131, in tokenize
return self._regexp.findall(text)

TypeError: expected string or bytes-like object

1 Answer:

Answer 0 (score: 1)

The error occurs because you are passing a DataFrame to the wordpunct_tokenize function, which only accepts a string or bytes-like object.

You need to iterate over the rows and pass them to wordpunct_tokenize one line at a time.

list_of_words = []
for line in data['lines']:
    list_of_words.extend([i.lower() for i in wordpunct_tokenize(line) if i.lower() not in stop_words and i.isalpha()])
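The loop above can be run end to end as a minimal self-contained sketch. The sample DataFrame and the toy stopword set below are made up for illustration; in the original question, the real data comes from the Dracula ebook and `stop_words` presumably comes from NLTK's stopword corpus:

```python
import pandas as pd
from nltk.tokenize import wordpunct_tokenize

# Hypothetical sample data standing in for the book DataFrame
data = pd.DataFrame({
    "book": ["dracula", "dracula"],
    "lines": [
        "The Project Gutenberg EBook of Dracula",
        "This eBook is for the use of anyone",
    ],
})
stop_words = {"the", "of", "is", "for", "this"}  # toy stopword set

list_of_words = []
for line in data["lines"]:
    # wordpunct_tokenize accepts a single string, so pass one row at a time
    list_of_words.extend(
        i.lower()
        for i in wordpunct_tokenize(line)
        if i.lower() not in stop_words and i.isalpha()
    )

print(list_of_words)
```

Each row is a plain string, so `wordpunct_tokenize` no longer sees a DataFrame and the `TypeError` goes away.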

Hope this helps.
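Since the question's title asks for "a single piece of text", an alternative is to join the whole column into one string first and tokenize once. This is a sketch under the same assumed DataFrame layout and toy stopword set as above:

```python
import pandas as pd
from nltk.tokenize import wordpunct_tokenize

# Hypothetical sample data standing in for the book DataFrame
data = pd.DataFrame({
    "book": ["dracula", "dracula"],
    "lines": [
        "The Project Gutenberg EBook of Dracula",
        "This eBook is for the use of anyone",
    ],
})
stop_words = {"the", "of", "is", "for", "this"}  # toy stopword set

# Join every row of the column into a single string, then tokenize once
full_text = " ".join(data["lines"].astype(str))
list_of_words = [
    i.lower()
    for i in wordpunct_tokenize(full_text)
    if i.lower() not in stop_words and i.isalpha()
]
```

`astype(str)` guards against non-string rows (such as NaN) before joining; the result is the same flat word list as the row-by-row loop.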