One column of a pandas DataFrame contains text, and I want to combine it into a single body of text for further use with NLTK.
For example:
book lines
0 dracula The Project Gutenberg EBook of Dracula, by Br...
1 dracula \n
2 dracula This eBook is for the use of anyone anywhere a...
3 dracula almost no restrictions whatsoever. You may co...
4 dracula re-use it under the terms of the Project Guten...
Here is my code:
list_of_words = [i.lower() for i in wordpunct_tokenize(data[0]['lines']) if i.lower() not in stop_words and i.isalpha()]
which raises this error:
Traceback (most recent call last):
File "<ipython-input-267-3bb703816dc6>", line 1, in <module>
list_of_words = [i.lower() for i in wordpunct_tokenize(data[0]['Injury_desc']) if i.lower() not in stop_words and i.isalpha()]
File "C:\Users\LIUX\AppData\Local\Continuum\anaconda3\lib\site-packages\nltk\tokenize\regexp.py", line 131, in tokenize
return self._regexp.findall(text)
TypeError: expected string or bytes-like object
Answer 0 (score: 1)
The error occurs because you are passing a DataFrame, rather than a string, to wordpunct_tokenize, which expects a string or bytes-like object.
You need to iterate over the rows and pass the lines to wordpunct_tokenize one at a time:
list_of_words = []
for line in data['lines']:
    list_of_words.extend([i.lower() for i in wordpunct_tokenize(line)
                          if i.lower() not in stop_words and i.isalpha()])
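Alternatively, since the question asks for a single body of text, you can join the column into one string first and tokenize once. A minimal self-contained sketch of that approach is below; it uses hypothetical sample data in place of `data['lines']`, a small hand-made `stop_words` set in place of `nltk.corpus.stopwords.words('english')`, and the regex `\w+|[^\w\s]+` as a stand-in for `wordpunct_tokenize` (which is implemented with that pattern) so the example runs without NLTK installed:

```python
import re

# Hypothetical stand-in for data['lines'] (first rows of the Dracula text).
lines = [
    "The Project Gutenberg EBook of Dracula, by Bram Stoker",
    "\n",
    "This eBook is for the use of anyone anywhere",
]

# Toy stop-word set; in practice use nltk.corpus.stopwords.words('english').
stop_words = {"the", "of", "by", "this", "is", "for", "use", "anyone", "anywhere"}

# Join the whole column into one text, then tokenize once.
text = " ".join(lines)
# Same pattern wordpunct_tokenize uses: runs of word chars or of punctuation.
tokens = re.findall(r"\w+|[^\w\s]+", text)

list_of_words = [t.lower() for t in tokens
                 if t.lower() not in stop_words and t.isalpha()]
print(list_of_words)
# → ['project', 'gutenberg', 'ebook', 'dracula', 'bram', 'stoker', 'ebook']
```

With a real DataFrame you would replace `lines` with `data['lines']` (joining a Series of strings works the same way) and call `wordpunct_tokenize(text)` directly.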
Hope this helps.