The code I am using seems too slow, and maybe there is an alternative.
In Pandas, I tokenized the DataFrame column 'description', built a list of stopwords plus punctuation to remove, and then tried to strip out the useless words.
import numpy as np
import pandas as pd
import nltk
import string
nltk.download("stopwords")
nltk.download('punkt')
df2 = pd.read_csv('xxx')
After cleaning and so on, there are finally about 135,000 rows with no null values:
description points price
0 This tremendous 100% varietal wine hails from ... 96 235.0
1 Ripe aromas of fig, blackberry and cassis are ... 96 110.0
2 Mac Watson honors the memory of a wine once ma... 96 90.0
3 This spent 20 months in 30% new French oak, an... 96 65.0
4 This is the top wine from La Bégude, named aft... 95 66.0
Then I tokenize:
df2['description'] = df2.apply(lambda row:
nltk.word_tokenize(row['description']), axis=1)
df2.head()
Tokenizing is very fast. Now I define the useless words:
useless_words = (nltk.corpus.stopwords.words("english")
                 + list(string.punctuation))
Now I try to remove them from df2['description'] using the same trick:
df2['description'] = df2.apply(lambda row: [word for word in
    row['description'] if word not in useless_words], axis=1)
I hoped this would be faster, but the computation still takes a long time. I am new to coding, so I thought you might know an alternative way to handle this and cut the computation time. Maybe I am doing something wrong, I do not know, so I am asking in advance, and thank you.
Answer 0: (score: 1)
Have you tried this?
import re

df2["description"] = df2["description"].str.lower()
# re.escape keeps punctuation like '.' or '|' from being read as regex
# metacharacters; regex=True is required by current pandas for patterns.
pattern = "|".join(map(re.escape, useless_words))
df2["description"] = df2["description"].str.replace(pattern, "", regex=True)
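One caveat with joining the words by "|": short stopwords such as "a" or "is" would also match inside other words ("is" inside "this"), and punctuation entries are regex metacharacters. A hedged sketch (with a toy stopword list and sentence, not the real data) that escapes the pattern, adds word boundaries for the words, and strips punctuation in a separate pass:

```python
import re
import string

import pandas as pd

# Toy stand-in for nltk's English stopword list.
stopwords = ["this", "is", "the", "of", "and"]

df = pd.DataFrame({"description": ["This is the top wine of the estate."]})

df["description"] = df["description"].str.lower()

# \b word boundaries keep "is" from also eating the "is" inside "this".
stop_pat = r"\b(?:" + "|".join(map(re.escape, stopwords)) + r")\b"
df["description"] = df["description"].str.replace(stop_pat, "", regex=True)

# Punctuation has no word boundaries, so strip it via an escaped class.
punct_pat = "[" + re.escape(string.punctuation) + "]"
df["description"] = df["description"].str.replace(punct_pat, "", regex=True)

# Collapse the whitespace the removals leave behind.
df["description"] = df["description"].str.replace(r"\s+", " ", regex=True).str.strip()
print(df["description"].iloc[0])
```

Note that this operates on raw strings, so it would run before (or instead of) the tokenization step above.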