Question

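I have a DataFrame in which one column holds a sentence for each instance. I want to take each instance, remove the stop words, and join the results end to end into a single string. Any Python/pandas ideas?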

Thanks to all the SQL people who replied - I know I need to learn SQL. For now, I'm just looking for a python / pandas / nltk solution.

Answer 1

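You can use nltk for a pre-existing set of stop words, apply an element-wise operation to the column with map, and then use sum to concatenate the resulting strings.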

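import pandas as pd
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize


# Build the stop word set once so membership tests are fast
st_words = set(stopwords.words('english'))
dfrm = pd.DataFrame({
    'sentence_index': range(3),
    'sentence': ['the first sentence in the list',
                 'a second sentence in the list',
                 'this is the third sentence']
})
# Tokenize each sentence, drop stop words, and rejoin with a trailing space;
# .sum() on a column of strings then concatenates them end to end.
# w.lower() also catches capitalized stop words such as "The".
one_big_sentence = dfrm['sentence'].map(
    lambda s: ' '.join(
        [w for w in word_tokenize(s) if w.lower() not in st_words]
    ) + ' '
).sum()
print(one_big_sentence)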

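Note that you need to run nltk.download() beforehand and download the stopwords corpus and the punkt tokenizer model for the code above to work. Otherwise, you may get errors about nltk being unable to find the necessary data.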
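
If you would rather not click through the interactive downloader, the data can also be fetched by resource id. The following is a minimal sketch, not part of the original answer: the helper name download_nltk_data is made up for illustration, and including "punkt_tab" is an assumption, since newer NLTK releases may look for it instead of "punkt".

```python
import nltk

# Resource ids this answer relies on. "punkt_tab" is an assumption: newer
# NLTK releases may look for it instead of "punkt" when tokenizing.
NLTK_RESOURCES = ("stopwords", "punkt", "punkt_tab")


def download_nltk_data(resources=NLTK_RESOURCES):
    """Download each resource; return a dict of resource id -> success flag."""
    # nltk.download returns True on success and False on failure
    return {name: nltk.download(name, quiet=True) for name in resources}
```

Calling download_nltk_data() once (it needs network access) should be enough; an id that your NLTK version does not know about simply comes back as False.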

Running the stop word removal program above, saved as sentence.py, gives this output:

$ python sentence.py 
first sentence list second sentence list third sentence
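
As a possible variation on the answer's code, the final concatenation can be done with str.join instead of sum, which avoids the trailing space. This is only a sketch under stated assumptions: STOP_WORDS is a tiny hand-picked set used so the snippet runs without any NLTK data, and plain str.split stands in for word_tokenize (which would also separate punctuation).

```python
import pandas as pd

# Hypothetical hand-picked stop words, only so this sketch runs without NLTK
STOP_WORDS = {"the", "a", "in", "this", "is"}

dfrm = pd.DataFrame({
    "sentence": ["the first sentence in the list",
                 "a second sentence in the list",
                 "this is the third sentence"]
})

# Drop stop words from each sentence using simple whitespace splitting
filtered = dfrm["sentence"].map(
    lambda s: " ".join(w for w in s.split() if w.lower() not in STOP_WORDS)
)

# str.join puts a single space between sentences and none at the end
one_big_sentence = " ".join(filtered)
print(one_big_sentence)
```

For the three sample sentences this prints the same words as the output above, just without the trailing space. Swapping STOP_WORDS for set(stopwords.words('english')) and s.split() for word_tokenize(s) should bring it back in line with the nltk version.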
