Remove stop words from a pandas Series based on a list

Date: 2020-11-04 19:11:16

Tags: python pandas dataframe stop-words

I have the following DataFrame called sentences

data = ["Home of the Jacksons"], ["Is it the real thing?"], ["What is it with you?"], [ "Tomatoes are the best"] [ "I think it's best to path ways now"]


sentences = pd.DataFrame(data, columns = ['sentence'])

And another DataFrame called stopwords:

data = [["the"], ["it"], ["best"], [ "is"]]

stopwords = pd.DataFrame(data, columns = ['word'])

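I want to remove all the stop words from sentences['sentence']. I tried the code below, but it does not work. I think there is a problem with my if statement. Can anyone help?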

def remove_stopwords(input_string, stopwords_list):
    stopwords_list = list(stopwords_list)
    my_string_split = input_string.split(' ')
    my_string = []
    for word in my_string_split: 
        if word not in stopwords_list: 
            my_string.append(word)
        my_string = " ".join(my_string)
        return my_string

sentence['cut_string']= sentence.apply(lambda row: remove_stopwords(row['sentence'], stopwords['word']), axis=1)

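When I apply the function, it only returns the first string or first few strings of the sentence, and it does not cut out the stop words at all. Kinda stuck here.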

3 Answers:

Answer 0: (score: 1)

You can convert the stop words to a list and use a list comprehension to remove those words from each sentence,

stopword_list = stopwords['word'].tolist()

sentences['filtered'] = sentences['sentence'].apply(lambda x: ' '.join([i for i in x.split() if i not in stopword_list]))

You get

0                 Home of Jacksons
1                   Is real thing?
2                   What with you?
3                     Tomatoes are
4    I think it's to path ways now

Or you can wrap the code in a function,

def remove_stopwords(input_string, stopwords_list):     
    my_string = []
    for word in input_string.split(): 
        if word not in stopwords_list: 
            my_string.append(word)

    return " ".join(my_string)

stopword_list = stopwords['word'].tolist()
sentences['sentence'].apply(lambda row: remove_stopwords(row, stopword_list))
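
As for why the function in the question returns only the first word: judging from the indentation as posted, the join and the return sit inside the for loop, so the function exits on the first iteration. The sketch below reproduces that structure next to the corrected one; the names remove_stopwords_buggy and remove_stopwords_fixed are made up for illustration. Separately, the question calls sentence.apply(...) while the DataFrame is named sentences, which would raise a NameError before the function even runs.

```python
def remove_stopwords_buggy(input_string, stopwords_list):
    # Same structure as the function in the question: the join and the
    # return are indented inside the for loop.
    my_string = []
    for word in input_string.split(' '):
        if word not in stopwords_list:
            my_string.append(word)
        my_string = " ".join(my_string)
        return my_string  # leaves the function after the first word


def remove_stopwords_fixed(input_string, stopwords_list):
    # The join and the return run once, after every word has been checked.
    my_string = []
    for word in input_string.split(' '):
        if word not in stopwords_list:
            my_string.append(word)
    return " ".join(my_string)


stopword_list = ["the", "it", "best", "is"]
buggy = remove_stopwords_buggy("Home of the Jacksons", stopword_list)
fixed = remove_stopwords_fixed("Home of the Jacksons", stopword_list)
print(buggy)  # Home
print(fixed)  # Home of Jacksons
```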

Answer 1: (score: 1)

There are a lot of syntax errors in the code above. If you keep the stop words as a list (or set) rather than a DataFrame, the following will work:

data = ["Home of the Jacksons", "Is it the real thing?", "What is it with you?", "Tomatoes are the best", "I think it's best to path ways now"]
sentences = pd.DataFrame(data, columns = ['sentence'])

stopwords = ["the", "it", "best", "is"]


sentences.sentence.str.split().apply(lambda x: " ".join([y for y in x if y not in stopwords]))
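
Note that the test y not in stopwords is case-sensitive, so a capitalized "Is" at the start of a sentence is kept even though "is" is in the list. If case-insensitive matching is what you want (an assumption, the question does not say), a minimal sketch could compare the lowercased token instead. The helper name remove_stopwords_ci is hypothetical.

```python
stopwords = {"the", "it", "best", "is"}


def remove_stopwords_ci(text, stop_set):
    # Compare the lowercased token, but keep the original spelling in the output.
    return " ".join(w for w in text.split() if w.lower() not in stop_set)


result = remove_stopwords_ci("Is it the real thing?", stopwords)
print(result)  # real thing?
```

With the DataFrame from this answer, it could be applied as sentences.sentence.apply(lambda s: remove_stopwords_ci(s, stopwords)). Tokens with attached punctuation such as "thing?" are still compared as-is.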

Answer 2: (score: 1)

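The key to success is converting the stop word list to a set(): a membership lookup in a set takes O(1) time on average, whereas in a list it takes O(N).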

stop_set = set(stopwords.word.tolist())
sentences.sentence.str.split()\
         .apply(lambda x: ' '.join(w for w in x if w not in stop_set))
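
To get a feel for the difference, here is a rough timing sketch using only the standard library. The 10,000-entry word list is made up purely to make the gap visible; with the four stop words from the question the difference would be negligible, and the actual numbers depend on your machine.

```python
import timeit

# Hypothetical large stop word list, used only for illustration.
words = [f"word{i}" for i in range(10_000)]
stop_list = words
stop_set = set(words)

# The last element is the worst case for a list, which is scanned front to back.
probe = "word9999"

list_time = timeit.timeit(lambda: probe in stop_list, number=1_000)
set_time = timeit.timeit(lambda: probe in stop_set, number=1_000)
print(f"list: {list_time:.6f}s  set: {set_time:.6f}s")
```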