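Return a list of each word in a pandas cell together with that word's total count across the whole column

Time: 2017-10-01 07:50:45

Tags: python scikit-learn word-frequency countvectorizer

I have a pandas DataFrame, df, that looks like this: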

             column1
0   apple is a fruit
1        fruit sucks
2  apple tasty fruit
3   fruits what else
4      yup apple map
5   fire in the hole
6       that is true

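I want to generate a column2 that holds, for each row, a list of the words in that row along with each word's total count across the entire column. So the output would look something like this: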

    column1            column2
0   apple is a fruit   [('apple', 3),('is', 2),('a', 1),('fruit', 3)]
1        fruit sucks   [('fruit', 3),('sucks', 1)]

I tried using sklearn but could not get the result above. Any help is appreciated.

from sklearn.feature_extraction.text import CountVectorizer
v = CountVectorizer()
x = v.fit_transform(df['column1'])

2 Answers:

Answer 0: (score: 0)

Here is one way to get the desired result, although it avoids sklearn entirely:

import re

def counts(data, column):
    full_list = []
    datr = data[column].tolist()
    # every word in the column, used for the column-wide totals
    total_words = " ".join(datr).split(' ')
    # per row
    for i in range(len(datr)):
        # first get the words of this row (non-word characters become spaces)
        word_list = re.sub(r"[^\w]", " ", datr[i]).split()
        # cycle per word
        total_row = []
        for word in word_list:
            count = total_words.count(word)
            val = (word, count)
            total_row.append(val)
        full_list.append(total_row)
    return full_list

df['column2'] = counts(df,'column1')
df
         column1                                    column2
0   apple is a fruit  [(apple, 3), (is, 2), (a, 1), (fruit, 3)]
1        fruit sucks                   [(fruit, 3), (sucks, 1)]
2  apple tasty fruit       [(apple, 3), (tasty, 1), (fruit, 3)]
3   fruits what else        [(fruits, 1), (what, 1), (else, 1)]
4      yup apple map           [(yup, 1), (apple, 3), (map, 1)]
5   fire in the hole  [(fire, 1), (in, 1), (the, 1), (hole, 1)]
6       that is true            [(that, 1), (is, 2), (true, 1)]
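The function above calls total_words.count(word) once per word, so it rescans the whole word list every time. A minimal sketch of the same idea with collections.Counter, which tallies the column once, is shown below. The helper name words_with_totals is hypothetical, and it assumes plain whitespace splitting rather than the regex cleanup used above.

```python
from collections import Counter

def words_with_totals(texts):
    """For each text, return a list of (word, total count across all texts).

    Hypothetical helper: splits on whitespace only, with no punctuation handling.
    """
    rows = [str(text).split() for text in texts]
    # Count every word in the whole column a single time.
    totals = Counter(word for row in rows for word in row)
    # Look up the column-wide total for each word, keeping row order.
    return [[(word, totals[word]) for word in row] for row in rows]

column1 = [
    "apple is a fruit",
    "fruit sucks",
    "apple tasty fruit",
    "fruits what else",
    "yup apple map",
    "fire in the hole",
    "that is true",
]
column2 = words_with_totals(column1)
print(column2[0])  # [('apple', 3), ('is', 2), ('a', 1), ('fruit', 3)]
```

Because the helper accepts any iterable of strings, it should also work on a Series, e.g. df['column2'] = words_with_totals(df['column1']).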

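Answer 1: (score: -1)

I don't know whether you can do this with scikit-learn, but you can write a function and then use apply() to apply it to your DataFrame or Series.

Here is how you could do it for your example: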

import pandas as pd

test = pd.DataFrame(['apple is a fruit', 'fruit sucks', 'apple tasty fruit'], columns = ['A'])

def a_function(row):
    splitted_row = str(row.values[0]).split()
    word_occurences = []
    for word in splitted_row:
        column_occurences = test.A.str.count(word).sum()
        word_occurences.append((word, column_occurences))
    return word_occurences

test.apply(a_function, axis = 1)

# Output
0    [(apple, 2), (is, 1), (a, 4), (fruit, 3)]
1                     [(fruit, 3), (sucks, 1)]
2         [(apple, 2), (tasty, 1), (fruit, 3)]
dtype: object

As you can see, the main problem is that test.A.str.count(word) counts every occurrence of the pattern assigned to word anywhere in the string, including inside longer words. That is why "a" shows up 4 times. This could probably be fixed fairly easily with a regular expression (which I'm not very good at).
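One possible regex fix, sketched here as an assumption rather than as the answerer's own solution, is to wrap the escaped word in \b word boundaries so that only whole words match. The helper count_whole_word is a hypothetical name, and the example uses plain strings so it runs without pandas.

```python
import re

def count_whole_word(texts, word):
    """Count occurrences of `word` as a whole word across all texts.

    Hypothetical helper: \\b anchors the match at word boundaries and
    re.escape keeps any special characters in `word` literal.
    """
    pattern = re.compile(r"\b" + re.escape(word) + r"\b")
    return sum(len(pattern.findall(text)) for text in texts)

texts = ["apple is a fruit", "fruit sucks", "apple tasty fruit"]

substring_hits = sum(text.count("a") for text in texts)  # matches inside words too
whole_word_hits = count_whole_word(texts, "a")           # standalone "a" only

print(substring_hits)   # 4
print(whole_word_hits)  # 1
```

Since Series.str.count treats its argument as a regular expression, the same pattern string should also be usable as test.A.str.count(r"\b" + re.escape(word) + r"\b").sum() inside a_function.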

Alternatively, if you are willing to lose some words, a workaround can be applied inside the function above.