I have a pandas DataFrame like the one below:
   key  message                                          Final Category
0    1  I have not received my gifts which I ordered ok  voucher
1    2  hth her wells idyll McGill kooky bbc.co          noclass
2    3  test test test 1 test                            noclass
3    4  test                                             noclass
4    5  hello where is my reward points                  other
5    6  hi, can you get koovs coupons or vouchers here   options
6    7  Hi Hey when you people will include amazon an    options
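I want to get a data structure of the form {key: {key: value}, ...}, where the outer grouping is by Final Category and each category maps to a dictionary of word frequencies. For example, grouping all the noclass rows would give something like {'noclass': {'test': 5, '1': 1, 'hth': 1, 'her': 1, ...}, ...}.
I am new to Stack Overflow, so apologies if this is poorly written. Thanks.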
Answer 0: (score: 0)
There is probably a more elegant way to do this, but here is a set of nested for loops:
final_cat_list = df['Final Category'].unique()
word_count = {}
for f in final_cat_list:
    word_count[f] = {}
    # the text column is 'message' ('key' is a separate column)
    message_list = list(df.loc[df['Final Category'] == f, 'message'])
    for m in message_list:
        # split() with no argument splits on any run of whitespace
        word_list = m.split()
        for w in word_list:
            if w in word_count[f]:
                word_count[f][w] += 1
            else:
                word_count[f][w] = 1
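The same counting can be sketched more compactly with `collections.defaultdict` and `collections.Counter` from the standard library. This is only a minimal sketch, not the answerer's code. It assumes the rows are available as (category, message) pairs, which with the DataFrame from the question could be obtained with `zip(df['Final Category'], df['message'])`. The `rows` list and the `count_words_by_category` helper are hypothetical names introduced for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical sample rows, copied from the question as (category, message) pairs.
rows = [
    ("voucher", "I have not received my gifts which I ordered ok"),
    ("noclass", "hth her wells idyll McGill kooky bbc.co"),
    ("noclass", "test test test 1 test"),
    ("noclass", "test"),
    ("other", "hello where is my reward points"),
]


def count_words_by_category(pairs):
    """Return {category: {word: frequency}} for (category, message) pairs."""
    counts = defaultdict(Counter)
    for category, message in pairs:
        # split() with no argument splits on any run of whitespace
        counts[category].update(message.split())
    # Convert to plain dicts so the result has the {key: {key: value}} shape
    return {category: dict(counter) for category, counter in counts.items()}


word_count = count_words_by_category(rows)
print(word_count["noclass"]["test"])  # 5
```

Converting each `Counter` to a plain `dict` at the end is optional; a `Counter` already behaves like a dictionary for lookups.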
Answer 1: (score: 0)
This modifies the original df, so you may want to copy it first.
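from collections import Counter
# append a trailing space so words from adjacent messages do not fuse when concatenated
df["message"] = df["message"] + " "
# select only the message column; calling split() on the summed numeric 'key' column would fail
df.groupby("Final Category")["message"].sum().apply(lambda message: Counter(message.split()))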
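What this code does: first, it appends a space to the end of every message. This becomes important later. It then groups by Final Category and sums the messages within each group. This is where the trailing space matters, because otherwise the last word of one message would be fused to the first word of the next. (Summing strings concatenates them.)
Finally, it splits the concatenated string on whitespace to get the words and counts them.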
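To see why the trailing space matters, here is a small sketch that uses plain Python strings rather than the DataFrame. The two messages are taken from the noclass rows in the question, and the variable names are only illustrative.

```python
from collections import Counter

messages = ["test test test 1 test", "test"]

# Plain concatenation fuses the last word of one message
# with the first word of the next.
fused = "".join(messages)
print(Counter(fused.split())["testtest"])  # 1

# Appending a space to each message first keeps the words apart.
spaced = "".join(m + " " for m in messages)
counts = Counter(spaced.split())
print(counts["test"])  # 5
```

Joining with `" ".join(messages)` gives the same counts without altering the messages themselves.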
Answer 2: (score: 0)
import pandas as pd

# copy/paste data (you can skip this since you already have a DataFrame)
# named 'data' rather than 'dict' to avoid shadowing the built-in
data = {
    0: {'key': 1, 'message': "I have not received my gifts which I ordered ok", 'Final Category': 'voucher'},
    1: {'key': 2, 'message': "hth her wells idyll McGill kooky bbc.co", 'Final Category': 'noclass'},
    2: {'key': 3, 'message': "test test test 1 test", 'Final Category': 'noclass'},
    3: {'key': 4, 'message': "test", 'Final Category': 'noclass'},
    4: {'key': 5, 'message': "hello where is my reward points", 'Final Category': 'other'},
    5: {'key': 6, 'message': "hi, can you get koovs coupons or vouchers here", 'Final Category': 'options'},
    6: {'key': 7, 'message': "Hi Hey when you people will include amazon an", 'Final Category': 'options'},
}

# make DataFrame (you already have one)
df = pd.DataFrame(data).T

# break up text into lists of words, then concatenate the lists per 'Final Category'
df['message'] = df['message'].str.split()
final_df = df.groupby('Final Category')[['message']].sum()

# make final dictionary: {category: {word: count}}
final_dict = {}
for label, text in zip(final_df.index, final_df.message):
    final_dict[label] = {w: text.count(w) for w in text}
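One note on the last step: `text.count(w)` rescans the whole list for every word, so the comprehension does quadratic work in the number of words. A possible alternative is `collections.Counter`, which makes a single pass and should produce the same mapping. The `words` list below is a hypothetical stand-in for one group's concatenated word list.

```python
from collections import Counter

# Hypothetical stand-in for the concatenated word list of the 'noclass' group
words = "hth her wells idyll McGill kooky bbc.co test test test 1 test test".split()

by_count = {w: words.count(w) for w in words}  # approach used in the answer
by_counter = dict(Counter(words))              # single-pass alternative

print(by_count == by_counter)  # True
```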