I have the following sample dataframe:
No category problem_definition
175 2521 ['coffee', 'maker', 'brewing', 'properly', '2', '420', '420', '420']
211 1438 ['galley', 'work', 'table', 'stuck']
912 2698 ['cloth', 'stuck']
572 2521 ['stuck', 'coffee']
The problem_definition field has already been tokenized, with stop words removed.
I want to create a frequency distribution that outputs another pandas dataframe with:
1) the frequency of each word occurring in problem_definition
2) the frequency of each word occurring in problem_definition, per category field
Sample desired output for case 1 below:
text count
coffee 2
maker 1
brewing 1
properly 1
2 1
420 3
stuck 3
galley 1
work 1
table 1
cloth 1
Sample desired output for case 2 below:
category text count
2521 coffee 2
2521 maker 1
2521 brewing 1
2521 properly 1
2521 2 1
2521 420 3
2521 stuck 1
1438 galley 1
1438 work 1
1438 table 1
1438 stuck 1
2698 cloth 1
2698 stuck 1
I tried the following code to accomplish 1):
from nltk.probability import FreqDist
import pandas as pd
fdist = FreqDist(df['problem_definition_stopwords'])
This throws: TypeError: unhashable type: 'list'
I don't know how to do 2) at all.
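A minimal sketch of what goes wrong in 1) and a direct fix, assuming the tokenized column is named problem_definition as in the sample above (the attempt uses problem_definition_stopwords; the fix is the same either way): FreqDist hashes each sample it counts, and here every row of the column is a whole list of tokens, which is unhashable, so the lists have to be flattened into one stream of words first.
from itertools import chain
from nltk.probability import FreqDist
# Each row of the column is a list of tokens; chain them into one flat
# iterable of words so FreqDist counts individual words, not lists.
fdist = FreqDist(chain.from_iterable(df['problem_definition']))
# fdist is now a Counter-like mapping, e.g. fdist['stuck'] == 3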
Answer 0 (score: 0)
I walk through solving this kind of problem step by step using unnesting (the helper function is defined at the end of this answer); for fun, I am just linking the related question here.
unnesting(df,['problem_definition'])
Out[288]:
problem_definition No category
0 coffee 175 2521
0 maker 175 2521
0 brewing 175 2521
0 properly 175 2521
0 2 175 2521
0 420 175 2521
0 420 175 2521
0 420 175 2521
1 galley 211 1438
1 work 211 1438
1 table 211 1438
1 stuck 211 1438
2 cloth 912 2698
2 stuck 912 2698
3 stuck 572 2521
3 coffee 572 2521
Then, for case 2, just a regular groupby + size:
unnesting(df,['problem_definition']).groupby(['category','problem_definition']).size()
Out[290]:
category problem_definition
1438 galley 1
stuck 1
table 1
work 1
2521 2 1
420 3
brewing 1
coffee 2
maker 1
properly 1
stuck 1
2698 cloth 1
stuck 1
dtype: int64
For case 1, use value_counts:
unnesting(df,['problem_definition'])['problem_definition'].value_counts()
Out[291]:
stuck 3
420 3
coffee 2
table 1
maker 1
2 1
brewing 1
galley 1
work 1
cloth 1
properly 1
Name: problem_definition, dtype: int64
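Both results above are Series rather than DataFrames; a small sketch, assuming the unnesting helper defined below, to shape them into the text/count and category/text/count frames the question asks for:
flat = unnesting(df, ['problem_definition'])
# case 1: word -> count as a two-column DataFrame
case1 = (flat['problem_definition']
         .value_counts()
         .rename_axis('text')
         .reset_index(name='count'))
# case 2: (category, word) -> count as a three-column DataFrame
case2 = (flat.groupby(['category', 'problem_definition'])
             .size()
             .reset_index(name='count')
             .rename(columns={'problem_definition': 'text'}))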
The unnesting function I defined myself:
import numpy as np

def unnesting(df, explode):
    # repeat each original index once per element of its list
    idx = df.index.repeat(df[explode[0]].str.len())
    # flatten each listed column into one long column
    df1 = pd.concat([pd.DataFrame({x: np.concatenate(df[x].values)}) for x in explode], axis=1)
    df1.index = idx
    # join the remaining (non-exploded) columns back on
    return df1.join(df.drop(explode, axis=1), how='left')
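As a side note, a sketch assuming pandas 0.25 or newer: DataFrame.explode performs the same unnesting natively, so both cases can be reproduced on the question's sample data without the helper:
import pandas as pd

df = pd.DataFrame({
    'No': [175, 211, 912, 572],
    'category': [2521, 1438, 2698, 2521],
    'problem_definition': [
        ['coffee', 'maker', 'brewing', 'properly', '2', '420', '420', '420'],
        ['galley', 'work', 'table', 'stuck'],
        ['cloth', 'stuck'],
        ['stuck', 'coffee'],
    ],
})

exploded = df.explode('problem_definition')  # one row per word

# case 1: overall word frequencies
print(exploded['problem_definition'].value_counts())

# case 2: word frequencies per category
print(exploded.groupby(['category', 'problem_definition']).size())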
Answer 1 (score: 0)
You can also expand the lists per category and then do a groupby + size.
import pandas as pd
import numpy as np
df = pd.DataFrame( {'No':[175,572],
'category':[2521,2521],
'problem_definition': [['coffee', 'maker', 'brewing', 'properly', '2', '420', '420', '420'],
['stuck', 'coffee']]} )
c = df.groupby('category')['problem_definition'].agg('sum').reset_index()
lst_col = 'problem_definition'
c = pd.DataFrame({
col:np.repeat(c[col].values, c[lst_col].str.len())
for col in c.columns.drop(lst_col)}
).assign(**{lst_col:np.concatenate(c[lst_col].values)})[c.columns]
c.groupby(['category','problem_definition']).size()
>>
category problem_definition
2521 2 1
420 3
brewing 1
coffee 2
maker 1
properly 1
stuck 1
dtype: int64
Or you can use a Counter to store the counts grouped by category:
import pandas as pd
import numpy as np
from collections import Counter
df = pd.DataFrame( {'No':[175,572],
'category':[2521,2521],
'problem_definition': [['coffee', 'maker', 'brewing', 'properly', '2', '420', '420', '420'],
['stuck', 'coffee']]} )
c = df.groupby('category')['problem_definition'].agg('sum').reset_index()
c['problem_definition'] = c['problem_definition'].apply(lambda x: Counter(x).items())
lst_col = 'problem_definition'
s = pd.DataFrame({
col:np.repeat(c[col].values, c[lst_col].str.len())
for col in c.columns.drop(lst_col)}
).assign(**{'text':np.concatenate(c[lst_col].apply(lambda x: [k for (k,v) in x]))}
).assign(**{'count':np.concatenate(c[lst_col].apply(lambda x: [v for (k,v) in x]))} )
s
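Run on the two sample rows above, s should come out roughly as below (row order follows the Counter's insertion order), matching the case 2 rows for category 2521:
   category      text  count
0      2521    coffee      2
1      2521     maker      1
2      2521   brewing      1
3      2521  properly      1
4      2521         2      1
5      2521       420      3
6      2521     stuck      1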