按年份和名称,我希望计算从Excel导入的数据框中单词的出现次数,结果也会导出到Excel。
这是示例代码:
source = pd.DataFrame({'Name' : ['John', 'Mike', 'John','John'],
'Year' : ['1999', '2000', '2000','2000'],
'Message' : [
'I Love You','Will Remember You','Love','I Love You]})
数据框中的以下结果如下。有什么想法吗?
Year Name Message Count
1999 John I 1
1999 John love 1
1999 John you 1
2000 Mike Will 1
2000 Mike Remember 1
2000 Mike You 1
2000 John Love 2
2000 John I 1
2000 John You 1
答案 0 :(得分:2)
我认为您可以先split
列Message
,创建Serie
并将其添加到原始source
。最后groupby
与size
:
#split column Message to new df, create Serie by stack
s = (source.Message.str.split(expand=True).stack())
#remove multiindex
s.index = s.index.droplevel(-1)
s.name= 'Message'
print(s)
0 I
0 Love
0 You
1 Will
1 Remember
1 You
2 Love
3 I
3 Love
3 You
Name: Message, dtype: object
#remove old column Message
source = source.drop(['Message'], axis=1)
#join Serie s to df source
df = (source.join(s))
#aggregate size
print (df.groupby(['Year', 'Name', 'Message']).size().reset_index(name='count'))
Year Name Message count
0 1999 John I 1
1 1999 John Love 1
2 1999 John You 1
3 2000 John I 1
4 2000 John Love 2
5 2000 John You 1
6 2000 Mike Remember 1
7 2000 Mike Will 1
8 2000 Mike You 1