I have two data frames:
In [6]: df1 = pd.DataFrame({'word':['laugh','smile','frown','cry'],'score':[7,2,-3,-8]}, columns=['word','score'])
df1
Out[6]: word score
0 laugh 7
1 smile 2
2 frown -3
3 cry -8
In [8]: df2 = pd.DataFrame({'word':['frown','laugh','play']})
df2
Out[8]:
word
0 frown
1 laugh
2 play
I know I can merge them to get the score for each word:
In [10]: pd.merge(df1,df2)
Out[10]: word score
0 laugh 7
1 frown -3
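However, I can't quite work out how to:
i) Automatically assign a score of zero to words that have no score. For example, "play" is in df2 but is dropped by the merge, and I would like to keep it in the result. I expect df2 to contain many words without a score, so simply adding those words to df1 with a score of zero is not practical. I would like the merge to return this instead: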
Out[10]: word score
0 laugh 7
1 frown -3
2 play 0
ii) Get the average score of several words. Suppose my data frame looks like this:
In [14]: df3 = pd.DataFrame({'words':['frown cry','laugh smile','play laugh', 'cry laugh play smile']})
df3
Out[14]: words
0 frown cry
1 laugh smile
2 play laugh
3 cry laugh play smile
I would like to cross-reference df3 with df1 to get:
Out[14]: words average_score
0 frown cry -5.5
1 laugh smile 4.5
2 play laugh 3.5
3 cry laugh play smile 0.25
I hope I got the math right! I'm guessing there may be other or better ways to do this in pandas?
Answer 0 (score: 1)
For (i), you just need to specify a right join and fill the null values:
>>> pd.merge(df1, df2, how='right').fillna(0)
word score
0 laugh 7
1 frown -3
2 play 0
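The same idea can be sketched as a left merge that starts from df2. This is only an illustration, not part of the original answer: the name `merged` is made up here, and the cast back to integers assumes every score in df1 is a whole number. Starting from df2 keeps the rows in df2's order, and because the unmatched rows hold NaN the score column becomes float until it is filled and cast.

```python
import pandas as pd

df1 = pd.DataFrame({'word': ['laugh', 'smile', 'frown', 'cry'],
                    'score': [7, 2, -3, -8]})
df2 = pd.DataFrame({'word': ['frown', 'laugh', 'play']})

# A left merge starting from df2 keeps every word in df2, in df2's order.
merged = df2.merge(df1, on='word', how='left')

# Words missing from df1 get NaN, which turns the column into floats;
# fill with 0 and cast back to int (assumes all real scores are integers).
merged['score'] = merged['score'].fillna(0).astype(int)

print(merged)
```

If the scores can be fractional, dropping the `astype(int)` call should be enough.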
For (ii), you can do this:
>>> def grpavg(ws):
... i = df1['word'].isin(ws)
... return df1.loc[i, 'score'].sum() / len(ws)
...
>>> df3['avg-score'] = df3['words'].str.split().map(grpavg)
>>> df3
words avg-score
0 frown cry -5.50
1 laugh smile 4.50
2 play laugh 3.50
3 cry laugh play smile 0.25
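One thing to be aware of: because `isin` only tests membership, a word repeated within a row (say "laugh laugh") is counted once in the sum but twice in the length. If that matters, a per-word lookup is one possible alternative. The sketch below is an assumption-laden variant rather than the approach above: `lookup`, `avg_score`, and the `default` parameter are names invented for illustration, and it assumes each word appears only once in df1.

```python
import pandas as pd

df1 = pd.DataFrame({'word': ['laugh', 'smile', 'frown', 'cry'],
                    'score': [7, 2, -3, -8]})
df3 = pd.DataFrame({'words': ['frown cry', 'laugh smile', 'play laugh',
                              'cry laugh play smile']})

# Plain dict from word to score (assumes words in df1 are unique).
lookup = dict(zip(df1['word'], df1['score']))

def avg_score(words, default=0):
    """Mean score of a list of words; unknown words count as `default`."""
    if not words:
        return float('nan')  # nothing to average for an empty row
    # Each occurrence is looked up separately, so repeats are counted.
    return sum(lookup.get(w, default) for w in words) / len(words)

df3['avg-score'] = df3['words'].str.split().map(avg_score)
print(df3)
```

On the four example rows this gives the same averages as the `isin` version; the two only differ when a row repeats a word.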
Edit: in response to the comment, pass the frame explicitly, then bind it with a lambda or functools.partial:
>>> def grpavg(ws, df):
... i = df['word'].isin(ws)
... return df.loc[i, 'score'].sum() / len(ws)
...
>>> from functools import partial
>>> f = partial(grpavg, df=df1)
>>> df3['avg-score'] = df3['words'].str.split().map(f)
>>> df3
words avg-score
0 frown cry -5.50
1 laugh smile 4.50
2 play laugh 3.50
3 cry laugh play smile 0.25
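For completeness, the lambda form of the same binding might look like the following. It is a self-contained restatement of the two-argument `grpavg` with the example frames rebuilt so it runs on its own; nothing new is introduced beyond the lambda itself.

```python
import pandas as pd

df1 = pd.DataFrame({'word': ['laugh', 'smile', 'frown', 'cry'],
                    'score': [7, 2, -3, -8]})
df3 = pd.DataFrame({'words': ['frown cry', 'laugh smile', 'play laugh',
                              'cry laugh play smile']})

def grpavg(ws, df):
    # Rows of df whose word is in the list ws.
    i = df['word'].isin(ws)
    # Sum of the matched scores divided by the number of words in ws,
    # so words without a score contribute zero.
    return df.loc[i, 'score'].sum() / len(ws)

# The lambda closes over df1 and passes it as the second argument.
df3['avg-score'] = df3['words'].str.split().map(lambda ws: grpavg(ws, df=df1))
print(df3)
```

Either form works; `partial` tends to be handier when the bound function is reused in several places.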