Python Pandas lookup / cross-reference

Time: 2014-09-06 12:44:06

Tags: python-2.7 pandas merge ipython

I have two data frames:

In [6]: df1 = pd.DataFrame({'word':['laugh','smile','frown','cry'],'score':[7,2,-3,-8]}, columns=['word','score'])
        df1

Out[6]:     word    score
        0   laugh   7
        1   smile   2
        2   frown   -3
        3   cry     -8

In [8]: df2 = pd.DataFrame({'word':['frown','laugh','play']})
        df2

Out[8]:
            word
        0   frown
        1   laugh
        2   play

I know I can merge them together to get the score for each word:

In [10]: pd.merge(df1,df2)

Out[10]:    word    score
         0  laugh   7
         1  frown   -3

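However, I can't quite work out how to:

i) Automatically assign a score of zero to words that have no score. "play" is in df2 but is dropped by the merge, and I would like to keep it in the result. I expect df2 to contain many words without a score, so simply adding those words to df1 with a score of zero is not practical. So I would like the merge to produce this instead: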

Out[10]:    word    score
         0  laugh   7
         1  frown   -3
         2  play    0

ii) Get the average score for a group of several words. So, if my data frame looks like this:

In [14]: df3 = pd.DataFrame({'words':['frown cry','laugh smile','play laugh', 'cry laugh play smile']})
         df3

Out[14]:    words
        0   frown cry
        1   laugh smile
        2   play laugh
        3   cry laugh play smile

I would like to cross-reference df3 against df1 to get:

Out[14]:    words                 average_score
        0   frown cry              -5.5
        1   laugh smile            4.5
        2   play laugh             3.5
        3   cry laugh play smile   0.25

Hopefully I did the math right! I imagine there may be other or better ways to do this in Pandas?

1 answer:

Answer 0: (score: 1)

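For (i), pd.merge defaults to an inner join, which is why "play" is dropped. You just need to specify a right join and fill the null values: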

>>> pd.merge(df1, df2, how='right').fillna(0)
    word  score
0  laugh      7
1  frown     -3
2   play      0
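
A variant of the same idea, as a minimal sketch: merging from df2 with how='left' keeps df2's rows in df2's own order, and casting after fillna restores integer scores (the NaN introduced by the join makes the column float). The name `result` is only illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({'word': ['laugh', 'smile', 'frown', 'cry'],
                    'score': [7, 2, -3, -8]}, columns=['word', 'score'])
df2 = pd.DataFrame({'word': ['frown', 'laugh', 'play']})

# Left join from df2: every row of df2 is kept, in df2's order.
# Words that are missing from df1 get NaN in the score column.
result = df2.merge(df1, on='word', how='left')

# Replace the NaN values with 0 and cast the column back to integers.
result['score'] = result['score'].fillna(0).astype(int)
print(result)
```

This assumes each word appears at most once in df1; a duplicated word in df1 would produce duplicated rows in the result.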

For (ii), you could do this:

>>> def grpavg(ws):
...     i = df1['word'].isin(ws)
...     return df1.loc[i, 'score'].sum() / float(len(ws))  # float() avoids integer division on Python 2
... 
>>> df3['avg-score'] = df3['words'].str.split().map(grpavg)
>>> df3
                  words  avg-score
0             frown cry      -5.50
1           laugh smile       4.50
2            play laugh       3.50
3  cry laugh play smile       0.25
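
Note that isin matches each word of df1 at most once, so a phrase that repeats a word (for example "laugh laugh") counts its score once while still dividing by the number of words. If that matters, one possible alternative is a plain dictionary lookup per word. This is a sketch under the assumption that each word appears only once in df1; `scores` and `phrase_avg` are illustrative names, not part of the answer above.

```python
import pandas as pd

df1 = pd.DataFrame({'word': ['laugh', 'smile', 'frown', 'cry'],
                    'score': [7, 2, -3, -8]}, columns=['word', 'score'])
df3 = pd.DataFrame({'words': ['frown cry', 'laugh smile', 'play laugh',
                              'cry laugh play smile']})

# word -> score lookup table (assumes words are unique in df1)
scores = dict(zip(df1['word'], df1['score']))

def phrase_avg(phrase):
    """Average score of the words in a phrase; unknown words count as 0."""
    words = phrase.split()
    if not words:
        return float('nan')  # nothing to average for an empty phrase
    # Look up every occurrence, so repeated words are counted each time.
    return sum(scores.get(w, 0) for w in words) / float(len(words))

df3['avg-score'] = df3['words'].map(phrase_avg)
print(df3)
```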

Edit: in response to the comment, pass the frame explicitly, then bind it using a lambda or functools.partial:

>>> def grpavg(ws, df):
...     i = df['word'].isin(ws)
...     return df.loc[i, 'score'].sum() / float(len(ws))  # float() avoids integer division on Python 2
... 
>>> from functools import partial
>>> f = partial(grpavg, df=df1)
>>> df3['avg-score'] = df3['words'].str.split().map(f)
>>> df3
                  words  avg-score
0             frown cry      -5.50
1           laugh smile       4.50
2            play laugh       3.50
3  cry laugh play smile       0.25
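
The transcript above shows only the functools.partial binding. For comparison, a sketch of the lambda form, written as a standalone script with the same grpavg signature:

```python
import pandas as pd

df1 = pd.DataFrame({'word': ['laugh', 'smile', 'frown', 'cry'],
                    'score': [7, 2, -3, -8]}, columns=['word', 'score'])
df3 = pd.DataFrame({'words': ['frown cry', 'laugh smile', 'play laugh',
                              'cry laugh play smile']})

def grpavg(ws, df):
    # Rows of df whose word occurs in the list ws
    i = df['word'].isin(ws)
    # Words missing from df contribute 0 to the sum but still count in len(ws)
    return df.loc[i, 'score'].sum() / float(len(ws))

# The lambda closes over df1, playing the same role as partial(grpavg, df=df1)
df3['avg-score'] = df3['words'].str.split().map(lambda ws: grpavg(ws, df=df1))
print(df3)
```

Both forms behave the same here; partial may be preferable when the bound function needs to be reused or passed around by name.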