Python Pandas: merge or filter a DataFrame by another. Is there a better way?

Time: 2015-08-10 17:23:27

Tags: python pandas merge

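A situation I sometimes run into is that I have two DataFrames (df1, df2) and I want to create a new DataFrame (df3) based on the intersection of multiple columns between df1 and df2.

For example, I want to create df3 by filtering df1 on the Campaign and Group columns.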

import pandas as pd
df1 = pd.DataFrame({'Campaign':['Campaign 1', 'Campaign 2', 'Campaign 3', 'Campaign 3', 'Campaign 4'], 'Group':['Some group', 'Arbitrary Group', 'Group 1', 'Group 2', 'Done Group'], 'Metric':[245,91,292,373,32]}, columns=['Campaign', 'Group', 'Metric'])
df2 = pd.DataFrame({'Campaign':['Campaign 3', 'Campaign 3'], 'Group':['Group 1', 'Group 2'], 'Metric':[23, 456]}, columns=['Campaign', 'Group', 'Metric'])

df1

     Campaign            Group  Metric
0  Campaign 1       Some group     245
1  Campaign 2  Arbitrary Group      91
2  Campaign 3          Group 1     292
3  Campaign 3          Group 2     373
4  Campaign 4       Done Group      32

df2

     Campaign    Group  Metric
0  Campaign 3  Group 1      23
1  Campaign 3  Group 2     456

I know I can do this with a merge:
df3 = df1.merge(df2, how='inner', on=['Campaign', 'Group'], suffixes=('','_del'))
#df3
     Campaign    Group  Metric  Metric_del
0  Campaign 3  Group 1     292          23
1  Campaign 3  Group 2     373         456

But then I have to figure out a way to drop the columns ending in _del. I guess this:

# keep only the columns whose names do not end in '_del'
df3.loc[:, ~df3.columns.str.endswith('_del')]
##The result I'm going for but required merge, then select (2-steps)
     Campaign    Group  Metric
0  Campaign 3  Group 1     292
1  Campaign 3  Group 2     373

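Question

I'm mostly interested in returning df1, just filtered on df2's Campaign|Group values.

  1. Is there a better way to return df1 without resorting to merge?

  2. Is there a way to merge but NOT return df2's columns from the merge, only df1's columns?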

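2 Answers:

Answer 0: (score: 2)

Assuming your df1 and df2 have exactly the same columns, you can first set the join-key columns as the index, then use df1.reindex(df2.index) followed by .dropna() to produce the intersection.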

df3 = df1.set_index(['Campaign', 'Group'])
df4 = df2.set_index(['Campaign', 'Group'])
# reindex first and dropna will produce the intersection
df3.reindex(df4.index).dropna(how='all').reset_index()

     Campaign    Group  Metric
0  Campaign 3  Group 1     292
1  Campaign 3  Group 2     373
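
To see why the dropna step matters, here is a minimal sketch with a hypothetical df2_extra (not part of the question) that contains one key pair missing from df1. It assumes the key pairs in df1 are unique, as they are in the example data.

```python
import pandas as pd

keys = ['Campaign', 'Group']
df1 = pd.DataFrame({
    'Campaign': ['Campaign 1', 'Campaign 2', 'Campaign 3', 'Campaign 3', 'Campaign 4'],
    'Group': ['Some group', 'Arbitrary Group', 'Group 1', 'Group 2', 'Done Group'],
    'Metric': [245, 91, 292, 373, 32],
})
# Hypothetical variant of df2 with one extra key pair that does not occur in df1.
df2_extra = pd.DataFrame({
    'Campaign': ['Campaign 3', 'Campaign 3', 'Campaign 9'],
    'Group': ['Group 1', 'Group 2', 'Group X'],
    'Metric': [23, 456, 7],
})

left = df1.set_index(keys)
right = df2_extra.set_index(keys)

# Keys that are missing from df1 come back as all-NaN rows ...
reindexed = left.reindex(right.index)
# ... and dropna(how='all') removes them, leaving only the intersection.
intersection = reindexed.dropna(how='all').reset_index()
```

Two caveats with this sketch: the NaN introduced by reindex turns the integer Metric column into floats, and a df1 row whose value columns are genuinely all NaN would be dropped as well.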

Edit:

Use .isin when the keys are not unique.

# create some duplicated keys and values
df3 = pd.concat([df3, df3])
df4 = pd.concat([df4, df4])

# isin
df3[df3.index.isin(df4.index)].reset_index()

     Campaign    Group  Metric
0  Campaign 3  Group 1     292
1  Campaign 3  Group 2     373
2  Campaign 3  Group 1     292
3  Campaign 3  Group 2     373
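
The same .isin idea can also be applied as a boolean mask on the original df1, so its row labels and integer dtype are left alone. This is a sketch using the question's data; the names keys, mask and result are mine.

```python
import pandas as pd

keys = ['Campaign', 'Group']
df1 = pd.DataFrame({
    'Campaign': ['Campaign 1', 'Campaign 2', 'Campaign 3', 'Campaign 3', 'Campaign 4'],
    'Group': ['Some group', 'Arbitrary Group', 'Group 1', 'Group 2', 'Done Group'],
    'Metric': [245, 91, 292, 373, 32],
})
df2 = pd.DataFrame({
    'Campaign': ['Campaign 3', 'Campaign 3'],
    'Group': ['Group 1', 'Group 2'],
    'Metric': [23, 456],
})

# Build a MultiIndex of key pairs for each frame and test membership row by row.
mask = df1.set_index(keys).index.isin(df2.set_index(keys).index)

# Index the untouched df1 with the boolean mask; no reset_index is needed.
result = df1[mask]
```

Because df1 itself is never re-indexed, the result keeps the original row labels (2 and 3 here), and duplicate keys on either side are not a problem.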

Answer 1: (score: 0)

Alternatively, you can use groupby and filter, as follows:

# Compute the set of values you're interested in.
# In your example, this will be {('Campaign 3', 'Group 1'), ('Campaign 3', 'Group 2')}
interesting_groups = set(df2[['Campaign', 'Group']].apply(tuple, axis=1))
# Filter df1, keeping only values in that set
result = df1.groupby(['Campaign', 'Group']).filter(
    lambda x: x.name in interesting_groups
)

For another example, see the filter docs.
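
Neither answer addresses the second question directly. One possible approach, offered here as a sketch rather than as either answerer's method, is to merge against only the key columns of df2, so there are no extra columns to suffix or drop afterwards. The names keys and df2_keys are mine.

```python
import pandas as pd

keys = ['Campaign', 'Group']
df1 = pd.DataFrame({
    'Campaign': ['Campaign 1', 'Campaign 2', 'Campaign 3', 'Campaign 3', 'Campaign 4'],
    'Group': ['Some group', 'Arbitrary Group', 'Group 1', 'Group 2', 'Done Group'],
    'Metric': [245, 91, 292, 373, 32],
})
df2 = pd.DataFrame({
    'Campaign': ['Campaign 3', 'Campaign 3'],
    'Group': ['Group 1', 'Group 2'],
    'Metric': [23, 456],
})

# Keep only the key columns of df2; drop_duplicates prevents repeated keys
# in df2 from multiplying the matching rows of df1.
df2_keys = df2[keys].drop_duplicates()

# The inner merge keeps the df1 rows whose key pair appears in df2_keys.
# Only df1's columns survive because df2_keys has nothing besides the keys.
df3 = df1.merge(df2_keys, how='inner', on=keys)
```

Unlike the boolean-mask approach, merge returns a fresh 0..n-1 index rather than df1's original row labels.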