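A situation I sometimes run into is that I have two DataFrames (df1, df2) and I want to create a new DataFrame (df3) based on the intersection of multiple columns between df1 and df2.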
For example, I want to create df3 by filtering df1 on the Campaign and Group columns.
import pandas as pd
df1 = pd.DataFrame({'Campaign':['Campaign 1', 'Campaign 2', 'Campaign 3', 'Campaign 3', 'Campaign 4'], 'Group':['Some group', 'Arbitrary Group', 'Group 1', 'Group 2', 'Done Group'], 'Metric':[245,91,292,373,32]}, columns=['Campaign', 'Group', 'Metric'])
df2 = pd.DataFrame({'Campaign':['Campaign 3', 'Campaign 3'], 'Group':['Group 1', 'Group 2'], 'Metric':[23, 456]}, columns=['Campaign', 'Group', 'Metric'])
df1
     Campaign            Group  Metric
0  Campaign 1       Some group     245
1  Campaign 2  Arbitrary Group      91
2  Campaign 3          Group 1     292
3  Campaign 3          Group 2     373
4  Campaign 4       Done Group      32
df2
     Campaign    Group  Metric
0  Campaign 3  Group 1      23
1  Campaign 3  Group 2     456
I know I can do this by merging:
df3 = df1.merge(df2, how='inner', on=['Campaign', 'Group'], suffixes=('','_del'))
#df3
     Campaign    Group  Metric  Metric_del
0  Campaign 3  Group 1     292          23
1  Campaign 3  Group 2     373         456
But then I have to figure out a way to drop the columns ending in _del. I guess something like this:
# DataFrame.select was removed in pandas 1.0; a boolean mask over the column names does the same job
df3.loc[:, ~df3.columns.str.endswith('_del')]
##The result I'm going for but required merge, then select (2-steps)
     Campaign    Group  Metric
0  Campaign 3  Group 1     292
1  Campaign 3  Group 2     373
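Question
What I'm most interested in is returning df1, filtered only by the Campaign|Group values of df2.
Is there a better way to return df1 without resorting to merge?
Is there a way to merge without returning df2's columns in the result, so that only df1's columns come back?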
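Answer 0: (score: 2)
Assuming your df1 and df2 have exactly the same columns, you can first set the join-key columns as the index, then use df1.reindex(df2.index) followed by .dropna() to produce the intersection.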
df3 = df1.set_index(['Campaign', 'Group'])
df4 = df2.set_index(['Campaign', 'Group'])
# reindex first and dropna will produce the intersection
df3.reindex(df4.index).dropna(how='all').reset_index()
     Campaign    Group  Metric
0  Campaign 3  Group 1     292
1  Campaign 3  Group 2     373
When the keys are not unique, use .isin instead.
# create some duplicated keys and values
# DataFrame.append was removed in pandas 2.0, so concatenate with pd.concat
df3 = pd.concat([df3, df3])
df4 = pd.concat([df4, df4])
# isin
df3[df3.index.isin(df4.index)].reset_index()
     Campaign    Group  Metric
0  Campaign 3  Group 1     292
1  Campaign 3  Group 2     373
2  Campaign 3  Group 1     292
3  Campaign 3  Group 2     373
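The .isin idea can also be applied without setting an index on either frame. The following is a minimal sketch under that assumption, not part of the original answer: it rebuilds the question's data, and the names keys, df1_keys, df2_keys, mask and filtered are introduced here purely for illustration.

```python
import pandas as pd

# Same data as in the question.
df1 = pd.DataFrame({
    'Campaign': ['Campaign 1', 'Campaign 2', 'Campaign 3', 'Campaign 3', 'Campaign 4'],
    'Group': ['Some group', 'Arbitrary Group', 'Group 1', 'Group 2', 'Done Group'],
    'Metric': [245, 91, 292, 373, 32],
})
df2 = pd.DataFrame({
    'Campaign': ['Campaign 3', 'Campaign 3'],
    'Group': ['Group 1', 'Group 2'],
    'Metric': [23, 456],
})

keys = ['Campaign', 'Group']  # illustrative name for the join-key columns

# Build a MultiIndex of (Campaign, Group) pairs for each frame,
# leaving the frames themselves untouched.
df1_keys = pd.MultiIndex.from_frame(df1[keys])
df2_keys = pd.MultiIndex.from_frame(df2[keys])

# Boolean mask: True where a df1 row's key pair also occurs in df2.
mask = df1_keys.isin(df2_keys)

# Rows of df1 whose keys appear in df2; df1's original index labels are kept.
filtered = df1[mask]
print(filtered)
```

Because this is a plain boolean mask, duplicated keys in either frame are not a problem, and the result keeps df1's original row labels. Append .reset_index(drop=True) if a fresh 0..n-1 index is wanted.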
Answer 1: (score: 0)
Alternatively, you can use groupby and filter, as follows:
# Compute the set of values you're interested in.
# In your example, this will be {('Campaign 3', 'Group 1'), ('Campaign 3', 'Group 2')}
interesting_groups = set(df2[['Campaign', 'Group']].apply(tuple, axis=1))
# Filter df1, keeping only values in that set
result = df1.groupby(['Campaign', 'Group']).filter(
    lambda x: x.name in interesting_groups
)
See the filter docs for another example.
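On the question of merging without pulling in df2's columns: one option that neither answer above spells out is to merge against only the key columns of df2. The sketch below assumes the question's data; keys, df2_keys and result are illustrative names, not from the original post.

```python
import pandas as pd

# Same data as in the question.
df1 = pd.DataFrame({
    'Campaign': ['Campaign 1', 'Campaign 2', 'Campaign 3', 'Campaign 3', 'Campaign 4'],
    'Group': ['Some group', 'Arbitrary Group', 'Group 1', 'Group 2', 'Done Group'],
    'Metric': [245, 91, 292, 373, 32],
})
df2 = pd.DataFrame({
    'Campaign': ['Campaign 3', 'Campaign 3'],
    'Group': ['Group 1', 'Group 2'],
    'Metric': [23, 456],
})

keys = ['Campaign', 'Group']  # illustrative name for the join-key columns

# Keep only df2's key columns, deduplicated so that repeated keys in df2
# cannot multiply the matching rows of df1.
df2_keys = df2[keys].drop_duplicates()

# Inner merge on the keys: since df2_keys carries no other columns,
# the result contains only df1's columns and no suffixes are needed.
result = df1.merge(df2_keys, on=keys, how='inner')
print(result)
```

Unlike a boolean mask, merge returns a fresh 0..n-1 index rather than df1's original row labels, which may or may not matter for the use case.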