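Given df1 and df2, I want to obtain df3. The only columns I want to match rows on are Pop and Homes. I have included the Other column so that the solution also works for data with many columns.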
df1
City Pop Homes Other
City_1 100 1 0
City_1 100 2 6
City_1 100 2 2
City_1 100 3 9
City_1 200 1 6
City_1 200 2 6
City_1 200 3 7
City_1 300 1 0
df2
City Pop Homes Other
City_1 100 1 0
City_1 100 2 6
City_1 100 2 2
City_1 100 8 9
City_1 200 1 6
City_1 200 2 6
City_1 800 3 7
City_1 800 8 0
df3
City Pop Homes Other
City_1 100 1 0
City_1 100 2 6
City_1 100 2 2
City_1 200 1 6
City_1 200 2 6
I considered grouping by City, Pop and Homes, e.g. df1.groupby(['City','Pop','Homes']), but then I don't know how to filter on Pop and Homes.
Edit
Here is my code, so that it is easier to help me.
import pandas as pd

df1_string = """City_1 100 1 0
City_1 100 2 6
City_1 100 2 2
City_1 100 3 9
City_1 200 1 6
City_1 200 2 6
City_1 200 3 7
City_1 300 1 0"""
df2_string = """City_1 100 1 0
City_1 100 2 6
City_1 100 2 2
City_1 100 8 9
City_1 200 1 6
City_1 200 2 6
City_1 800 3 7
City_1 800 8 0"""
df1 = pd.DataFrame([x.split() for x in df1_string.split('\n')], columns=['City', 'Pop', 'Homes', 'Other'])
df2 = pd.DataFrame([x.split() for x in df2_string.split('\n')], columns=['City', 'Pop', 'Homes', 'Other'])
df1_keys = [x for x in df1.groupby(['Pop', 'Homes']).groups.keys()]
df2_keys = [x for x in df2.groupby(['Pop', 'Homes']).groups.keys()]
print(df1_keys)
[('100', '1'), ('100', '2'), ('100', '3'), ('200', '1'), ('200', '2'), ('200', '3'), ('300', '1')]
print(df2_keys)
[('100', '1'), ('100', '2'), ('100', '8'), ('200', '1'), ('200', '2'), ('800', '3'), ('800', '8')]
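From here, filtering out the group keys that do not appear in both frames seems like it should be simple, but I can't work it out. I tried: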
df1 = df1[df1.groupby(['Pop', 'Homes']).groups.keys().isin(df2.groupby(['Pop', 'Homes']).groups.keys())]
and, when that didn't work, several variations of it. I feel like I'm close to a working solution.
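The attempt above fails for two reasons: groups.keys() returns a plain dict view, which has no isin method, and a list of group keys is not aligned row by row with df1, so it cannot serve as a boolean mask. One possible way to get a row-aligned mask is to build a MultiIndex from the key columns. This is only a sketch: it assumes a pandas version that provides MultiIndex.from_frame, and it re-enters the question's data as integers rather than parsing the strings.

```python
import pandas as pd

# The question's data, re-entered as integers (an assumption for brevity).
df1 = pd.DataFrame({
    'City': ['City_1'] * 8,
    'Pop': [100, 100, 100, 100, 200, 200, 200, 300],
    'Homes': [1, 2, 2, 3, 1, 2, 3, 1],
    'Other': [0, 6, 2, 9, 6, 6, 7, 0],
})
df2 = pd.DataFrame({
    'City': ['City_1'] * 8,
    'Pop': [100, 100, 100, 100, 200, 200, 800, 800],
    'Homes': [1, 2, 2, 8, 1, 2, 3, 8],
    'Other': [0, 6, 2, 9, 6, 6, 7, 0],
})

keys = ['Pop', 'Homes']

# One (Pop, Homes) tuple per row, so the mask lines up with df1's rows.
df1_keys = pd.MultiIndex.from_frame(df1[keys])
df2_keys = pd.MultiIndex.from_frame(df2[keys])

# True where a row's (Pop, Homes) pair also occurs somewhere in df2.
mask = df1_keys.isin(df2_keys)
df3 = df1[mask]
print(df3)
```

Neither df1 nor df2 is modified here, and every other column is carried along unchanged.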
Solution
df1.set_index(['Pop', 'Homes'], inplace=True)
df2.set_index(['Pop', 'Homes'], inplace=True)
df1 = df1[df1.index.isin(df2.index)]
df1.reset_index(inplace=True)
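Two details of the solution above are easy to trip over: the mask has to be computed from df1's own index (a mask derived from df2's index only has the right length when both frames happen to have the same number of rows), and inplace=True also leaves df2 re-indexed. A minimal non-mutating sketch follows. The helper name keep_matching_rows and the small sample frames are made up for illustration and are not part of the original post.

```python
import pandas as pd

def keep_matching_rows(left, right, keys):
    """Return the rows of `left` whose `keys` values also occur in `right`."""
    # set_index without inplace returns new frames, so the inputs stay untouched.
    left_idx = left.set_index(keys).index
    right_idx = right.set_index(keys).index
    # The mask has one entry per row of `left`, whatever the length of `right`.
    return left[left_idx.isin(right_idx)]

# Hypothetical frames of different lengths.
left = pd.DataFrame({'Pop': [100, 100, 200, 300],
                     'Homes': [1, 2, 1, 1],
                     'Other': [0, 6, 6, 0]})
right = pd.DataFrame({'Pop': [100, 200],
                      'Homes': [1, 1],
                      'Other': [5, 5]})

result = keep_matching_rows(left, right, ['Pop', 'Homes'])
print(result)
```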
Answer 0 (score: 2)
IIUC, if 'City', 'Pop' and 'Homes' are in the index, you can use isin:
df2[df2.index.isin(df1.index)]
Output:
                 Count
City  Pop Homes
City1 100 20       152
          24       184
      200 41       163
          42       163
Answer 1 (score: 0)
Create a MultiIndex for each dataframe and do an inner join to get the intersection.
import pandas as pd
import numpy as np
df1_string = """City_1 100 1 0
City_1 100 2 6
City_1 100 2 2
City_1 100 3 9
City_1 200 1 6
City_1 200 2 6
City_1 200 3 7
City_1 300 1 0"""
df2_string = """City_1 100 1 0
City_1 100 2 6
City_1 100 2 2
City_1 100 8 9
City_1 200 1 6
City_1 200 2 6
City_1 800 3 7
City_1 800 8 0"""
df1 = pd.DataFrame([x.split() for x in df1_string.split('\n')], columns=['City', 'Pop', 'Homes', 'Other'])
df2 = pd.DataFrame([x.split() for x in df2_string.split('\n')], columns=['City', 'Pop', 'Homes', 'Other'])
# Dataframes benefit from having indexes that reflect the structure of the tabular data
df1.set_index(['City', 'Pop', 'Homes'], inplace=True)
df2.set_index(['City', 'Pop', 'Homes'], inplace=True)
# an inner join on the multiindex will provide the intersection of the two
# (a key duplicated in both frames yields one output row per pairing)
result = df1.join(df2, how='inner', on=['City', 'Pop', 'Homes'], lsuffix='_l', rsuffix='_r')
# a join provides all of the joined columns
result.reset_index(inplace=True)
result.drop(['Other_r'], axis=1, inplace=True)
result.columns = ['City', 'Pop', 'Homes', 'Other']
print(result)
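One caveat with joining the two frames directly: a key that is duplicated on both sides is paired with every match, so rows get multiplied. A possible alternative is to merge df1 against only the deduplicated key columns of df2, which acts as a filter on df1. This is a sketch under assumptions: it matches on Pop and Homes only, as the question asks, and it re-enters the data as integers instead of parsing the strings.

```python
import pandas as pd

# The question's data, re-entered as integers (an assumption for brevity).
df1 = pd.DataFrame({
    'City': ['City_1'] * 8,
    'Pop': [100, 100, 100, 100, 200, 200, 200, 300],
    'Homes': [1, 2, 2, 3, 1, 2, 3, 1],
    'Other': [0, 6, 2, 9, 6, 6, 7, 0],
})
df2 = pd.DataFrame({
    'City': ['City_1'] * 8,
    'Pop': [100, 100, 100, 100, 200, 200, 800, 800],
    'Homes': [1, 2, 2, 8, 1, 2, 3, 8],
    'Other': [0, 6, 2, 9, 6, 6, 7, 0],
})

keys = ['Pop', 'Homes']

# Merging on the raw keys pairs up duplicates on both sides,
# so the (100, 2) rows are multiplied.
naive = df1.merge(df2[keys], on=keys, how='inner')

# Deduplicating df2's keys first keeps each df1 row at most once.
df3 = df1.merge(df2[keys].drop_duplicates(), on=keys, how='inner')

print(len(naive), len(df3))
print(df3)
```

Because only the key columns of df2 take part in the merge, there are no suffixed duplicate columns to drop or rename afterwards.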