I have a simple DataFrame like this:
   p     b
0  a   buy
1  b   buy
2  a  sell
3  b  sell
and a lookup table like this:
   p     b    v
0  a   buy  123
1  a  sell  456
2  a     *  888
3  b     *  789
How can I merge (join) the two DataFrames while respecting the "wildcard" in column b, so that the expected result is:
   p     b    v
0  a   buy  123
1  b   buy  789
2  a  sell  456
3  b  sell  789
The best I could come up with is this, but it is ugly and verbose:
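import pandas as pd

data = pd.DataFrame([
    ['a', 'buy'],
    ['b', 'buy'],
    ['a', 'sell'],
    ['b', 'sell'],
], columns=['p', 'b'])
lookup = pd.DataFrame([
    ['a', 'buy', 123],
    ['a', 'sell', 456],
    ['a', '*', 888],
    ['b', '*', 789],
], columns=['p', 'b', 'v'])
# Keep the original row labels in an 'index' column so they survive the merges.
x = data.reset_index()
# First pass: exact match on both p and b.
y1 = pd.merge(x, lookup, on=['p', 'b'], how='left').set_index('index')
# Second pass: for rows with no exact match, match on p alone (picks up the '*' row here).
y2 = pd.merge(x[y1['v'].isnull()], lookup, on=['p'], how='left').set_index('index')
data['v'] = y1['v'].fillna(y2['v'])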
Is there a smarter way?
Answer 0 (score: 5)
I think it is cleaner to pull out the wildcards first:
In [11]: wildcards = lookup[lookup["b"] == "*"]
In [12]: wildcards.pop("b") # ditch the * column, it'll confuse the later merge
Now you can combine the two merges (no set_index needed) using update:
In [13]: res = data.merge(lookup, how="left")
In [14]: res
Out[14]:
   p     b      v
0  a   buy  123.0
1  b   buy    NaN
2  a  sell  456.0
3  b  sell    NaN
In [15]: res.update(data.merge(wildcards, how="left"), overwrite=False)
In [16]: res
Out[16]:
   p     b      v
0  a   buy  123.0
1  b   buy  789.0
2  a  sell  456.0
3  b  sell  789.0
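For reference, the two-step merge above can be wrapped into one self-contained function. This is only a sketch: `merge_with_wildcard` is a hypothetical helper name, it uses `fillna` in place of `update`, and it assumes the lookup holds at most one wildcard row per value of p.

```python
import pandas as pd

data = pd.DataFrame(
    [["a", "buy"], ["b", "buy"], ["a", "sell"], ["b", "sell"]],
    columns=["p", "b"],
)
lookup = pd.DataFrame(
    [["a", "buy", 123], ["a", "sell", 456], ["a", "*", 888], ["b", "*", 789]],
    columns=["p", "b", "v"],
)


def merge_with_wildcard(data, lookup, wildcard="*"):
    """Hypothetical helper: exact match on (p, b), else the wildcard row for p."""
    is_wild = lookup["b"] == wildcard
    exact = lookup[~is_wild]
    # Keep only p and v from the wildcard rows so the fallback merge keys on p alone.
    # Assumption: at most one wildcard row per p, so the merge keeps one row per data row.
    fallback = lookup.loc[is_wild, ["p", "v"]]

    res = data.merge(exact, on=["p", "b"], how="left")
    fill = data.merge(fallback, on="p", how="left")
    # Only rows without an exact match (NaN in v) take the wildcard value.
    res["v"] = res["v"].fillna(fill["v"])
    return res


result = merge_with_wildcard(data, lookup)
print(result)
```

Because the exact merge introduces NaN, the v column comes back as float, just as in the session above.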
Answer 1 (score: 1)
I find this quite intuitive:
def find_lookup(lookup, p, b):
    ps = lookup.p == p
    bs = lookup.b.isin([b, '*'])
    return lookup.loc[ps & bs].iloc[0].replace('*', b)

data.apply(lambda x: find_lookup(lookup, x.loc['p'], x.loc['b']), axis=1)

   p     b    v
0  a   buy  123
1  b   buy  789
2  a  sell  456
3  b  sell  789
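Note that `.iloc[0]` picks whichever matching row comes first, so the function above relies on exact rows appearing before the wildcard row for the same p. A possible order-independent variant is sketched below; `find_value` is a hypothetical name, and the lookup is deliberately listed with a wildcard row first to show that ordering does not matter here.

```python
import pandas as pd

data = pd.DataFrame(
    [["a", "buy"], ["b", "buy"], ["a", "sell"], ["b", "sell"]],
    columns=["p", "b"],
)
# Wildcard row for 'a' placed first on purpose.
lookup = pd.DataFrame(
    [["a", "*", 888], ["a", "buy", 123], ["a", "sell", 456], ["b", "*", 789]],
    columns=["p", "b", "v"],
)


def find_value(lookup, p, b, wildcard="*"):
    """Hypothetical variant: v for an exact (p, b) match, else the wildcard v for p."""
    same_p = lookup[lookup["p"] == p]
    exact = same_p.loc[same_p["b"] == b, "v"]
    if not exact.empty:
        return exact.iloc[0]
    wild = same_p.loc[same_p["b"] == wildcard, "v"]
    # None when there is neither an exact nor a wildcard row for this p.
    return wild.iloc[0] if not wild.empty else None


values = [find_value(lookup, p, b) for p, b in zip(data["p"], data["b"])]
result = data.assign(v=values)
print(result)
```

Like the original, this scans the lookup once per row, so it is likely to be slower than a merge-based approach on large inputs.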
Answer 2 (score: 1)
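I found another solution, inspired by some of the ideas above (many thanks!). It is tidier than my first attempt, so I'll post it here, although I'm sure there is still room for improvement. This solution assumes the lookup is sorted so that, for each p, the wildcard rows come after the exact rows: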
x = data.reset_index().merge(lookup, on=['p'], suffixes=["", "_y"])
x = x[(x['b'] == x['b_y']) | (x['b_y'] == '*')]
x = x.groupby('index').first() # see note about sorting lookup!
x[['p', 'b', 'v']]

       p     b    v
index
0      a   buy  123
1      b   buy  789
2      a  sell  456
3      b  sell  789
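If the lookup cannot be guaranteed to be sorted, one option is to sort the merged candidates explicitly so that exact matches precede wildcard matches before taking the first row per group. The following is a sketch along those lines; the `is_wild` column is a helper introduced here for illustration, and the lookup is deliberately listed with a wildcard row first.

```python
import pandas as pd

data = pd.DataFrame(
    [["a", "buy"], ["b", "buy"], ["a", "sell"], ["b", "sell"]],
    columns=["p", "b"],
)
# Wildcard row for 'a' placed first on purpose.
lookup = pd.DataFrame(
    [["a", "*", 888], ["a", "buy", 123], ["a", "sell", 456], ["b", "*", 789]],
    columns=["p", "b", "v"],
)

# Pair every data row with every lookup row that shares the same p.
x = data.reset_index().merge(lookup, on="p", suffixes=("", "_y"))
# Keep only candidates whose b matches exactly or is the wildcard.
x = x[(x["b"] == x["b_y"]) | (x["b_y"] == "*")]
# Within each original row, put exact matches (is_wild == False) before wildcards.
x = x.assign(is_wild=x["b_y"] == "*").sort_values(["index", "is_wild"], kind="stable")
result = x.groupby("index").first()[["p", "b", "v"]]
print(result)
```

Sorting on the boolean flag works because False orders before True, so the exact match, when present, is the first row in each group.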