How do I merge two DataFrames with a "wildcard"?

Time: 2016-06-09 17:09:58

Tags: python pandas

I have a simple DataFrame like this:

   p     b
0  a   buy
1  b   buy
2  a  sell
3  b  sell

and a lookup table like this:

   p     b    v
0  a   buy  123
1  a  sell  456
2  a     *  888
3  b     *  789

How can I merge (join) the two DataFrames while respecting the "wildcard" in column b, so that the expected result is:

   p     b    v
0  a   buy  123
1  b   buy  789
2  a  sell  456
3  b  sell  789

The best I could come up with is this, but it is ugly and verbose:

import pandas as pd

data = pd.DataFrame([
        ['a', 'buy'],
        ['b', 'buy'],         
        ['a', 'sell'],
        ['b', 'sell'],              
    ], columns = ['p', 'b'])
lookup = pd.DataFrame([
        ['a', 'buy', 123],
        ['a', 'sell', 456],
        ['a', '*', 888],
        ['b', '*', 789],        
], columns = ['p','b', 'v'])

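# Keep the original row labels in an 'index' column so they survive the merges
x = data.reset_index()
# Pass 1: exact match on both p and b
y1 = pd.merge(x, lookup, on=['p', 'b'], how='left').set_index('index')
# Pass 2: for rows with no exact match, match on p alone
y2 = pd.merge(x[y1['v'].isnull()], lookup, on=['p'], how='left').set_index('index')
data['v'] = y1['v'].fillna(y2['v'])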

Is there a smarter way?

3 answers:

Answer 0: (score: 5)

I think it is cleaner to separate out the wildcards first:

In [11]: wildcards = lookup[lookup["b"] == "*"]

In [12]: wildcards.pop("b")  # ditch the * column, it'll confuse the later merge

Now you can combine the two merges (no set_index needed) using update; df here is the question's data frame:

In [13]: res = df.merge(lookup, how="left")

In [14]: res
Out[14]:
   p     b      v
0  a   buy  123.0
1  b   buy    NaN
2  a  sell  456.0
3  b  sell    NaN

In [15]: res.update(df.merge(wildcards, how="left"), overwrite=False)

In [16]: res
Out[16]:
   p     b      v
0  a   buy  123.0
1  b   buy  789.0
2  a  sell  456.0
3  b  sell  789.0
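
For reference, the same two-step idea can be written as a standalone script. This is only a sketch: it uses the question's data and lookup frames, swaps update for a plain fillna on the v column, and assumes the lookup has at most one row per (p, b) pair and at most one wildcard row per p, so that neither left merge duplicates rows. The names exact, wildcards and fallback are made up for illustration.

```python
import pandas as pd

data = pd.DataFrame(
    [["a", "buy"], ["b", "buy"], ["a", "sell"], ["b", "sell"]],
    columns=["p", "b"],
)
lookup = pd.DataFrame(
    [["a", "buy", 123], ["a", "sell", 456], ["a", "*", 888], ["b", "*", 789]],
    columns=["p", "b", "v"],
)

# Split the lookup into exact rows and wildcard rows (dropping the "*" column).
is_wild = lookup["b"] == "*"
exact = lookup[~is_wild]
wildcards = lookup.loc[is_wild, ["p", "v"]]

# Step 1: exact match on (p, b); unmatched rows get NaN in v.
res = data.merge(exact, on=["p", "b"], how="left")

# Step 2: match on p only, then fill just the gaps left by step 1.
# Assumption: unique lookup keys, so both merges keep data's row order and length.
fallback = data.merge(wildcards, on="p", how="left")
res["v"] = res["v"].fillna(fallback["v"])

print(res)
```

As in the output above, v comes back as float because the intermediate result contains NaN.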

Answer 1: (score: 1)

I find this intuitive:

def find_lookup(lookup, p, b):
    ps = lookup.p == p
    bs = lookup.b.isin([b, '*'])
    return lookup.loc[ps & bs].iloc[0].replace('*', b)

data.apply(lambda x: find_lookup(lookup, x.loc['p'], x.loc['b']), axis=1)

   p     b    v
0  a   buy  123
1  b   buy  789
2  a  sell  456
3  b  sell  789
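
Note that iloc[0] picks whichever matching row comes first in lookup, so the function above returns the exact match only when exact rows precede the wildcard row for the same p. A possible variant that does not depend on row order is sketched below; find_value is a hypothetical helper, and returning NaN when nothing matches is an assumption about the desired behavior.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(
    [["a", "buy"], ["b", "buy"], ["a", "sell"], ["b", "sell"]],
    columns=["p", "b"],
)
# Wildcard rows deliberately placed first to show that order does not matter.
lookup = pd.DataFrame(
    [["a", "*", 888], ["b", "*", 789], ["a", "buy", 123], ["a", "sell", 456]],
    columns=["p", "b", "v"],
)

def find_value(lookup, p, b):
    """Return v for (p, b), preferring an exact match on b over the '*' row."""
    candidates = lookup[lookup["p"] == p]
    exact = candidates.loc[candidates["b"] == b, "v"]
    if not exact.empty:
        return exact.iloc[0]
    wild = candidates.loc[candidates["b"] == "*", "v"]
    # Assumption: a missing match is reported as NaN rather than raising.
    return wild.iloc[0] if not wild.empty else np.nan

result = data.copy()
result["v"] = [find_value(lookup, p, b) for p, b in zip(data["p"], data["b"])]
print(result)
```

Like the original, this scans the lookup once per row, so it is convenient for small tables but slower than a merge on large ones.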

Answer 2: (score: 1)

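I found another solution, inspired by some of the ideas above (many thanks!). It is neater than my first attempt, so I will post it here, although I am sure there is still room for improvement. This solution assumes the lookup is sorted so that the wildcard rows are at the bottom of the table: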

x = data.reset_index().merge(lookup, on=['p'], suffixes=["", "_y"])
x = x[(x['b'] == x['b_y']) | (x['b_y'] == '*')]
x = x.groupby('index').first() # see note about sorting lookup!
x[['p', 'b', 'v']]

       p     b    v
index              
0      a   buy  123
1      b   buy  789
2      a  sell  456
3      b  sell  789
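
If the lookup is not already ordered with wildcards last, one way to enforce that assumption is to sort on a boolean "is wildcard" flag before merging. The sketch below starts from a shuffled copy of the question's lookup; is_wild, lookup_sorted and result are illustrative names, and a stable sort is used so the relative order of the other rows is kept.

```python
import pandas as pd

data = pd.DataFrame(
    [["a", "buy"], ["b", "buy"], ["a", "sell"], ["b", "sell"]],
    columns=["p", "b"],
)
# Same rows as the question's lookup, but with wildcards not at the bottom.
lookup = pd.DataFrame(
    [["a", "*", 888], ["a", "buy", 123], ["b", "*", 789], ["a", "sell", 456]],
    columns=["p", "b", "v"],
)

# False (exact rows) sorts before True (wildcard rows); mergesort is stable.
is_wild = lookup["b"].eq("*")
lookup_sorted = lookup.loc[is_wild.sort_values(kind="mergesort").index]

# Same steps as in the answer, applied to the sorted lookup.
x = data.reset_index().merge(lookup_sorted, on="p", suffixes=("", "_y"))
x = x[(x["b"] == x["b_y"]) | (x["b_y"] == "*")]
result = x.groupby("index").first()[["p", "b", "v"]]
print(result)
```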