How to flag rows of a pandas DataFrame that match another DataFrame on multiple columns?

Time: 2021-04-05 07:26:51

Tags: python pandas

I have two pandas DataFrames as follows:

df1

Type      season    name        qty
Fruit     summer    Mango        12
Fruit     summer    watermelon   23
Fruit     summer    blueberries  200
vegetable summer    Peppers      24


df2

Availability       season          name      city
  Yes              summer          Mango     Pune
  Yes              summer          Peppers   Mumbai
  Yes              summer          Tomatoes  Mumbai    

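I want to compare the season and name columns of df2 against df1 and add an extra column named status to df1 (1 where a row matches, 0 where it does not). In this case the result should look like this: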

df1
Type       season    name        qty   status
Fruit      summer    Mango        12     1
Fruit      summer    watermelon   23     0
Fruit      summer    blueberries  200    0
vegetable  summer    Peppers      24     1
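
To make the example reproducible, the two tables above could be built roughly as follows. This is only a sketch: the values are copied from the question, and the Availability values are written uniformly as 'Yes', which is an assumption about the intended spelling.

```python
import pandas as pd

# Data copied from the tables in the question.
df1 = pd.DataFrame({
    "Type": ["Fruit", "Fruit", "Fruit", "vegetable"],
    "season": ["summer"] * 4,
    "name": ["Mango", "watermelon", "blueberries", "Peppers"],
    "qty": [12, 23, 200, 24],
})

df2 = pd.DataFrame({
    "Availability": ["Yes", "Yes", "Yes"],  # assumed spelling
    "season": ["summer"] * 3,
    "name": ["Mango", "Peppers", "Tomatoes"],
    "city": ["Pune", "Mumbai", "Mumbai"],
})

print(df1)
print(df2)
```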

2 answers:

Answer 0 (score: 3)

Here is another option, using merge with how='left':

df1.merge(
    df2[['season', 'name']].assign(status=1),
    how='left').fillna(0)

Output:

        Type  season         name  qty  status
0      Fruit  summer        Mango   12     1.0
1      Fruit  summer   watermelon   23     0.0
2      Fruit  summer  blueberries  200     0.0
3  vegetable  summer      Peppers   24     1.0
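
Because the left merge leaves NaN in unmatched rows, status comes back as float (1.0/0.0). If integer flags are wanted, as in the question, one possible variant is sketched below. It is not part of the original answer: it names the key columns explicitly, drops duplicate key pairs from df2 so rows of df1 are not multiplied, fills only the status column, and casts it to int. The sample data is rebuilt so the block runs on its own.

```python
import pandas as pd

# Sample data from the question (only the columns needed here for df2).
df1 = pd.DataFrame({
    "Type": ["Fruit", "Fruit", "Fruit", "vegetable"],
    "season": ["summer"] * 4,
    "name": ["Mango", "watermelon", "blueberries", "Peppers"],
    "qty": [12, 23, 200, 24],
})
df2 = pd.DataFrame({
    "season": ["summer"] * 3,
    "name": ["Mango", "Peppers", "Tomatoes"],
    "city": ["Pune", "Mumbai", "Mumbai"],
})

keys = ["season", "name"]

# Unique key pairs from df2, each tagged with status=1.
flags = df2[keys].drop_duplicates().assign(status=1)

# The left merge keeps every row of df1; unmatched rows get NaN in status.
result = df1.merge(flags, on=keys, how="left")

# Fill only the status column, then cast it back to integers.
result["status"] = result["status"].fillna(0).astype(int)

print(result)
```

Filling only status also avoids overwriting genuine missing values in other columns of df1, which a frame-wide fillna(0) would do.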

Answer 1 (score: 0)

You can use .isin as follows:

df1["status"] = list(zip(df1.season, df1.name))
df1["status"] = df1["status"].isin(list(zip(df2.season, df2.name)))

Output:

df1
        Type  season         name  qty  status
0      Fruit  summer        Mango   12    True
1      Fruit  summer   watermelon   23   False
2      Fruit  summer  blueberries  200   False
3  vegetable  summer      Peppers   24    True
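
The result above is boolean. A minimal sketch that yields the 1/0 flags the question asks for casts the mask with astype(int). The temporary Series named pairs1 and the bracket-style column access are choices made for this illustration, not part of the original answer.

```python
import pandas as pd

# Sample data from the question (only the columns needed here for df2).
df1 = pd.DataFrame({
    "Type": ["Fruit", "Fruit", "Fruit", "vegetable"],
    "season": ["summer"] * 4,
    "name": ["Mango", "watermelon", "blueberries", "Peppers"],
    "qty": [12, 23, 200, 24],
})
df2 = pd.DataFrame({
    "season": ["summer"] * 3,
    "name": ["Mango", "Peppers", "Tomatoes"],
})

# Pair up the two key columns row by row as (season, name) tuples.
pairs1 = pd.Series(list(zip(df1["season"], df1["name"])), index=df1.index)
pairs2 = list(zip(df2["season"], df2["name"]))

# True where the whole pair occurs in df2; cast the mask to 1/0.
df1["status"] = pairs1.isin(pairs2).astype(int)

print(df1)
```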

Performance (compared with @perl's answer)

data = {'Type': {0: 'Fruit', 1: 'Fruit', 2: 'Fruit', 3: 'vegetable'},
 'season': {0: 'summer', 1: 'summer', 2: 'summer', 3: 'summer'},
 'name': {0: 'Mango', 1: 'watermelon', 2: 'blueberries', 3: 'Peppers'},
 'qty': {0: 12, 1: 23, 2: 200, 3: 24}}

# @perl's answer
%%timeit
df1 = pd.DataFrame(data)
df1.merge(
    df2[['season', 'name']].assign(status=1),
    how='left').fillna(0)

# 5.44 ms ± 56.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

# my answer
%%timeit
df1["status"] = list(zip(df1.season, df1.name))
df1["status"].isin(list(zip(df2.season, df2.name)))

# 434 µs ± 4.96 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Old (incorrect) answer

You can use .isin together with .to_dict:

cols = ['season', 'name']
df1['status'] = df1[cols].isin(df2[cols].to_dict('list')).all(1).astype('int')

Output:

df1
        Type  season         name  qty  status
0      Fruit  summer        Mango   12       1
1      Fruit  summer   watermelon   23       0
2      Fruit  summer  blueberries  200       0
3  vegetable  summer      Peppers   24       1
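
The output happens to be correct for the sample data, but passing a dict to isin tests each column against its own list of values, so a row is flagged whenever its season and its name both occur somewhere in df2, even if never in the same row. The sketch below uses small made-up frames (not the data from the question) to show the difference from a pair-wise check.

```python
import pandas as pd

# Hypothetical data chosen to expose the problem.
df1 = pd.DataFrame({
    "season": ["summer", "winter"],
    "name": ["Mango", "Mango"],
})
df2 = pd.DataFrame({
    "season": ["summer", "winter"],
    "name": ["Mango", "Peppers"],
})

cols = ["season", "name"]

# Column-wise check: each column is tested against its own list of values.
columnwise = df1[cols].isin(df2[cols].to_dict("list")).all(axis=1).astype(int)

# Pair-wise check: the (season, name) combination itself must occur in df2.
pairs = pd.Series(list(zip(df1["season"], df1["name"])), index=df1.index)
pairwise = pairs.isin(list(zip(df2["season"], df2["name"]))).astype(int)

print(columnwise.tolist())  # [1, 1]
print(pairwise.tolist())    # [1, 0]
```

The row (winter, Mango) is flagged by the column-wise check although df2 contains no such pair.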