Identifying common column values across two pandas DataFrames of different sizes

Time: 2018-08-11 18:26:56

Tags: python-3.x pandas pandas-groupby

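I have two DataFrames with different numbers of rows and columns. I want to compare them and create new columns in df2 based on whether its values are present in df1. An example first (I think you can copy/paste the text into a .csv to import it). df1 looks like this: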

subject block   target  dist1   dist2   dist3
7   1   doorlock    candleholder01  jar03   stroller
7   2   glassescase clownfish   kangaroo    ram
7   3   badger  chocolatefonduedish hosenozzle  toycar04
7   4   hyena   crocodile   pig toad
7   1   scooter cormorant   lizard  rockbass

df2 looks like this:

subject image
7   acorn
7   chainsaw
7   doorlock
7   stroller
7   bathtub
7   clownfish
7   bagtie
7   birdie
7   witchhat
7   crocodile
7   honeybee
7   electricitymeter
7   flowerwreath
7   jar03
7   camera02a

What I want to end up with is this:

subject image   present type    block
7   acorn   0   NA  NA
7   chainsaw    0   NA  NA
7   doorlock    1   target  1
7   stroller    1   dist3   1
7   bathtub 0   NA  NA
7   clownfish   1   dist1   2
7   bagtie  0   NA  NA
7   birdie  0   NA  NA
7   witchhat    0   NA  NA
7   crocodile   1   dist1   4
7   honeybee    0   NA  NA
7   electricitymeter    0   NA  NA
7   flowerwreath    0   NA  NA
7   jar03   1   dist2   1
7   camera02a   0   NA  NA

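Specifically, I want to determine which values in the 'image' column of df2 are present in the 4 columns of df1 ('target', 'dist1', 'dist2', 'dist3'), and then (1) generate a column in df2 (boolean or 0/1) indicating whether the value is present in df1, (2) generate a second column in df2 holding the name of the df1 column the item was found in (i.e. 'target', 'dist1', ...), and finally (3) generate a column in df2 containing the df1 'block' value the item came from, if any.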

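I hope this is clear. I would also like some thoughts on how to handle the non-matching cases: should I code them as NaN or just enter empty strings? The concern is that I will probably need groupby() later on, and I have had some problems with groupby() when the df contains missing values.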

1 answer:

Answer 0: (score: 2)

You can do this by using melt on df1 and then a merge.

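import pandas as pd
# Reshape df1 to long format: one row per (subject, block, type, image).
df1 = df1.melt(id_vars=['subject', 'block'], var_name='type', value_name='image')
df2['present'] = df2['image'].isin(df1['image']).astype(int)
# Left merge keeps every df2 row; unmatched images get NaN in type and block.
pd.merge(df2, df1[['image', 'type', 'block']], on='image', how='left')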

    subject             image   present     type    block
0         7             acorn         0      NaN    NaN
1         7          chainsaw         0      NaN    NaN
2         7          doorlock         1   target    1.0
3         7          stroller         1    dist3    1.0
4         7           bathtub         0      NaN    NaN
5         7         clownfish         1    dist1    2.0
6         7            bagtie         0      NaN    NaN
7         7            birdie         0      NaN    NaN
8         7          witchhat         0      NaN    NaN
9         7         crocodile         1    dist1    4.0
10        7          honeybee         0      NaN    NaN
11        7  electricitymeter         0      NaN    NaN
12        7      flowerwreath         0      NaN    NaN
13        7             jar03         1    dist2    1.0
14        7         camera02a         0      NaN    NaN
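
To reproduce the output above end to end, here is a minimal, self-contained sketch. It assumes the sample tables from the question are parsed as whitespace-separated text. The names `long_df1` and `result` are my own, as is the final cast of `block` to pandas' nullable `Int64` dtype (so matched blocks print as 1, 2, 4 rather than 1.0, 2.0, 4.0); neither is part of the original answer.

```python
import io

import pandas as pd

# Sample data copied from the question, parsed as whitespace-separated text.
df1_text = """subject block target dist1 dist2 dist3
7 1 doorlock candleholder01 jar03 stroller
7 2 glassescase clownfish kangaroo ram
7 3 badger chocolatefonduedish hosenozzle toycar04
7 4 hyena crocodile pig toad
7 1 scooter cormorant lizard rockbass
"""
df2_text = """subject image
7 acorn
7 chainsaw
7 doorlock
7 stroller
7 bathtub
7 clownfish
7 bagtie
7 birdie
7 witchhat
7 crocodile
7 honeybee
7 electricitymeter
7 flowerwreath
7 jar03
7 camera02a
"""
df1 = pd.read_csv(io.StringIO(df1_text), sep=r"\s+")
df2 = pd.read_csv(io.StringIO(df2_text), sep=r"\s+")

# Long format: one row per (subject, block, type, image).
long_df1 = df1.melt(id_vars=["subject", "block"],
                    var_name="type", value_name="image")

# 1 if the df2 image occurs anywhere in df1's four item columns, else 0.
df2["present"] = df2["image"].isin(long_df1["image"]).astype(int)

# Left merge keeps every df2 row; unmatched rows get missing type/block.
result = pd.merge(df2, long_df1[["image", "type", "block"]],
                  on="image", how="left")

# Assumption: use the nullable integer dtype so block stays integer-valued.
result["block"] = result["block"].astype("Int64")
print(result)
```

One thing to keep in mind: if the same image occurred in more than one cell of df1, the left merge would return one row per match, so the result could have more rows than df2. That does not happen with the sample data.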

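As for the missing values, I would leave them as NaN. pandas is very capable at handling missing data, so it is best to take advantage of that.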
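
On the groupby() concern from the question: by default, groupby() silently drops rows whose group key is NaN, which may be the problem you ran into. Since pandas 1.1 you can pass dropna=False to keep them as their own group. A small sketch, using a hypothetical `result` frame shaped like the output above (the five rows are illustrative only):

```python
import numpy as np
import pandas as pd

# Hypothetical subset of the merged output; NaN marks images not found in df1.
result = pd.DataFrame({
    "image": ["acorn", "doorlock", "stroller", "clownfish", "bathtub"],
    "present": [0, 1, 1, 1, 0],
    "type": [np.nan, "target", "dist3", "dist1", np.nan],
})

# Default behaviour: rows with a missing 'type' are left out of every group.
default_counts = result.groupby("type").size()

# dropna=False (pandas >= 1.1) keeps the missing key as a group of its own.
with_missing = result.groupby("type", dropna=False).size()

print(default_counts)
print(with_missing)
```

Grouping on the 'present' column instead sidesteps the issue entirely, since that column never contains missing values.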