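I have two DataFrames with different numbers of rows and columns. I want to compare them and add new columns to df2 based on whether its values appear in df1. First, an example (I think you can copy/paste the text into a .csv file to import it). df1 looks like this: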
subject block target dist1 dist2 dist3
7 1 doorlock candleholder01 jar03 stroller
7 2 glassescase clownfish kangaroo ram
7 3 badger chocolatefonduedish hosenozzle toycar04
7 4 hyena crocodile pig toad
7 1 scooter cormorant lizard rockbass
df2 looks like this:
subject image
7 acorn
7 chainsaw
7 doorlock
7 stroller
7 bathtub
7 clownfish
7 bagtie
7 birdie
7 witchhat
7 crocodile
7 honeybee
7 electricitymeter
7 flowerwreath
7 jar03
7 camera02a
What I want to end up with is this:
subject image present type block
7 acorn 0 NA NA
7 chainsaw 0 NA NA
7 doorlock 1 target 1
7 stroller 1 dist3 1
7 bathtub 0 NA NA
7 clownfish 1 dist1 2
7 bagtie 0 NA NA
7 birdie 0 NA NA
7 witchhat 0 NA NA
7 crocodile 1 dist1 4
7 honeybee 0 NA NA
7 electricitymeter 0 NA NA
7 flowerwreath 0 NA NA
7 jar03 1 dist2 1
7 camera02a 0 NA NA
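Specifically, I want to determine which values in the 'image' column of df2 appear in any of the 4 columns of df1 ('target', 'dist1', 'dist2', 'dist3'), and then (1) generate a column in df2 (boolean or 0/1) indicating whether the value is present in df1, (2) generate a second column in df2 holding the name of the df1 column the item was found in (i.e. 'target', 'dist1', ...), and finally (3) generate a column in df2 containing the 'block' value of the df1 row the item came from, if any.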
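I hope this is clear. I would also like some ideas on how to handle the non-matching cases: should I code them as NaN, or just enter empty strings? The catch is that I will probably use groupby() later, and I have had some trouble with groupby() when the DataFrame contains missing values.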
Answer 0: (score: 2)
You can do this by using melt on df1, followed by merge.
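import pandas as pd
# Reshape df1 to long form: one row per (subject, block, type, image)
df1 = df1.melt(id_vars=['subject', 'block'], var_name='type', value_name='image')
df2['present'] = df2['image'].isin(df1['image']).astype(int)
# Left merge keeps every df2 row; unmatched images get NaN for type and block
pd.merge(df2, df1[['image', 'type', 'block']], on='image', how='left')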
subject image present type block
0 7 acorn 0 NaN NaN
1 7 chainsaw 0 NaN NaN
2 7 doorlock 1 target 1.0
3 7 stroller 1 dist3 1.0
4 7 bathtub 0 NaN NaN
5 7 clownfish 1 dist1 2.0
6 7 bagtie 0 NaN NaN
7 7 birdie 0 NaN NaN
8 7 witchhat 0 NaN NaN
9 7 crocodile 1 dist1 4.0
10 7 honeybee 0 NaN NaN
11 7 electricitymeter 0 NaN NaN
12 7 flowerwreath 0 NaN NaN
13 7 jar03 1 dist2 1.0
14 7 camera02a 0 NaN NaN
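For anyone who wants to reproduce this end to end, here is a minimal self-contained sketch built from the sample data in the question. It differs from the snippet above in a few ways, all of which are my own assumptions rather than part of the original answer: the melted frame gets its own name (`long1`) so df1 is not overwritten, the merge also matches on 'subject' (assuming an image should only count as present for the same subject), 'present' is derived from the merge indicator instead of `isin`, and 'block' is cast to the nullable `Int64` dtype so it does not turn into floats.

```python
import pandas as pd

# Sample data copied from the question
df1 = pd.DataFrame({
    "subject": [7, 7, 7, 7, 7],
    "block": [1, 2, 3, 4, 1],
    "target": ["doorlock", "glassescase", "badger", "hyena", "scooter"],
    "dist1": ["candleholder01", "clownfish", "chocolatefonduedish",
              "crocodile", "cormorant"],
    "dist2": ["jar03", "kangaroo", "hosenozzle", "pig", "lizard"],
    "dist3": ["stroller", "ram", "toycar04", "toad", "rockbass"],
})
df2 = pd.DataFrame({
    "subject": [7] * 15,
    "image": ["acorn", "chainsaw", "doorlock", "stroller", "bathtub",
              "clownfish", "bagtie", "birdie", "witchhat", "crocodile",
              "honeybee", "electricitymeter", "flowerwreath", "jar03",
              "camera02a"],
})

# Long form under a new (hypothetical) name, so df1 itself stays intact
long1 = df1.melt(id_vars=["subject", "block"],
                 var_name="type", value_name="image")

# indicator=True adds a "_merge" column: "both" for matches,
# "left_only" for df2 rows with no counterpart in df1
result = df2.merge(long1, on=["subject", "image"],
                   how="left", indicator=True)
result["present"] = (result["_merge"] == "both").astype(int)
result = result.drop(columns="_merge")
result = result[["subject", "image", "present", "type", "block"]]

# Nullable integer dtype: block stays 1, 2, ... with <NA> for non-matches
result["block"] = result["block"].astype("Int64")

print(result)
```

One caveat: if an image occurs more than once in df1 (in several blocks or columns), the left merge returns one row per match, so the result can have more rows than df2.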
For the missing values, I would leave them as NaN. pandas is very good at handling missing data, so it is best to take advantage of that.
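On the groupby() worry from the question: by default, groupby() drops rows whose group key is NaN, which may be the trouble described there. A small sketch of the difference, assuming pandas 1.1 or later (where the `dropna` parameter of `groupby` is available); the frame `res` is a hypothetical miniature of the merged result, not the real data.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the merged result
res = pd.DataFrame({
    "image": ["acorn", "doorlock", "stroller", "clownfish"],
    "present": [0, 1, 1, 1],
    "type": [np.nan, "target", "dist3", "dist1"],
    "block": [np.nan, 1.0, 1.0, 2.0],
})

# Default behaviour: rows with a NaN key are left out of every group
default_counts = res.groupby("block")["image"].count()

# dropna=False keeps NaN as a group of its own
with_nan_counts = res.groupby("block", dropna=False)["image"].count()

print(default_counts)
print(with_nan_counts)
```

If the non-matching rows should be counted separately, `dropna=False` avoids having to replace NaN with empty strings or another sentinel value.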