Replacing duplicate values in one DataFrame based on another DataFrame

Time: 2020-02-16 06:56:17

Tags: python pandas

df1

ITEM      CATEGORY       COLOR

48684      CAR           RED
54519      BIKE          BLACK
14582      CAR           BLACK
45685      JEEP          WHITE
23661      BIKE          BLUE
23226      BIKE          BLUE
54252      BIKE          BLACK

df2

    USERID  WEBBROWSE   ITEM     PURCHASE
1   1541    CHROME      54252    YES
2   3351    EXPLORER    54519    YES
3   2639    MOBILE APP  23661    YES

df2 also has many other columns.

The output I need is:

    USERID  WEBBROWSE   ITEM     PURCHASE
1   1541    CHROME      54519    YES
2   3351    EXPLORER    54519    YES
3   2639    MOBILE APP  23661    YES

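From df1 it is clear that ITEM 54252 and ITEM 54519 are the same item (both are BIKE / BLACK). So, based on df1, the ITEM values in df2 need to be replaced accordingly.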
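To make the question reproducible, the two frames can be rebuilt from the tables above. This is a minimal sketch: the values are copied from the question, the variable name `dupes` is introduced here for illustration, and the "many other columns" of df2 are left out.

```python
import pandas as pd

# Sample data copied from the tables in the question.
df1 = pd.DataFrame({
    "ITEM": [48684, 54519, 14582, 45685, 23661, 23226, 54252],
    "CATEGORY": ["CAR", "BIKE", "CAR", "JEEP", "BIKE", "BIKE", "BIKE"],
    "COLOR": ["RED", "BLACK", "BLACK", "WHITE", "BLUE", "BLUE", "BLACK"],
})
df2 = pd.DataFrame(
    {
        "USERID": [1541, 3351, 2639],
        "WEBBROWSE": ["CHROME", "EXPLORER", "MOBILE APP"],
        "ITEM": [54252, 54519, 23661],
        "PURCHASE": ["YES", "YES", "YES"],
    },
    index=[1, 2, 3],
)

# Rows of df1 that share CATEGORY and COLOR with at least one other row.
dupes = df1[df1.duplicated(["CATEGORY", "COLOR"], keep=False)]
print(dupes)
```

The rows in `dupes` are the ones whose ITEM values are candidates for being collapsed into a single representative value.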

1 answer:

Answer 0: (score: 1)

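I modified the previous solution by adding a new column orig to remember the original ITEM values, then created a Series with DataFrame.set_index and passed it to Series.replace to replace the values in the other DataFrame: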

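# work on a copy of df1 and keep the original ITEM values in a new column
df = df1.assign(orig=df1['ITEM'])
# mark every row whose (CATEGORY, COLOR) pair occurs more than once
m = df.duplicated(['CATEGORY', 'COLOR'], keep=False)
# within each duplicated group, overwrite ITEM with the group's first ITEM
df.loc[m, 'ITEM'] = df[m].groupby(['CATEGORY', 'COLOR'])['ITEM'].transform('first')

# mapping: original ITEM (index) -> representative ITEM (values)
s = df[m].set_index('orig')['ITEM']
print (s)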
orig
54519    54519
23661    23661
23226    23661
54252    54519
Name: ITEM, dtype: int64

df2['ITEM'] = df2['ITEM'].replace(s)
print (df2)
   USERID   WEBBROWSE   ITEM PURCHASE
1    1541      CHROME  54519      YES
2    3351    EXPLORER  54519      YES
3    2639  MOBILE APP  23661      YES
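One design point worth checking is what happens to ITEM values that do not appear in the mapping. The sketch below rebuilds the Series printed above by hand and compares Series.replace with Series.map; the ITEM 99999 and the names `items`, `replaced` and `mapped` are made up for illustration.

```python
import pandas as pd

# Mapping copied from the printed Series: original ITEM -> representative ITEM.
s = pd.Series(
    [54519, 23661, 23661, 54519],
    index=pd.Index([54519, 23661, 23226, 54252], name="orig"),
    name="ITEM",
)

# 99999 is a made-up ITEM that does not occur in the mapping.
items = pd.Series([54252, 23226, 99999], name="ITEM")

replaced = items.replace(s)  # values missing from the mapping are kept as they are
mapped = items.map(s)        # values missing from the mapping become NaN

print(replaced.tolist())
print(mapped.tolist())
```

Since df2 may contain items that are not duplicated in df1, replace is the safer choice here; map would need an extra fillna step to restore the unmatched values.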

An alternative without a new column is to replace using a dictionary:

orig = df1['ITEM'].copy()  # copy, so the in-place update of df1 below does not alter orig
m = df1.duplicated(['CATEGORY', 'COLOR'], keep=False)
df1.loc[m, 'ITEM'] = df1[m].groupby(['CATEGORY', 'COLOR'])['ITEM'].transform('first')
print (df1)
    ITEM CATEGORY  COLOR
0  48684      CAR    RED
1  54519     BIKE  BLACK
2  14582      CAR  BLACK
3  45685     JEEP  WHITE
4  23661     BIKE   BLUE
5  23661     BIKE   BLUE
6  54519     BIKE  BLACK

d = dict(zip(orig[m], df1.loc[m, 'ITEM']))
print (d)
{54519: 54519, 23661: 23661, 23226: 23661, 54252: 54519}

df2['ITEM'] = df2['ITEM'].replace(d)
print (df2)
   USERID   WEBBROWSE   ITEM PURCHASE
1    1541      CHROME  54519      YES
2    3351    EXPLORER  54519      YES
3    2639  MOBILE APP  23661      YES
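The dictionary approach above overwrites df1 in place. If that is not wanted, the same idea can be wrapped in a function that leaves both inputs untouched. This is only a sketch: the function name `canonicalize_items` and its parameter names are hypothetical, and it assumes the lookup frame has ITEM, CATEGORY and COLOR columns as in the question.

```python
import pandas as pd

def canonicalize_items(lookup: pd.DataFrame, target: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of target whose ITEM values are replaced by the first
    ITEM in lookup that shares the same CATEGORY and COLOR."""
    # For every row, the first ITEM of its (CATEGORY, COLOR) group.
    first = lookup.groupby(["CATEGORY", "COLOR"])["ITEM"].transform("first")
    # Keep only the entries that actually change a value.
    mapping = {old: new for old, new in zip(lookup["ITEM"], first) if old != new}
    out = target.copy()
    out["ITEM"] = out["ITEM"].replace(mapping)
    return out

# Sample data copied from the question.
df1 = pd.DataFrame({
    "ITEM": [48684, 54519, 14582, 45685, 23661, 23226, 54252],
    "CATEGORY": ["CAR", "BIKE", "CAR", "JEEP", "BIKE", "BIKE", "BIKE"],
    "COLOR": ["RED", "BLACK", "BLACK", "WHITE", "BLUE", "BLUE", "BLACK"],
})
df2 = pd.DataFrame(
    {
        "USERID": [1541, 3351, 2639],
        "WEBBROWSE": ["CHROME", "EXPLORER", "MOBILE APP"],
        "ITEM": [54252, 54519, 23661],
        "PURCHASE": ["YES", "YES", "YES"],
    },
    index=[1, 2, 3],
)

result = canonicalize_items(df1, df2)
print(result)
```

Because the group-wise transform runs over all rows, no duplicate mask is needed: rows with a unique CATEGORY and COLOR pair map to themselves and are dropped from the dictionary.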