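Replace a messy str with the clean str from another dataframe

Time: 2019-05-23 04:53:36

Tags: python string pandas contains

I have two dataframes. Where a value in df1['Fruits'] contains a string from df2['Fruits'], I want to replace that messy value with the clean string.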

df1
Name    Fruits
--------------
Dina    Pineapple, [Y*]
Maria   PTC*, Apple
Johny   Durian, 1-6
Johny   5,6 Rambutan
Maria   Apple (Red), [Y] *
Dina    [Y] *, Peach88
Dina    Kiwi/Qiwi, PS*

df2
Fruits      tag
-------------
Apple       20
Pineapple   30
Rambutan    40
Durian      50
Apple (Red) 25
Peach88     55
Kiwi/Qiwi   25

I tried

df1.loc[df1['Fruits'].contains(df2['Fruits']),'Fruits'] = df2['Fruits']

but it shows

'Series' object has no attribute 'contains'

So what I expect to get is

df1
Name    Fruits
--------------
Dina    Pineapple
Maria   Apple
Johny   Durian
Johny   Rambutan
Maria   Apple (Red)
Dina    Peach88
Dina    Kiwi/Qiwi

1 answer:

Answer 0: (score: 2)

Use pandas.Series.str.extract

# Build a regex alternation, wrapped in one capture group, from df2['Fruits']
reg = '(%s)' % '|'.join(df2['Fruits'])
df1['Fruits'] = df1['Fruits'].str.extract(reg)

Output:

    Name     Fruits
0   Dina  Pineapple
1  Maria      Apple
2  Johny     Durian
3  Johny   Rambutan
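
The output above covers only the first four rows. With the full df2 from the question, two things can go wrong: Apple (Red) contains parentheses, which the regex engine reads as an extra group, and Apple is a prefix of Apple (Red), so the shorter name could win. A minimal sketch of one way to handle both, assuming it is acceptable to escape every name with re.escape and to try longer names first (the variables names and pattern are illustrative, not part of the original answer):

```python
import re
import pandas as pd

df1 = pd.DataFrame({
    "Name": ["Dina", "Maria", "Johny", "Johny", "Maria", "Dina", "Dina"],
    "Fruits": [
        "Pineapple, [Y*]", "PTC*, Apple", "Durian, 1-6", "5,6 Rambutan",
        "Apple (Red), [Y] *", "[Y] *, Peach88", "Kiwi/Qiwi, PS*",
    ],
})
df2 = pd.DataFrame({
    "Fruits": ["Apple", "Pineapple", "Rambutan", "Durian",
               "Apple (Red)", "Peach88", "Kiwi/Qiwi"],
    "tag": [20, 30, 40, 50, 25, 55, 25],
})

# Longest names first, so "Apple (Red)" is tried before "Apple".
names = sorted(df2["Fruits"], key=len, reverse=True)

# re.escape makes "(", ")" and similar characters match literally.
pattern = "(" + "|".join(re.escape(name) for name in names) + ")"

# expand=False returns a Series when the pattern has one capture group.
df1["Fruits"] = df1["Fruits"].str.extract(pattern, expand=False)
print(df1)
```

Rows that match none of the names would come back as NaN, so it may be worth checking for missing values afterwards.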

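Explanation of '(%s)' % '|'.join(df2['Fruits']):

  • '|'.join(df2['Fruits']): joins the fruit names with |, the regex OR operator. For the first four rows of df2 it returns Apple|Pineapple|Rambutan|Durian
  • '(%s)' % ...: this is string formatting, equivalent to:
    • str.format: '({})'.format('|'.join(df2['Fruits']))
    • or plain concatenation (more explicit, but less pythonic): '(' + '|'.join(df2['Fruits']) + ')'
    • All of these return (Apple|Pineapple|Rambutan|Durian), a capture group, which pd.Series.str.extract requires in order to know what to extract.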
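
The equivalence of the three spellings, and the need for the capture group, can be checked with a short sketch. It assumes a four-item Series standing in for the first four rows of df2['Fruits']; the names fruits, messy and the by_* variables are made up for illustration:

```python
import pandas as pd

# Stand-in for the first four rows of df2['Fruits'].
fruits = pd.Series(["Apple", "Pineapple", "Rambutan", "Durian"])

joined = "|".join(fruits)           # alternation without a group
by_percent = "(%s)" % joined        # %-style formatting
by_format = "({})".format(joined)   # str.format
by_concat = "(" + joined + ")"      # plain concatenation

print(by_percent)
print(by_percent == by_format == by_concat)

messy = pd.Series(["PTC*, Apple", "5,6 Rambutan"])
extracted = messy.str.extract(by_percent, expand=False)
print(extracted.tolist())

# Without the parentheses there is no capture group, and extract refuses.
try:
    messy.str.extract(joined)
    raised = False
except ValueError:
    raised = True
print(raised)
```

The last step shows why the parentheses matter: str.extract returns the contents of capture groups, so a pattern without any group gives it nothing to return.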