How do I remove words from one column that are contained in another column?

Date: 2020-06-14 18:06:58

Tags: python pandas

  Audience              Ad
  Audience1     Audience4.Ad1.image
  Audience2     Audience1.Ad4.image
  Audience3     Audience7.Ad1.image
  Audience4     Audience2.Ad3.image
  Audience5     Audience9.Ad1.image
  Audience6     Audience4.Ad2.image
  Audience7     Audience5.Ad1.image
  Audience8     Audience7.Ad3.image
  Audience9     Audience8.Ad1.image
  Audience10    Audience9.Ad1.image

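Here is some sample data. What I want to do is look in the Ad column and, if it contains a value from the Audience column, replace that value with nothing. The hardest part for me is that the left side may say Audience1 while the right side says Audience2, so the two columns do not match row for row. If they did, I would know how to do this, but unfortunately they don't!

So the expected result would look like this: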

  Audience      Ad
  Audience1     Ad1.image
  Audience2     Ad4.image
  Audience3     Ad1.image  
  Audience4     Ad3.image
  Audience5     Ad1.image
  Audience6     Ad2.image
  Audience7     Ad1.image
  Audience8     Ad3.image
  Audience9     Ad1.image
  Audience10    Ad1.image

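The approach I had in mind is to go through the Audience column with a for loop and, whenever the Ad column contains any element of the Audience column, remove it.

Here is my attempt. I don't know what to put in the return statement (assuming the rest of the logic is correct, of course):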

def replace(text):
    for i in df['Audience']:
        if i in text:
            return ???
df['Ad'] = df['Ad'].apply(replace)

Any help would be greatly appreciated!

3 Answers:

Answer 0 (score: 2)

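  • Convert Audience to a set so that there are no duplicate values.
  • str.split the Ad column.
  • Remove the terms found in aud from each Ad list with a list comprehension, then str.join the remaining terms.

    • [y for y in x if y not in aud] is a list comprehension.
      • Each row is converted to a list with .split. The comprehension iterates over each value and checks whether it is in aud; if it is, the value is left out of the new list.
      • '.'.join() creates a string from the elements of the list.
  • Timings on a sample dataset of 10e6 rows (df = pd.concat([pd.DataFrame(data)]*1000000)):

    • This answer: Wall time: 16.9 s
    • Shubham Sharma's answer: Wall time: 27.7 s
    • Ch3steR's answer: Wall time: 15.7 s
      • This timing varies with the number of unique words in df['Audience'], because those words are joined into a single string.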
import pandas as pd

# data and dataframe
data = {'Audience': ['Audience1', 'Audience2', 'Audience3', 'Audience4', 'Audience5', 'Audience6', 'Audience7', 'Audience8', 'Audience9', 'Audience10'],
        'Ad': ['Audience4.Ad1.image', 'Audience1.Ad4.image', 'Audience7.Ad1.image', 'Audience2.Ad3.image', 'Audience9.Ad1.image', 'Audience4.Ad2.image', 'Audience5.Ad1.image', 'Audience7.Ad3.image', 'Audience8.Ad1.image', 'Audience9.Ad1.image']}

df = pd.DataFrame(data)

# create a set of unique (lower-cased) words from Audience
aud = set(df.Audience.str.lower())

# remove Audience words from Ad column
df.Ad = df.Ad.str.split('.').apply(lambda x: '.'.join([y for y in x if y.lower() not in aud]))

|    | Audience   | Ad        |
|---:|:-----------|:----------|
|  0 | Audience1  | Ad1.image |
|  1 | Audience2  | Ad4.image |
|  2 | Audience3  | Ad1.image |
|  3 | Audience4  | Ad3.image |
|  4 | Audience5  | Ad1.image |
|  5 | Audience6  | Ad2.image |
|  6 | Audience7  | Ad1.image |
|  7 | Audience8  | Ad3.image |
|  8 | Audience9  | Ad1.image |
|  9 | Audience10 | Ad1.image |
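
The same set-based filtering can also be written in the shape of the `replace` function from the question, which shows one possible form of the missing `return`. This is only a sketch: the helper name `remove_audience` and the reduced sample data are assumptions made for illustration, and unlike the code above it compares tokens case-sensitively. Comparing whole dot-separated tokens, rather than substrings, keeps `Audience1` from matching inside `Audience10`.

```python
import pandas as pd

# reduced sample data (an assumption for illustration, not the full table above)
data = {'Audience': ['Audience1', 'Audience2', 'Audience10'],
        'Ad': ['Audience2.Ad1.image', 'Audience10.Ad4.image', 'Audience7.Ad3.image']}
df = pd.DataFrame(data)

# set of audience names for fast membership tests
aud = set(df['Audience'])

def remove_audience(text):
    # hypothetical helper: drop every dot-separated token that is an audience name
    return '.'.join(tok for tok in text.split('.') if tok not in aud)

df['Ad'] = df['Ad'].apply(remove_audience)
print(df)
```

`Audience7` is kept in the last row because it does not appear in the Audience column.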

Option 2:

  • Updated to use the new data from the comments
data = {'Audience': ['Football.And.Basketball.Interests', 'Baseball.Interests', 'Cricket.Interests', 'Website.Visitors'],
        'Ad': ['Baseball.Interests.Ad1.image', 'Football.And.Basketball.Interests.Ad4.image', 'Cricket.Interests.Ad1.image', 'Website.Visitors.Ad3.image']}

df = pd.DataFrame(data)

                          Audience                                           Ad
 Football.And.Basketball.Interests                 Baseball.Interests.Ad1.image
                Baseball.Interests  Football.And.Basketball.Interests.Ad4.image
                 Cricket.Interests                  Cricket.Interests.Ad1.image
                  Website.Visitors                   Website.Visitors.Ad3.image

# if Audience contains multiple values
aud = set(df.Audience.str.split('.').explode().str.lower())

# remove Audience words from Ad column
df.Ad = df.Ad.str.split('.').apply(lambda x: '.'.join([y for y in x if y.lower() not in aud]))

                          Audience         Ad
 Football.And.Basketball.Interests  Ad1.image
                Baseball.Interests  Ad4.image
                 Cricket.Interests  Ad1.image
                  Website.Visitors  Ad3.image
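
One thing to keep in mind with Option 2 is that it removes every Ad token that occurs anywhere in any Audience value, so an ad name that happens to contain a word such as `Website` would lose it as well. If the audience name always appears as a leading prefix of the Ad value (an assumption, not something stated in the question), a prefix-based variant might look like the sketch below. The helper `strip_audience_prefix` is hypothetical, and the third Ad value has been altered to show the difference.

```python
import pandas as pd

data = {'Audience': ['Football.And.Basketball.Interests', 'Baseball.Interests',
                     'Cricket.Interests', 'Website.Visitors'],
        # third value altered for illustration: 'Website' is part of the ad name here
        'Ad': ['Baseball.Interests.Ad1.image', 'Football.And.Basketball.Interests.Ad4.image',
               'Cricket.Interests.Website.image', 'Website.Visitors.Ad3.image']}
df = pd.DataFrame(data)

# full audience names plus the trailing dot, longest first so that a longer
# name is tried before a shorter one that starts the same way
prefixes = sorted((a + '.' for a in set(df['Audience'])), key=len, reverse=True)

def strip_audience_prefix(ad):
    # hypothetical helper: remove one full audience name only when it leads the Ad value
    for p in prefixes:
        if ad.startswith(p):
            return ad[len(p):]
    return ad

df['Ad'] = df['Ad'].apply(strip_audience_prefix)
print(df)
```

With this data the third row becomes `Website.image`, whereas the token-based Option 2 would reduce it to `image`.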

Answer 1 (score: 2)

You can use pd.Series.str.replace together with pd.Series.str.contains:

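# rows whose Ad contains one of the Audience values followed by a dot
mask = df['Ad'].str.contains(r'\.|'.join(set(df['Audience'])) + r'\.')
# strip the "AudienceN." part from those rows (regex=True is required on recent pandas)
df.loc[mask,'Ad'] = df.loc[mask,'Ad'].str.replace(r'Audience\d+\.', '', regex=True)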
df
     Audience         Ad
0   Audience1  Ad1.image
1   Audience2  Ad4.image
2   Audience3  Ad1.image
3   Audience4  Ad3.image
4   Audience5  Ad1.image
5   Audience6  Ad2.image
6   Audience7  Ad1.image
7   Audience8  Ad3.image
8   Audience9  Ad1.image
9  Audience10  Ad1.image

Example with a value that does not match:

df
      Audience                     Ad
0    Audience1    Audience4.Ad1.image
1    Audience2    Audience1.Ad4.image
2    Audience3    Audience7.Ad1.image
3    Audience4    Audience2.Ad3.image
4    Audience5    Audience9.Ad1.image
5    Audience6    Audience4.Ad2.image
6    Audience7    Audience5.Ad1.image
7    Audience8    Audience7.Ad3.image
8    Audience9    Audience8.Ad1.image
9   Audience10    Audience9.Ad1.image
10  Audience12  Audience11.Ad11.image

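# rows whose Ad contains one of the Audience values followed by a dot
mask = df['Ad'].str.contains(r'\.|'.join(set(df['Audience'])) + r'\.')
# strip the "AudienceN." part from those rows (regex=True is required on recent pandas)
df.loc[mask,'Ad'] = df.loc[mask,'Ad'].str.replace(r'Audience\d+\.', '', regex=True)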
df

      Audience                     Ad
0    Audience1              Ad1.image
1    Audience2              Ad4.image
2    Audience3              Ad1.image
3    Audience4              Ad3.image
4    Audience5              Ad1.image
5    Audience6              Ad2.image
6    Audience7              Ad1.image
7    Audience8              Ad3.image
8    Audience9              Ad1.image
9   Audience10              Ad1.image
10  Audience12  Audience11.Ad11.image #---> Audience11 not deleted as 'Audience11' is not in `df['Audience']`
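
If the audience names do not all follow the `Audience<number>` shape, the pattern could instead be built from the column values themselves. The following is a sketch under assumptions rather than part of the original answer: it assumes the audience name always sits at the start of the Ad value, uses `re.escape` so that names containing regex metacharacters are matched literally, and works on a reduced sample.

```python
import re
import pandas as pd

# reduced sample data (an assumption for illustration)
df = pd.DataFrame({
    'Audience': ['Audience1', 'Audience2', 'Audience12'],
    'Ad': ['Audience2.Ad1.image', 'Audience1.Ad4.image', 'Audience11.Ad11.image'],
})

# alternation of the literal audience names, longest first
names = sorted(set(df['Audience']), key=len, reverse=True)
# anchored at the start and followed by a literal dot
pattern = r'^(?:' + '|'.join(re.escape(n) for n in names) + r')\.'

df['Ad'] = df['Ad'].str.replace(pattern, '', regex=True)
print(df)
```

Because the pattern only contains names that are present in `df['Audience']`, no separate mask is needed: `Audience11.Ad11.image` is left unchanged.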

Answer 2 (score: 1)

Use the Series.str methods together with Series.isin and Series.where:

s = df['Ad'].str.split('.')
m = s.str[0].isin(df['Audience'])
df['Ad'] = s.where(~m, s.str[1:]).str.join('.')

# print(df)

     Audience         Ad
0   Audience1  Ad1.image
1   Audience2  Ad4.image
2   Audience3  Ad1.image
3   Audience4  Ad3.image
4   Audience5  Ad1.image
5   Audience6  Ad2.image
6   Audience7  Ad1.image
7   Audience8  Ad3.image
8   Audience9  Ad1.image
9  Audience10  Ad1.image
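
A variant of the same idea that avoids list-valued Series is sketched below. It splits only on the first dot with `expand=True`, so the leading token and the remainder land in two string columns. The reduced sample data and the variable names `first` and `rest` are assumptions for illustration, and the sketch assumes every Ad value contains at least one dot.

```python
import pandas as pd

# reduced sample data (an assumption for illustration)
df = pd.DataFrame({
    'Audience': ['Audience1', 'Audience2', 'Audience12'],
    'Ad': ['Audience2.Ad1.image', 'Audience1.Ad4.image', 'Audience11.Ad11.image'],
})

# split once: column 0 is the leading token, column 1 is everything after it
parts = df['Ad'].str.split('.', n=1, expand=True)
first, rest = parts[0], parts[1]

# keep the remainder where the leading token is a known audience,
# otherwise fall back to the original Ad value
df['Ad'] = rest.where(first.isin(df['Audience']), df['Ad'])
print(df)
```

As in the answer above, only the first token is examined, so `Audience11.Ad11.image` stays as it is.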