Audience Ad
Audience1 Audience4.Ad1.image
Audience2 Audience1.Ad4.image
Audience3 Audience7.Ad1.image
Audience4 Audience2.Ad3.image
Audience5 Audience9.Ad1.image
Audience6 Audience4.Ad2.image
Audience7 Audience5.Ad1.image
Audience8 Audience7.Ad3.image
Audience9 Audience8.Ad1.image
Audience10 Audience9.Ad1.image
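This is some sample data. What I want to do is look in the "Ad" column and, if it contains anything from the "Audience" column, replace that part with nothing. The hardest part for me is that the left side might say Audience1 while the right side says Audience2, so the two values in a row are not the same. If they were, I would know how to do this, but unfortunately they are not!
So the expected result would look like this: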
Audience Ad
Audience1 Ad1.image
Audience2 Ad4.image
Audience3 Ad1.image
Audience4 Ad3.image
Audience5 Ad1.image
Audience6 Ad2.image
Audience7 Ad1.image
Audience8 Ad3.image
Audience9 Ad1.image
Audience10 Ad1.image
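The approach I had in mind is to go through the Audience column with a for loop and, if I find that the Ad column contains any element of the Audience column, remove it.
This is how I have tried to solve it, but what do I put in the return statement (assuming the rest of the logic is correct, of course) so that everything else stays unchanged?
def replace(text):
    for i in df['Audience']:
        if i in text:
            return ???
df['Ad'] = df['Ad'].apply(replace)
Any help would be much appreciated!
Answer 0: (score: 2)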
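Convert `Audience` to a `set` to ensure there are no duplicate values.
`str.split` the `Ad` column into a list, remove the terms that are in `aud` with a list comprehension, then `str.join` the remaining terms.
`[y for y in x if y not in aud]` is a list comprehension. `.split` converts each string to a list; the comprehension iterates through every value and checks whether it is in `aud`. If it is, it is not included in the new list.
`'.'.join()` creates a string from the elements of the list.
Given a sample dataset of 10e6 rows (`df = pd.concat([pd.DataFrame(data)]*1000000)`), the measured times were:
Wall time: 16.9 s
Wall time: 27.7 s
Wall time: 15.7 s
... varies with the number of unique words in `df[Audience]`, because those words are being joined into a single string.
import pandas as pd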
# data and dataframe
data = {'Audience': ['Audience1', 'Audience2', 'Audience3', 'Audience4', 'Audience5', 'Audience6', 'Audience7', 'Audience8', 'Audience9', 'Audience10'],
'Ad': ['Audience4.Ad1.image', 'Audience1.Ad4.image', 'Audience7.Ad1.image', 'Audience2.Ad3.image', 'Audience9.Ad1.image', 'Audience4.Ad2.image', 'Audience5.Ad1.image', 'Audience7.Ad3.image', 'Audience8.Ad1.image', 'Audience9.Ad1.image']}
df = pd.DataFrame(data)
# create list of unique words from Audience
aud = set(df.Audience.str.lower())
# remove Audience words from Ad column
df.Ad = df.Ad.str.split('.').apply(lambda x: '.'.join([y for y in x if y.lower() not in aud]))
| | Audience | Ad |
|---:|:-----------|:----------|
| 0 | Audience1 | Ad1.image |
| 1 | Audience2 | Ad4.image |
| 2 | Audience3 | Ad1.image |
| 3 | Audience4 | Ad3.image |
| 4 | Audience5 | Ad1.image |
| 5 | Audience6 | Ad2.image |
| 6 | Audience7 | Ad1.image |
| 7 | Audience8 | Ad3.image |
| 8 | Audience9 | Ad1.image |
| 9 | Audience10 | Ad1.image |
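To make the split / filter / join steps concrete, here is a minimal trace on a single value in plain Python. The two-element `aud` set is an illustrative assumption; in the answer it is built from the whole `Audience` column.

```python
# Trace of the split / filter / join steps on one value (no pandas needed).
aud = {"audience1", "audience4"}  # lower-cased audience names (illustrative subset)

text = "Audience4.Ad1.image"
parts = text.split(".")                             # ['Audience4', 'Ad1', 'image']
kept = [p for p in parts if p.lower() not in aud]   # drop parts that are audience names
result = ".".join(kept)                             # glue the survivors back together

print(parts)
print(kept)
print(result)
```

Because the filter looks at every dot-separated part, an audience name would be removed wherever it appears in the string, not only at the start.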
Update with `data` from the comments:
data = {'Audience': ['Football.And.Basketball.Interests', 'Baseball.Interests', 'Cricket.Interests', 'Website.Visitors'],
'Ad': ['Baseball.Interests.Ad1.image', 'Football.And.Basketball.Interests.Ad4.image', 'Cricket.Interests.Ad1.image', 'Website.Visitors.Ad3.image']}
df = pd.DataFrame(data)
Audience Ad
Football.And.Basketball.Interests Baseball.Interests.Ad1.image
Baseball.Interests Football.And.Basketball.Interests.Ad4.image
Cricket.Interests Cricket.Interests.Ad1.image
Website.Visitors Website.Visitors.Ad3.image
# if Audience contains multiple values
aud = set(df.Audience.str.split('.').explode().str.lower())
# remove Audience words from Ad column
df.Ad = df.Ad.str.split('.').apply(lambda x: '.'.join([y for y in x if y.lower() not in aud]))
Audience Ad
Football.And.Basketball.Interests Ad1.image
Baseball.Interests Ad4.image
Cricket.Interests Ad1.image
Website.Visitors Ad3.image
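The question's own `apply`-based idea can also be completed directly. The sketch below is one possible way to fill in the `return`, assuming the audience always appears as a dot-terminated prefix of the ad string; the helper name `strip_audience_prefix` is hypothetical and not part of any answer.

```python
def strip_audience_prefix(text, audiences):
    """Remove a leading '<audience>.' from text if that audience is known."""
    # Try longer names first, so a dotted name such as 'Baseball.Interests'
    # wins over a shorter audience that happens to be its prefix.
    for name in sorted(audiences, key=len, reverse=True):
        prefix = name + "."
        if text.startswith(prefix):
            return text[len(prefix):]
    return text  # no known audience at the start: leave the value unchanged


audiences = ["Audience1", "Audience10", "Baseball.Interests"]
print(strip_audience_prefix("Audience10.Ad1.image", audiences))
print(strip_audience_prefix("Baseball.Interests.Ad1.image", audiences))
print(strip_audience_prefix("Audience11.Ad11.image", audiences))
```

With a DataFrame, the same function could be applied as `df['Ad'].apply(lambda t: strip_audience_prefix(t, audiences))`, with `audiences` computed once from `df['Audience']` beforehand. Unlike the token-based filter, this only strips whole audience names, so a word such as `Interests` inside an ad name would be left alone.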
Answer 1: (score: 2)
You can use `pd.Series.str.replace` together with `pd.Series.str.contains`:
mask = df['Ad'].str.contains(r'\.|'.join(set(df['Audience'])) + r'\.')
df.loc[mask,'Ad'] = df.loc[mask,'Ad'].str.replace(r'(Audience\d+\.)', '', regex=True)
df
Audience Ad
0 Audience1 Ad1.image
1 Audience2 Ad4.image
2 Audience3 Ad1.image
3 Audience4 Ad3.image
4 Audience5 Ad1.image
5 Audience6 Ad2.image
6 Audience7 Ad1.image
7 Audience8 Ad3.image
8 Audience9 Ad1.image
9 Audience10 Ad1.image
Example with a non-matching value:
df
Audience Ad
0 Audience1 Audience4.Ad1.image
1 Audience2 Audience1.Ad4.image
2 Audience3 Audience7.Ad1.image
3 Audience4 Audience2.Ad3.image
4 Audience5 Audience9.Ad1.image
5 Audience6 Audience4.Ad2.image
6 Audience7 Audience5.Ad1.image
7 Audience8 Audience7.Ad3.image
8 Audience9 Audience8.Ad1.image
9 Audience10 Audience9.Ad1.image
10 Audience12 Audience11.Ad11.image
mask = df['Ad'].str.contains(r'\.|'.join(set(df['Audience'])) + r'\.')
df.loc[mask,'Ad'] = df.loc[mask,'Ad'].str.replace(r'(Audience\d+\.)', '', regex=True)
df
Audience Ad
0 Audience1 Ad1.image
1 Audience2 Ad4.image
2 Audience3 Ad1.image
3 Audience4 Ad3.image
4 Audience5 Ad1.image
5 Audience6 Ad2.image
6 Audience7 Ad1.image
7 Audience8 Ad3.image
8 Audience9 Ad1.image
9 Audience10 Ad1.image
10 Audience12 Audience11.Ad11.image #---> Audience11 not deleted as 'Audience11' is not in `df['Audience']`
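A variation on this idea builds the pattern from the audience names themselves, so no separate mask is needed. This is a sketch under the assumption that the audience always sits at the start of `Ad` and is followed by a dot; the three-row frame is illustrative data, not the thread's full sample.

```python
import re
import pandas as pd

df = pd.DataFrame({
    "Audience": ["Audience1", "Audience2", "Audience12"],
    "Ad": ["Audience2.Ad1.image", "Audience1.Ad4.image", "Audience11.Ad11.image"],
})

# One alternative per audience name, escaped so that dots or other regex
# metacharacters in a name are matched literally. The group is anchored at
# the start of the string and must be followed by a literal dot.
names = sorted(set(df["Audience"]), key=len, reverse=True)
pattern = "^(?:" + "|".join(re.escape(n) for n in names) + r")\."

df["Ad"] = df["Ad"].str.replace(pattern, "", regex=True)
print(df["Ad"].tolist())
```

Requiring the trailing `\.` is what keeps `Audience1` from matching inside `Audience11`, so the last row is left untouched because `Audience11` is not in `df['Audience']`.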
Answer 2: (score: 1)
Use the `Series.str` methods together with `Series.isin` and `Series.where`:
s = df['Ad'].str.split('.')
m = s.str[0].isin(df['Audience'])
df['Ad'] = s.where(~m, s.str[1:]).str.join('.')
# print(df)
Audience Ad
0 Audience1 Ad1.image
1 Audience2 Ad4.image
2 Audience3 Ad1.image
3 Audience4 Ad3.image
4 Audience5 Ad1.image
5 Audience6 Ad2.image
6 Audience7 Ad1.image
7 Audience8 Ad3.image
8 Audience9 Ad1.image
9 Audience10 Ad1.image
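To see how this approach treats a first segment that is not a known audience, the same three lines can be run on a small frame. The two rows below are illustrative data chosen for this sketch.

```python
import pandas as pd

df = pd.DataFrame({
    "Audience": ["Audience1", "Audience12"],
    "Ad": ["Audience1.Ad4.image", "Audience11.Ad11.image"],
})

s = df["Ad"].str.split(".")                # each value becomes a list of parts
m = s.str[0].isin(df["Audience"])          # True where the first part is a known audience
# Keep the full list where m is False, otherwise drop the first part, then re-join.
df["Ad"] = s.where(~m, s.str[1:]).str.join(".")
print(df["Ad"].tolist())
```

Only the first segment is checked, so a value whose leading part is absent from `df['Audience']` passes through unchanged.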