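I'm trying to apply an email-cleaning function to a column and record the outcome in a separate column.
I'm not sure how to use .apply() to apply a function across two columns, but here is what I tried.
First, set up the DataFrame and a dictionary of common email mistakes: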
import pandas as pd
df = pd.DataFrame({'emails': ['jim@gmailcom', 'bob@gmail.com', 'mary@gmaicom', 'bobby@gmail.com'],
                   'result': ['', '', '', '']})
df
emails result
0 jim@gmailcom
1 bob@gmail.com
2 mary@gmaicom
3 bobby@gmail.com
# common mistakes:
correct_domain = {'gmailcom': 'gmail.com',
                  'gmaicom': 'gmail.com',
                  'gmaillom': 'gmail.com',
                  'gmalcom': 'gmail.com'}
Now I want to go through the emails and replace each misspelled domain with the correct one, e.g. gmailcom -> gmail.com
def clean_emails(x):
    # for each domain (key) in this dict (e.g. 'gmailcom': 'gmail.com')
    for mistake in correct_domain:
        # if the incorrect domain ('gmailcom') is in the email we're checking
        if mistake in x['emails']:
            # replace it with the dict value, which is the correctly formatted domain ('gmail.com')
            x['emails'] = x['emails'].replace(mistake, correct_domain[mistake])
            # record result
            x['result'] = 'email cleaned'
        else:
            x['result'] = 'no cleaning needed'
Then, when I apply this function, all I get back is None:
df.apply(clean_emails, axis=1)
0 None
1 None
2 None
3 None
dtype: object
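I tried adding a return, but couldn't work out how to return two separate values for the two separate columns.
The result I want, with the emails cleaned and the outcome recorded in result: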
emails result
0 jim@gmail.com 'email cleaned'
1 bob@gmail.com 'no cleaning needed'
2 mary@gmail.com 'email cleaned'
3 bobby@gmail.com 'no cleaning needed'
Edit: I thought adding return x at the end of the function would return the newly edited row, but the emails are not all cleaned:
emails result
0 jim@gmail.com email cleaned
1 bob@gmail.com no cleaning needed
2 mary@gmaicom no cleaning needed
3 bobby@gmail.com no cleaning needed
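The output in the edit is consistent with return x sitting inside the for loop, so that only the first dictionary key ('gmailcom') is ever tested; this is a guess, since the edited function is not shown. A minimal sketch of a row-wise version that sets a default result once, checks every key, and returns the row after the loop (the parameter name row and the use of break are choices made here, not part of the original):

```python
import pandas as pd

correct_domain = {'gmailcom': 'gmail.com',
                  'gmaicom': 'gmail.com',
                  'gmaillom': 'gmail.com',
                  'gmalcom': 'gmail.com'}

df = pd.DataFrame({'emails': ['jim@gmailcom', 'bob@gmail.com', 'mary@gmaicom', 'bobby@gmail.com'],
                   'result': ['', '', '', '']})

def clean_emails(row):
    row = row.copy()                      # work on a copy of the row
    row['result'] = 'no cleaning needed'  # default, overwritten only on a match
    for mistake, fix in correct_domain.items():
        if mistake in row['emails']:
            row['emails'] = row['emails'].replace(mistake, fix)
            row['result'] = 'email cleaned'
            break                         # stop after the first matching mistake
    return row                            # returned after the loop, not inside it

# apply returns a new DataFrame; assign it back to keep the changes
df = df.apply(clean_emails, axis=1)
print(df)
```

Note that apply does not modify df in place, so the returned DataFrame has to be assigned back.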
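Answer 0 (score: 1)
Use Series.str.contains to test which rows need cleaning and numpy.where to fill the result column conditionally, then use Series.str.replace with a callback so that only the matching rows are replaced using the dictionary: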
import numpy as np
pat = '|'.join(correct_domain.keys())
m = df['emails'].str.contains(pat, na=False)
df['result'] = np.where(m, 'email cleaned', 'no cleaning needed')
df.loc[m, 'emails'] = (df.loc[m, 'emails']
                       .str.replace(pat, lambda x: correct_domain[x.group()], regex=True))
print(df)
emails result
0 jim@gmail.com email cleaned
1 bob@gmail.com no cleaning needed
2 mary@gmail.com email cleaned
3 bobby@gmail.com no cleaning needed
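The joined pattern matches a misspelling anywhere in the address, and it treats each dictionary key as a regular expression. If that matters for your data, one possible refinement is to escape the keys with re.escape and anchor the pattern to the end of the string. This is a sketch under that assumption; the third sample address is made up purely for illustration:

```python
import re
import pandas as pd

correct_domain = {'gmailcom': 'gmail.com', 'gmaicom': 'gmail.com'}

# escape each key and only match it at the very end of the address
pat = r'(?:' + '|'.join(map(re.escape, correct_domain)) + r')$'

s = pd.Series(['jim@gmailcom', 'bob@gmail.com', 'gmaicom.fan@example.org'])

# the callback looks up the matched misspelling in the dictionary
cleaned = s.str.replace(pat, lambda m: correct_domain[m.group()], regex=True)
print(cleaned.tolist())
```

With the anchor in place, the third address is left alone because 'gmaicom' appears in the local part rather than at the end.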
Answer 1 (score: 1)
Why not a two-liner:
df['result'] = df['emails'].str.contains('|'.join(correct_domain)).map({True: 'email cleaned', False: 'no cleaning needed'})
df['emails'] = df['emails'].str.replace('|'.join(correct_domain), list(correct_domain.values())[0], regex=True)
Now:
print(df)
gives:
emails result
0 jim@gmail.com email cleaned
1 bob@gmail.com no cleaning needed
2 mary@gmail.com email cleaned
3 bobby@gmail.com no cleaning needed
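The two-liner uses the first dictionary value as the replacement for every match, which works here only because all the misspellings map to 'gmail.com'. If the dictionary ever mixed several target domains, a sketch along these lines might be closer to what is needed; it passes the whole dictionary to Series.replace with regex=True, and the 'yahoocom' entry and sample addresses are hypothetical additions for illustration:

```python
import pandas as pd

# 'yahoocom' is a made-up entry to show more than one target domain
correct_domain = {'gmailcom': 'gmail.com',
                  'gmaicom': 'gmail.com',
                  'yahoocom': 'yahoo.com'}

s = pd.Series(['jim@gmailcom', 'ann@yahoocom', 'bob@gmail.com'])

# each key is treated as a regex pattern and replaced by its own value
cleaned = s.replace(correct_domain, regex=True)
print(cleaned.tolist())
```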
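Answer 2 (score: 0)
I was thinking this over, and I see you have already been given several solutions. Following your own logic, we can get there like this: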
df = pd.DataFrame({'emails': ['jim@gmailcom', 'bob@gmail.com', 'mary@gmaicom', 'bobby@gmail.com']})
regexExp = [r'gmailcom$', r'gmaicom$', r'gmaillom', r'gmalcom']
df2 = df.replace(regex=regexExp, value='gmail.com')
result = []
for dfLines, df2Lines in zip(df.itertuples(), df2.itertuples()):
    if df2Lines.emails != dfLines.emails:
        result.append('email cleaned')
    else:
        result.append('no cleaning needed')
df2['result'] = result
print(df2)
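The loop over itertuples only compares each cleaned email with its original, so the same idea can probably be expressed without an explicit loop. A minimal sketch, assuming the two frames share the same index; the name patterns and the decision to anchor all four patterns with $ are choices made here rather than part of the answer:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'emails': ['jim@gmailcom', 'bob@gmail.com', 'mary@gmaicom', 'bobby@gmail.com']})

# same misspellings as above, all anchored to the end of the address
patterns = [r'gmailcom$', r'gmaicom$', r'gmaillom$', r'gmalcom$']
df2 = df.replace(patterns, 'gmail.com', regex=True)

# a row was cleaned exactly when its email differs from the original
changed = df2['emails'].ne(df['emails'])
df2['result'] = np.where(changed, 'email cleaned', 'no cleaning needed')
print(df2)
```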