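I'm new to pandas and I'm having a lot of trouble with this. I have searched but haven't found a solution, so I'm hoping one of you can help.
I have a pandas DataFrame with a column of email addresses that I'm trying to clean up. Some examples: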
>>> email['EMAIL']
0 testing@...com
1 NaN
2 I.am.ME@GAMIL.COM
3 FIRST.LAST.NAME@MAIL.CMO
4 EMAIL+REMOVE@TESTING.COM
Name: EMAIL, dtype: object
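There are several things I want to do here:
1) Replace misspelled endings (e.g. CMO) with the correct spelling (e.g. COM)
2) Replace misspelled domain names with the correct spelling
3) Replace multiple periods with a single period after the '@' symbol
4) Remove all periods before the '@' symbol if the address is a Gmail account
5) Remove all characters from the '+' symbol up to the '@' symbol
So, from the examples above, I would get back: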
>>> email['EMAIL']
0 testing@.com
1 NaN
2 IamME@GMAIL.COM
3 FIRST.LAST.NAME@MAIL.COM
4 EMAIL@TESTING.COM
Name: EMAIL, dtype: object
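I have tried many different pieces of code and keep running into errors. This is one of my best guesses so far, for removing the multiple periods after the '@' symbol.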
def remove_periods(email):
    email_split = email['EMAIL'].str.split('@')
    ending = email_split.str.get(-1)
    ending = ending.str.replace('\.{2,}', '.')
    emailupdate = email_split.str[:-1]
    emailupdate.append(ending)
    email_split.str.get()
    return '@'.join(emailupdate)

email['EMAIL'].apply(remove_periods)
I can post several other versions I have tried, but they all return errors.
Thanks a lot for your help!
Answer 0: (score: 0)
import numpy as np
import pandas as pd
pd.options.display.width = 1000
email = pd.DataFrame({'EMAIL':[
'testing@...com', np.nan, 'I.am.ME@GAMIL.COM', 'FIRST.LAST.NAME@MAIL.CMO',
'EMAIL+REMOVE@TESTING.COM', 'gamil@bar...com', 'noperiods@localhost']})
email[['NAME', '@', 'ADDR']] = email['EMAIL'].str.rpartition('@')
# regex=True is passed explicitly: since pandas 2.0, str.replace defaults to regex=False
# 1) replace misspelled endings (e.g. CMO) with correct spellings (e.g. COM)
email['ADDR'] = email['ADDR'].str.replace(r'(?i)CMO$', 'COM', regex=True)
# 2) replace misspelled domain names with correct spellings
email['ADDR'] = email['ADDR'].str.replace(r'(?i)GAMIL', 'GMAIL', regex=True)
# 3) replace multiple periods with just 1 period AFTER the '@' symbol.
email['ADDR'] = email['ADDR'].str.replace(r'[.]{2,}', '.', regex=True)
# 4) remove all periods before the '@' sign if they have a gmail account
# na=False makes rows with a missing address count as "not gmail"
mask = email['ADDR'].str.contains(r'(?i)^GMAIL[.]COM$', na=False)
email.loc[mask, 'NAME'] = email.loc[mask, 'NAME'].str.replace('.', '', regex=False)
# 5) remove all characters after the "+" symbol up to the '@' symbol
email['NAME'] = email['NAME'].str.replace(r'[+].*', '', regex=True)
# put it back together. You could reassign to email['EMAIL'] if you wish.
email['NEW_EMAIL'] = email['NAME'] + email['@'] + email['ADDR']
# clean up intermediate columns
# email = email.drop(columns=['NAME', '@', 'ADDR'])
print(email)
yields
                      EMAIL             NAME     @         ADDR                 NEW_EMAIL
0            testing@...com          testing     @         .com              testing@.com
1                       NaN              NaN  None         None                       NaN
2         I.am.ME@GAMIL.COM            IamME     @    GMAIL.COM           IamME@GMAIL.COM
3  FIRST.LAST.NAME@MAIL.CMO  FIRST.LAST.NAME     @     MAIL.COM  FIRST.LAST.NAME@MAIL.COM
4  EMAIL+REMOVE@TESTING.COM            EMAIL     @  TESTING.COM         EMAIL@TESTING.COM
5           gamil@bar...com            gamil     @      bar.com             gamil@bar.com
6       noperiods@localhost        noperiods     @    localhost       noperiods@localhost
The NAME column holds everything before the last @.
The ADDR column holds everything after the last @.
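To make the "last @" behaviour concrete, here is a small sketch using Python's built-in `str.rpartition`, which `Series.str.rpartition` applies to each element. The sample strings are made up for illustration and are not from the question.

```python
# Hypothetical address with two '@' signs, chosen only to show the split point.
name, sep, addr = "odd@local@example.com".rpartition("@")
print(name)  # odd@local
print(addr)  # example.com

# When there is no '@' at all, the whole string lands in the last slot.
no_at = "no-at-sign".rpartition("@")
print(no_at)  # ('', '', 'no-at-sign')
```

Because the split happens at the last '@', any stray '@' inside the local part stays in NAME and the domain cleaning only ever touches ADDR.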
I left the NAME and ADDR columns visible (and did not overwrite the original EMAIL column) so that the intermediate steps are easier to follow.
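If you would rather not keep intermediate columns at all, the same five steps could be wrapped in a plain function that cleans one address at a time. This is only a sketch using the standard `re` module, not the code above: `clean_email`, `TLD_FIXES` and `DOMAIN_FIXES` are hypothetical names, and the two lookup tables contain only the typos from the question.

```python
import re

# Hypothetical lookup tables; extend them with the typos seen in real data.
TLD_FIXES = {"CMO": "COM"}
DOMAIN_FIXES = {"GAMIL": "GMAIL"}


def clean_email(value):
    """Apply the five cleaning steps to one address; pass non-strings through."""
    if not isinstance(value, str):
        return value  # leaves NaN / None untouched
    name, sep, addr = value.rpartition("@")
    # 1) fix misspelled endings (case-insensitive, anchored at the end)
    for bad, good in TLD_FIXES.items():
        addr = re.sub(rf"(?i){re.escape(bad)}$", good, addr)
    # 2) fix misspelled domain names (case-insensitive)
    for bad, good in DOMAIN_FIXES.items():
        addr = re.sub(rf"(?i){re.escape(bad)}", good, addr)
    # 3) collapse runs of periods after the '@'
    addr = re.sub(r"[.]{2,}", ".", addr)
    # 4) gmail accounts: drop periods before the '@'
    if re.fullmatch(r"(?i)GMAIL[.]COM", addr):
        name = name.replace(".", "")
    # 5) drop everything from '+' up to the '@'
    name = re.sub(r"[+].*", "", name)
    return name + sep + addr


cleaned = [clean_email(e) for e in ["I.am.ME@GAMIL.COM", "EMAIL+REMOVE@TESTING.COM"]]
print(cleaned)
```

With the DataFrame from above, this could be applied as `email['EMAIL'].map(clean_email)`. It works row by row, so the vectorized `.str` version is probably the better fit when you want to inspect each step.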