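Replace substring after a character in a Python pandas DataFrame

Time: 2016-01-11 18:21:06

Tags: string pandas dataframe

I'm new to pandas and am having a lot of trouble with this. I have searched but haven't found a solution. Hopefully one of you can help me.

I have a pandas DataFrame with a column of emails that I am trying to clean. Some examples: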

>>> email['EMAIL']
0              testing@...com
1                         NaN
2           I.am.ME@GAMIL.COM
3    FIRST.LAST.NAME@MAIL.CMO
4    EMAIL+REMOVE@TESTING.COM
Name: EMAIL, dtype: object

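There are several things I want to do here:

1) Replace misspelled endings (e.g. CMO) with the correct spelling (e.g. COM)

2) Replace misspelled domain names with the correct spelling

3) Replace multiple periods after the '@' symbol with a single period

4) Remove all periods before the '@' symbol if it is a Gmail account

5) Remove all characters after the '+' symbol, up to the '@' symbol

So, from the example above, I would get back: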

>>> email['EMAIL']
0                testing@.com
1                         NaN
2             IamME@GMAIL.COM
3    FIRST.LAST.NAME@MAIL.COM
4           EMAIL@TESTING.COM
Name: EMAIL, dtype: object

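I have tried many different pieces of code and keep running into errors. Here is one of my best guesses so far, for removing the multiple periods after the '@' symbol.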

def remove_periods(email):
    email_split = email['EMAIL'].str.split('@')
    ending = email_split.str.get(-1)
    ending = ending.str.replace('\.{2,}', '.') 
    emailupdate = email_split.str[:-1]
    emailupdate.append(ending)
    email_split.str.get()
    return '@'.join(emailupdate)
email['EMAIL'].apply(remove_periods)

I could post several other versions as well, but they all return errors.

Thank you very much for your help!

1 Answer:

Answer 0: (score: 0)

import numpy as np
import pandas as pd

pd.options.display.width = 1000
email = pd.DataFrame({'EMAIL':[
    'testing@...com', np.nan, 'I.am.ME@GAMIL.COM', 'FIRST.LAST.NAME@MAIL.CMO', 
    'EMAIL+REMOVE@TESTING.COM', 'gamil@bar...com', 'noperiods@localhost']})

email[['NAME', '@', 'ADDR']] = email['EMAIL'].str.rpartition('@')

# regex=True is passed explicitly below: since pandas 2.0, str.replace
# treats the pattern as a literal string by default.
# 1) replace misspelled endings (e.g. CMO) with correct spellings (e.g. COM)
email['ADDR'] = email['ADDR'].str.replace(r'(?i)CMO$', 'COM', regex=True)
# 2) replace misspelled domain names with correct spellings 
email['ADDR'] = email['ADDR'].str.replace(r'(?i)GAMIL', 'GMAIL', regex=True)
# 3) replace multiple periods with just 1 period AFTER the '@' symbol. 
email['ADDR'] = email['ADDR'].str.replace(r'[.]{2,}', '.', regex=True)
# 4) remove all periods before the '@' sign if they have a gmail account 
# na=False makes missing addresses count as "not Gmail" in the boolean mask
mask = email['ADDR'].str.contains(r'(?i)^GMAIL[.]COM$', na=False)
email.loc[mask, 'NAME'] = email.loc[mask, 'NAME'].str.replace(r'[.]', '', regex=True)
# 5) remove all characters after the "+" symbol up to the '@' symbol
email['NAME'] = email['NAME'].str.replace(r'[+].*', '', regex=True)

# put it back together. You could reassign to email['EMAIL'] if you wish.
email['NEW_EMAIL'] = email['NAME'] + email['@'] + email['ADDR']

# clean up intermediate columns (del does not accept a list of columns, so use drop)
# email = email.drop(columns=['NAME', '@', 'ADDR'])
print(email)

yields

                      EMAIL             NAME     @         ADDR                 NEW_EMAIL
0            testing@...com          testing     @         .com              testing@.com
1                       NaN              NaN  None         None                       NaN
2         I.am.ME@GAMIL.COM            IamME     @    GMAIL.COM           IamME@GMAIL.COM
3  FIRST.LAST.NAME@MAIL.CMO  FIRST.LAST.NAME     @     MAIL.COM  FIRST.LAST.NAME@MAIL.COM
4  EMAIL+REMOVE@TESTING.COM            EMAIL     @  TESTING.COM         EMAIL@TESTING.COM
5           gamil@bar...com            gamil     @      bar.com             gamil@bar.com
6       noperiods@localhost        noperiods     @    localhost       noperiods@localhost

The NAME column holds everything before the last @. The ADDR column holds everything after the last @.
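
To make the "last @" behaviour concrete, here is a small sketch of what `str.rpartition('@')` returns for a few edge cases. The sample values are hypothetical and are not taken from the question's data.

```python
import numpy as np
import pandas as pd

# Hypothetical sample values chosen only to show the edge cases.
s = pd.Series(['a@b@c.com', 'noatsign', np.nan])

# rpartition splits on the LAST '@' and always produces three columns.
parts = s.str.rpartition('@')
parts.columns = ['NAME', '@', 'ADDR']
print(parts)
```

Note that a value with no '@' at all ends up entirely in ADDR with an empty NAME, so the domain fixes above would be applied to the whole string. If that matters for your data, you may want to filter such rows first.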

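I left the NAME and ADDR columns visible (and did not overwrite the original EMAIL column) so that the intermediate steps are easier to understand.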
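
As for why the attempt in the question errors out: `Series.apply` calls the function once per element, so `remove_periods` receives a single string (or NaN) and `email['EMAIL']` fails on it. If you prefer the `apply` style, the same five rules can be written per element. This is only a sketch. `clean_email` is a hypothetical helper name, and it uses the standard `re` module instead of the `.str` accessor.

```python
import re

import numpy as np
import pandas as pd


def clean_email(value):
    """Apply the five cleaning rules to one address (hypothetical helper)."""
    # apply() passes each element; missing values arrive as float NaN.
    if not isinstance(value, str):
        return value
    name, at, addr = value.rpartition('@')
    addr = re.sub(r'(?i)CMO$', 'COM', addr)      # 1) misspelled ending
    addr = re.sub(r'(?i)GAMIL', 'GMAIL', addr)   # 2) misspelled domain
    addr = re.sub(r'[.]{2,}', '.', addr)         # 3) collapse repeated periods
    if re.search(r'(?i)^GMAIL[.]COM$', addr):    # 4) Gmail: drop periods in name
        name = name.replace('.', '')
    name = re.sub(r'[+].*', '', name)            # 5) drop everything from '+'
    return name + at + addr


emails = pd.Series(['testing@...com', np.nan, 'I.am.ME@GAMIL.COM',
                    'FIRST.LAST.NAME@MAIL.CMO', 'EMAIL+REMOVE@TESTING.COM'])
cleaned = emails.apply(clean_email)
print(cleaned)
```

On large frames the vectorized `.str` version above will likely be faster, but the per-element form can be easier to unit test one address at a time.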