Extract multiple occurrences of string from dataframe column and parse into separate columns

Time: 2019-04-17 02:42:51

Tags: python regex pandas extract

Scenario

I'm parsing out data from one dataframe column into multiple dataframe columns. Specifically, I want to parse out all the phone numbers from a column full of emails. After I parse out the phone numbers, I want to remove those phone numbers from the original email column.

My Attempt

I start with a column in a dataframe, called "email", full of emails.

I am able to successfully parse out the first occurrence of a phone number, using regex, with the following line:

df['phone_num_1'] = df['email'].str.extract(r'(\(?\d\d\d\)?-? ?\.?\d\d\d-?\.?\d\d\d\d?)')

Running this line again, but with a new column name, captures the original phone number again...

I am able to remove all occurrences of phone numbers using the following line:

df['email'] = df['email'].replace(r'(\(?\d\d\d\)?-? ?\.?\d\d\d-?\.?\d\d\d\d?)', '', regex=True)

Now all the phone numbers are gone and I lost the second phone number.

What I Need Help With

If there are two occurrences of a phone number in my original email column, how do I capture the second occurrence? Ideally, I would like for that second occurrence of a phone number to be parsed out into its own column.

In the end, I would have 3 columns: email, phone_num_1, phone_num_2

The email column will no longer have any phone numbers.

I appreciate the help in advance!

Adding example email from dataframe

The email column might contain a cell with the following string:

Installed new heat pump. System is up and running with no leaks. Gave tenant orientation on new heat pump. installed new aqua cal heat pump Email: example@domain.com | Phone: (123) 456-7890 pool heater is not working. Please contact resident at 234.567.8901. Vendor Paid Pool/Spa Heater Equipment Pool/Spa 10088

Note the two distinct phone numbers.

I need each phone number extracted from that string and placed into columns of their own.

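1 Answer:

Answer 0 (score: 0)

Sorry, without more information about your dataframe I don't fully understand what you're after. But since the problem is capturing the second phone number, I can help with the regex. I got it to recognize the email, phone 1, and phone 2.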

import re
import pandas as pd

data = {"Email": ["Installed new heat pump. System is up and running with no leaks. Gave tenant orientation on new heat pump. installed new aqua cal heat pump Email: example@domain.com | Phone: (123) 456-7890 pool heater is not working. Please contact resident at 234.567.8901. Vendor Paid Pool/Spa Heater Equipment Pool/Spa 10088"]}
df = pd.DataFrame(data)

# Named groups: the email address, an optional first phone number, and a second phone number.
for item in df['Email']:
    reg = re.search(r"(?P<email>\S+@\S+)\D+(?P<ph1>\d{3}\D+\d{3}\D+\d{4})?.*(?P<ph2>\d{3}\D+\d{3}\D+\d{4})", item)
    if reg:  # re.search returns None when the text does not match
        print(list(reg.groups()))
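
If the goal is one column per phone number, as the question describes, a pandas-only route may also work. `str.extract` returns only the first match in each row, while `str.extractall` returns every match. The sketch below is one possible approach, not a tested solution for the real data. The `PHONE` pattern is my own simplified assumption about the phone formats (it is not the asker's original regex), and the second row is made-up sample data to show what happens when no number is present.

```python
import pandas as pd

# Assumed pattern (hypothetical, adjust to the real data): optional "(",
# 3 digits, optional ")", optional separator, 3 digits, optional separator,
# 4 digits. Separators are a space, a dot, or a hyphen.
PHONE = r"\(?\d{3}\)?[ .-]?\d{3}[ .-]?\d{4}"

df = pd.DataFrame({"email": [
    "Email: example@domain.com | Phone: (123) 456-7890 pool heater is not "
    "working. Please contact resident at 234.567.8901. Vendor Paid 10088",
    "No phone number in this row.",  # made-up row without any match
]})

# extractall needs a capture group; it returns one row per match, indexed by
# (original row label, match number). unstack pivots match numbers to columns.
found = df["email"].str.extractall(f"({PHONE})")[0].unstack("match")
found.columns = [f"phone_num_{i + 1}" for i in found.columns]

# Strip the numbers from the text only after they have been captured.
df["email"] = df["email"].str.replace(PHONE, "", regex=True)

# Rows with no matches are absent from `found`, so the join fills them with NaN.
df = df.join(found)
print(df[["phone_num_1", "phone_num_2"]])
```

The order matters here: capture first, then remove, so the second number is not lost. Rows with more than two numbers would simply produce extra `phone_num_N` columns.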