Extract a substring from each value in a pandas Series

Time: 2018-12-03 13:13:37

Tags: python regex-lookarounds

I have a column A with values such as ABC01P20180821123758. The values can vary: ABC01N20180821123758 ('N' instead of 'P'), P20180706035955-1, or 45312343P20180821143257-1.

I want to extract only the year, month and date after P or N.

I have tried several posts and solutions here; one attempt is shown below. I can extract the values after P and N, but I get the entire string that follows. I cannot take a substring for the year, month and date because this is a Series and I cannot treat 'match' as a string, so I am stuck there. Is there a better way to do this?

for line in columnname:
    match = re.search('P(\d+)', line)
    match = re.search('N(\d+)', line)
    if match:
        print (match.group(1))

print(match.group(1)) outputs the entire digit string after P or N. When I print(match) instead, the output is None.

How can I take these values into a string and subset or split it?

Updated code:

for line in df.column1:
    match = re.search('P|N([0-9]{6})', line)
    if match:
        print(match.group(1))
        for line in  {match.group(1)}: #for every observation in the column that is matched
                 line = 1
                 while line < len(match.group(1)):

                     a = pd.DataFrame({'Date':  {match.group(1)}})  # Creates a new column in a new DataFrame. This is where my problem is: even though the IPython console prints every matched observation, when I write to Excel only the last observation is written, and in {} format. I am unable to fix this.

                     a.append('Date', axis=1)
                     line += 1

                     frames = [df, a]

                     result = pd.concat(frames) #concatenated dfs
                     print(result)

                     result.to_csv("D://A.csv", index = False)

2 answers:

Answer 0 (score: 0)

Try pattern r"(P|N)(\d{8})"

Example:

import re

s = """ABC01P20180821123758 ABC01N20180821123758 P20180706035955-1 45312343P20180821143257-1"""
print(re.findall(r"(P|N)(\d{8})", s))

Output:

[('P', '20180821'), ('N', '20180821'), ('P', '20180706'), ('P', '20180821')]
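Since the question works on a pandas column, the same idea can probably be applied without an explicit loop. The following is only a sketch: the sample DataFrame is made up from the values in the question, the column names "Date" and "Date_parsed" are hypothetical, and the pattern uses a single capture group so that `Series.str.extract` returns just the eight digits.

```python
import pandas as pd

# Hypothetical sample data built from the values in the question;
# the column name "A" follows the question's description.
df = pd.DataFrame({"A": [
    "ABC01P20180821123758",
    "ABC01N20180821123758",
    "P20180706035955-1",
    "45312343P20180821143257-1",
]})

# [PN] matches the marker letter; the single capture group keeps the
# next 8 digits (YYYYMMDD). expand=False returns a Series, not a DataFrame.
df["Date"] = df["A"].str.extract(r"[PN](\d{8})", expand=False)

# Optional: turn the 8-digit strings into real timestamps.
df["Date_parsed"] = pd.to_datetime(df["Date"], format="%Y%m%d")

print(df)
```

Rows that do not match the pattern would get NaN in "Date", so every row of the original frame is kept and the result can be written out with a single `to_csv` call after the extraction.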

Answer 1 (score: 0)

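Replace 'P(\d+)' with '([NP])([0-9]{8})'. The date is then in match.group(2). (Inside a character class, '|' is a literal character, so [NP] is enough.)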
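To show how this fits into the loop from the question, here is a minimal sketch using only the `re` module. It assumes the values are plain strings; the list name `values` and the named groups `year`, `month`, `day` are illustrative and not from the original post. Note that in the original loop the second `re.search` call overwrites the result of the first, which is why values containing P ended up as None.

```python
import re

# Sample values taken from the question.
values = [
    "ABC01P20180821123758",
    "ABC01N20180821123758",
    "P20180706035955-1",
    "45312343P20180821143257-1",
]

# One pattern handles both markers; the named groups split the
# 8 digits after N or P into year, month and day.
pattern = re.compile(r"[NP](?P<year>\d{4})(?P<month>\d{2})(?P<day>\d{2})")

dates = []
for value in values:
    match = pattern.search(value)
    # Append inside the loop so every row is kept, not only the last one.
    dates.append(match.group("year", "month", "day") if match else None)

print(dates)
```

Collecting the results in a list and building the output once after the loop avoids the situation in the updated code, where the file is rewritten on every iteration and only the last match survives.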