I have a column A with observations such as ABC01P20180821123758. The observations can vary: ABC01N20180821123758 ('N' instead of 'P'), P20180706035955-1, or 45312343P20180821143257-1.
I want to extract only the year, month and day that follow the P or N.
I tried several posts and solutions here; one of them is shown below. I can extract the value after P or N, but it gives me the entire digit string that follows. I cannot take a substring for the year, month and day from it, because the column is a series and I cannot pass 'match' as a string, so I am stuck there. Is there a better way to do this?
for line in columnname:
    match = re.search('P(\d+)', line)
    match = re.search('N(\d+)', line)
    if match:
        print (match.group(1))
print(match.group(1)) outputs the entire digit string after P or N. However, when I print(match), the output is None.
How can I take these values into a string and subset or split it?
_______________Updated code__________________________________
for line in df.column1:
    match = re.search('P|N([0-9]{6})', line)
    if match:
        print(match.group(1))
        for line in {match.group(1)}:  # for every observation in the column that is matched
            line = 1
            while line < len(match.group(1)):
                # Created a new column in a new DataFrame. This is where my problem is.
                # Even though the IPython console prints all matched observations,
                # when I write to Excel only the last observation is written, and in {} format.
                # I am unable to fix this.
                a = pd.DataFrame({'Date': {match.group(1)}})
                a.append('Date', axis=1)
                line += 1
frames = [df, a]
result = pd.concat(frames)  # concatenated DataFrames
print(result)
result.to_csv("D://A.csv", index = False)
Answer 0 (score: 0)
Try the pattern r"(P|N)(\d{8})"
Example:
import re
s = """ABC01P20180821123758 ABC01N20180821123758 P20180706035955-1 45312343P20180821143257-1"""
print(re.findall(r"(P|N)(\d{8})", s))
Output:
[('P', '20180821'), ('N', '20180821'), ('P', '20180706'), ('P', '20180821')]
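Since the data sits in a DataFrame column, the same idea can be applied to the whole column at once instead of looping. The following is only a minimal sketch, assuming pandas is installed and the column is named A as in the question; the frame `df` and the group names `year`, `month` and `day` are illustrative choices, not part of the original code.

```python
import pandas as pd

# Sample frame built from the observations quoted in the question.
df = pd.DataFrame({"A": [
    "ABC01P20180821123758",
    "ABC01N20180821123758",
    "P20180706035955-1",
    "45312343P20180821143257-1",
]})

# [PN] matches either marker; the three named groups split the 8 date digits.
# str.extract returns one column per group, named after the group.
parts = df["A"].str.extract(r"[PN](?P<year>\d{4})(?P<month>\d{2})(?P<day>\d{2})")

# Attach the extracted pieces to the original rows.
result = df.join(parts)
print(result)
```

The extracted values are strings, and a row that does not match the pattern gets NaN in all three new columns. Because every row is handled in one call, writing `result` with to_csv includes all observations rather than only the last one.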
Answer 1 (score: 0)
Replace 'P(\d+)' with '([N|P])([0-9]{8})'. Inside a character class the | is a literal character, so [NP] is sufficient.
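A loop-based version of this replacement might look like the sketch below. It assumes the observations are available as a plain list (the names `observations`, `pattern` and `dates` are made up for illustration) and uses only the standard library. It also shows why the pattern 'P|N([0-9]{6})' from the updated code prints None: the alternation splits the whole expression into "P" or "N followed by six digits", so when P matches, group 1 is empty.

```python
import re
from datetime import datetime

observations = [
    "ABC01P20180821123758",
    "ABC01N20180821123758",
    "P20180706035955-1",
    "45312343P20180821143257-1",
]

# In 'P|N([0-9]{6})' the group belongs only to the N branch,
# so a match on P leaves group(1) as None.
broken = re.search("P|N([0-9]{6})", observations[0])
print(broken.group(1))

# [NP] matches either marker; the group captures the 8 date digits.
pattern = re.compile(r"[NP](\d{8})")

dates = []  # collect every match instead of overwriting a single variable
for line in observations:
    match = pattern.search(line)
    if match:
        # strptime validates the digits and exposes year, month and day.
        dates.append(datetime.strptime(match.group(1), "%Y%m%d").date())

for d in dates:
    print(d.year, d.month, d.day)
```

Appending to a list inside the loop keeps every matched value, so the results can be turned into a column afterwards rather than being replaced on each iteration.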