I have a text file containing data like that shown below.
From these multiple sets of data, I need to extract one specific record, e.g. 10238679000 C-73652 , 5123 & 23154, 25734. The C-73652 record may or may not be present in every data set.
How can I achieve this with regex? I feel regex is the best option.
Or is there a better approach?
test_file.txt
Recieved request #STARTS
Data getting generated for : "time":[10238679000]
................. #CAN BE ANYTHING, BUT FEW LINES HERE
Starting data from 10238679000
A-123456 data 679720 for instance: [1452]
C-73652 data 5123 for instance: [23154, 25734]
B-967845 data 73421 for instance: [37451]
G-809573 data 38456 for instance: [92673] #ENDS
Recieved request #NEXT SET STARTS
may be same data as above or different data
In general it can have multiple set of such data
.............................. #CAN BE ANYTHING, BUT FEW LINES HERE
..............................
# SECOND SET ENDS
Recieved request #REPEATS AGAIN
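How can I solve this with a regular expression?
Sample output:
At 10238679000, C-73652 generated data of 5123 units with instance 23154, 25734
If C-73652 is present in another data set, the same kind of line should be generated for that set as well.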
Answer 0 (score: 1)
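You can use one regex to match the header line and store the start time, and then a second regex for each data line.
Starting data from (\d*)
should work for the header line, and
([A-Z]-\d*)?\s*data\s*(\d*).*:\s*\[([\d*, ]*)\]
for the data lines.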
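To make the capture groups concrete, here is a small sketch that applies the two patterns to one header line and one data line taken from the question. The variable names (`header`, `row`, `record_id`, `units`, `instances`) are only illustrative and are not part of the original answer.

```python
import re

# Patterns copied from the answer text.
begin_rex = re.compile(r'Starting data from (\d*)')
line_rex = re.compile(r'([A-Z]-\d*)?\s*data\s*(\d*).*:\s*\[([\d*, ]*)\]')

# Group 1 of the header pattern is the start time.
header = begin_rex.search('Starting data from 10238679000')
start_time = header.group(1)

# The data pattern yields three groups: the ID, the data value,
# and the raw contents of the square brackets.
row = line_rex.search('C-73652 data 5123 for instance: [23154, 25734]')
record_id, units, instances = row.groups()

print(start_time)
print(record_id, units, instances)
```

Note that the ID group is optional, so a line without an ID still matches and yields an empty string for that group.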
Driver (not the cleanest or best implementation, for demonstration only):
import re

test_data = """
Starting data from 10238679000
A-123456 data 679720 for instance: [1452]
C-73652 data 5123 for instance: [23154, 25734]
B-967845 data 73421 for instance: [37451]
G-809573 data 38456 for instance: [92673]
data 38456 for instance: [92673]
blah blah
Starting data from 121212
A-123456 data 679720 for instance: [1452]
C-73652 data 5123 for instance: [23154, 25734, 122121]]
B-967845 data 73421 for instance: [37451]
G-809573 data 38456 for instance: [92673]
data 38456 for instance: [92673]
"""

# Header line: captures the start time of a data set.
begin_rex = re.compile(r'Starting data from (\d*)')
# Data line: optional ID, data value, and the bracketed instance list.
line_rex = re.compile(r'([A-Z]-\d*)?\s*data\s*(\d*).*:\s*\[([\d*, ]*)\]')

current_time, match_line_rex = '', False
for line in test_data.splitlines():
    if not match_line_rex:
        # Outside a data set: look for the next header line.
        begin = begin_rex.findall(line)
        if begin:
            current_time = int(begin[0])
            match_line_rex = True
    else:
        data = line_rex.findall(line)
        if data:
            data = list(data[0])
            data[2] = ' & '.join([dat.strip() for dat in data[2].split(',')])
            print('{}\t{}'.format(current_time, '\t'.join(data)))
        else:
            # The first non-matching line ends the current data set.
            match_line_rex = False
Output:
10238679000 A-123456 679720 1452
10238679000 C-73652 5123 23154 & 25734
10238679000 B-967845 73421 37451
10238679000 G-809573 38456 92673
10238679000 38456 92673
121212 A-123456 679720 1452
121212 C-73652 5123 23154 & 25734 & 122121
121212 B-967845 73421 37451
121212 G-809573 38456 92673
121212 38456 92673
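The output above lists every record, whereas the question asks for a single ID and a sentence-style line. The following is a minimal sketch of how the same two-pattern idea could be narrowed to one ID. The function name `report`, its `target_id` parameter and the shortened `sample` text are hypothetical, and the sketch assumes every data set has its own "Starting data from" line before its records.

```python
import re

BEGIN_REX = re.compile(r'Starting data from (\d+)')

def report(text, target_id='C-73652'):
    """Return one sentence per data set that contains target_id."""
    # Built per call so any ID can be searched; re.escape guards
    # against regex metacharacters in the ID.
    row_rex = re.compile(
        r'^\s*' + re.escape(target_id) + r'\s+data\s+(\d+).*?:\s*\[([\d, ]*)\]'
    )
    results = []
    current_time = None
    for line in text.splitlines():
        begin = BEGIN_REX.search(line)
        if begin:
            # Remember the start time of the data set we are now in.
            current_time = begin.group(1)
            continue
        row = row_rex.search(line)
        if row and current_time is not None:
            units, raw = row.groups()
            instances = ', '.join(p.strip() for p in raw.split(',') if p.strip())
            results.append(
                'At {}, {} generated data of {} units with instance {}'.format(
                    current_time, target_id, units, instances))
    return results

# Shortened sample: the second set has no C-73652 record.
sample = """\
Recieved request
Data getting generated for : "time":[10238679000]
Starting data from 10238679000
A-123456 data 679720 for instance: [1452]
C-73652 data 5123 for instance: [23154, 25734]
B-967845 data 73421 for instance: [37451]
Recieved request
Starting data from 121212
A-123456 data 679720 for instance: [1452]
"""

for sentence in report(sample):
    print(sentence)
```

A data set that lacks the target ID simply contributes no line, which matches the "may or may not be present" requirement in the question.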
Answer 1 (score: 0)
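Unfortunately, your question is missing some details, so I took the liberty of making a few assumptions. The following regex extracts the second line of each data block. Group 1 captures the time value 10238679000, and group 2 captures C-73652 data 5123 for instance: [23154, 25734]. I assumed you only want the line that sits between the lines starting with A and B.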
re.findall(r'(?:Starting data from )([\d]+)\nA-.*?\n(.*)\nB', test_file)
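Since the question says C-73652 may be missing from a set, relying on the neighbouring A and B lines can be fragile. As a variation on the single-`findall` idea (not the answerer's code), the sketch below anchors on the ID itself and uses a negative lookahead so that a match never crosses into the next data set. The `test_file` string here is a made-up sample, and the pattern assumes the record starts at the beginning of its line.

```python
import re

# Hypothetical sample: the second set has no C-73652 line,
# and the third set has it directly after the header.
test_file = """\
Starting data from 10238679000
A-123456 data 679720 for instance: [1452]
C-73652 data 5123 for instance: [23154, 25734]
B-967845 data 73421 for instance: [37451]
Starting data from 121212
A-123456 data 679720 for instance: [1452]
B-967845 data 73421 for instance: [37451]
Starting data from 343434
C-73652 data 9 for instance: [7]
"""

pattern = re.compile(
    r'Starting data from (\d+)\n'          # group 1: start time
    r'(?:(?!Starting data from).*\n)*?'    # skip lines, but stay inside this set
    r'(C-73652 .*)'                        # group 2: the wanted record line
)

matches = pattern.findall(test_file)
for start_time, record in matches:
    print(start_time, '|', record)
```

Sets without the record produce no match at all, so the result contains only the first and third sets here.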