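How to extract multiple patterns from a large file with repeating data blocks?

Time: 2019-02-20 02:59:10

Tags: python regex

I have a text file containing data like the sample below. From these repeated sets of data I need to extract specific values, for example 10238679000, C-73652, 5123, and 23154, 25734. C-73652 may or may not be present in every data set.

How can I achieve this with regex? I feel regex is the best option.
Or is there a better approach?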

test_file.txt

Recieved request        #STARTS
Data getting generated for : "time":[10238679000]
.................   #CAN BE ANYTHING, BUT FEW LINES HERE
Starting data from 10238679000
A-123456 data 679720 for instance:  [1452]
C-73652 data 5123 for instance:  [23154, 25734]
B-967845 data 73421 for instance:  [37451]
G-809573 data 38456 for instance:  [92673]     #ENDS
Recieved request     #NEXT SET STARTS
may be same data as above or different data
In general it can have multiple set of such data
..............................   #CAN BE ANYTHING, BUT FEW LINES HERE
..............................
# SECOND SET ENDS
Recieved request  #REPEATS AGAIN

How can I solve this with a regular expression?

Sample output:

At 10238679000, C-73652 generated data of 5123 units with instance 23154, 25734

If C-73652 is present in another data set, a line like the one above should be generated for that data set as well.

2 answers:

Answer 0: (score: 1)

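You can use a separate regex to match the header line and store the start time. Then apply a second regex to each line that follows.

Starting data from (\d*) should work for the first line.

([A-Z]-\d*)?\s*data\s*(\d*).*:\s*\[([\d*, ]*)\] is for the data lines.

Driver (not the cleanest or best implementation, for demonstration only):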

import re

test_data = """
Starting data from 10238679000
A-123456 data 679720 for instance:  [1452]
C-73652 data 5123 for instance:  [23154, 25734]
B-967845 data 73421 for instance:  [37451]
G-809573 data 38456 for instance:  [92673]     
data 38456 for instance:  [92673]
blah blah
Starting data from 121212
A-123456 data 679720 for instance:  [1452]
C-73652 data 5123 for instance:  [23154, 25734, 122121]]
B-967845 data 73421 for instance:  [37451]
G-809573 data 38456 for instance:  [92673]     
data 38456 for instance:  [92673]

"""

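# Header pattern: captures the start time of a data set.
begin_rex = re.compile(r'Starting data from (\d*)')
# Data-line pattern: optional ID, data value, bracketed instance list.
line_rex = re.compile(r'([A-Z]-\d*)?\s*data\s*(\d*).*:\s*\[([\d*, ]*)\]')

current_time, match_line_rex = '', False
for line in test_data.splitlines():
    if not match_line_rex:
        # Outside a data set: look for the next header line.
        begin = begin_rex.findall(line)
        if begin:
            current_time = int(begin[0])
            match_line_rex = True
    else:
        data = line_rex.findall(line)
        if data:
            data = list(data[0])
            data[2] = ' & '.join([dat.strip() for dat in data[2].split(',')])
            # print() with a single argument works on both Python 2 and 3.
            print('{}\t{}'.format(current_time, '\t'.join(data)))
        else:
            # First non-matching line ends the current data set.
            match_line_rex = False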

Output:

10238679000 A-123456    679720  1452
10238679000 C-73652 5123    23154 & 25734
10238679000 B-967845    73421   37451
10238679000 G-809573    38456   92673
10238679000     38456   92673
121212  A-123456    679720  1452
121212  C-73652 5123    23154 & 25734 & 122121
121212  B-967845    73421   37451
121212  G-809573    38456   92673
121212      38456   92673
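
If only one ID is wanted, in the sentence form shown in the question, the same two-pattern idea can be applied while iterating over the file line by line, which avoids loading a large file into memory. The following is a minimal sketch and not part of the original answer: the function name extract_target and its parameters are hypothetical, the patterns are tightened variants of the ones above, and it assumes every data line has the "<ID> data <n> for instance:  [...]" shape from the sample.

```python
import re

# Tightened variants of the answer's patterns (an assumption, not the original).
BEGIN_REX = re.compile(r'Starting data from (\d+)')
LINE_REX = re.compile(
    r'^([A-Z]-\d+)\s+data\s+(\d+)\s+for instance:\s*\[([\d, ]*)\]')


def extract_target(lines, target_id):
    """Yield one sentence per data set in which target_id appears.

    `lines` is any iterable of strings, for example an open file object.
    """
    current_time = None
    for line in lines:
        begin = BEGIN_REX.search(line)
        if begin:
            # Remember the start time of the data set we are entering.
            current_time = begin.group(1)
            continue
        data = LINE_REX.match(line)
        if data and current_time is not None and data.group(1) == target_id:
            instances = ', '.join(s.strip() for s in data.group(3).split(','))
            yield 'At {}, {} generated data of {} units with instance {}'.format(
                current_time, data.group(1), data.group(2), instances)


sample = """Recieved request
Data getting generated for : "time":[10238679000]
Starting data from 10238679000
A-123456 data 679720 for instance:  [1452]
C-73652 data 5123 for instance:  [23154, 25734]
B-967845 data 73421 for instance:  [37451]
Recieved request
Starting data from 121212
A-123456 data 679720 for instance:  [1452]
"""

results = list(extract_target(sample.splitlines(), 'C-73652'))
for sentence in results:
    print(sentence)
```

For a real file, an open file object can be passed in place of sample.splitlines(), since the function only needs an iterable of lines. The second data set in the sample has no C-73652 line, so it produces no output.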

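Answer 1: (score: 0)

Unfortunately, your question lacks some details, so I took the liberty of making a few assumptions. The following regex extracts the second line of each data block. Group 1 captures the time value 10238679000, and group 2 captures C-73652 data 5123 for instance: [23154, 25734]. I assume you only want to extract the line between the lines that start with A and B.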

re.findall(r'(?:Starting data from )([\d]+)\nA-.*?\n(.*)\nB', test_file)

See it in action here
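
To make the capture groups concrete, here is a small sketch that runs the regex from this answer on a shortened copy of the sample. It assumes test_file is the whole file content held as one string (for example the result of reading test_file.txt); the inlined sample text is illustrative only.

```python
import re

# Assumption: test_file holds the whole file as a single string.
# A shortened copy of the question's sample is inlined here.
test_file = (
    'Starting data from 10238679000\n'
    'A-123456 data 679720 for instance:  [1452]\n'
    'C-73652 data 5123 for instance:  [23154, 25734]\n'
    'B-967845 data 73421 for instance:  [37451]\n'
    'G-809573 data 38456 for instance:  [92673]\n'
)

# Group 1: the time value; group 2: the single line between the A and B lines.
matches = re.findall(
    r'(?:Starting data from )([\d]+)\nA-.*?\n(.*)\nB', test_file)

for time_value, line in matches:
    print(time_value, '|', line)
```

Because . does not match a newline by default, group 2 can only ever be a single line. The pattern therefore finds nothing for a block where the C line is missing or where more than one line sits between the A and B lines, and it needs the whole file in memory, which may matter for a large file.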