How to collect all lines of data between keywords in a file, starting and ending at line breaks

Asked: 2018-11-08 21:48:30

Tags: python regex python-3.x parsing

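I am trying to collect specific information from a very large log file, but I cannot figure out how to get the behavior I need.

For reference, a sample log looks something like this: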

garbage I don't need - garbage I don't need
timestamp - date - server info - 'keyword 1' - data
more data more data more data more data
more data more data more data more data
more data more data 'keyword 2' - last bit of data
garbage I don't need - garbage I don't need

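I need to find 'keyword 1', grab the entire line that keyword 1 is on (back to the timestamp), and then capture all subsequent lines up to and including the entire line that 'keyword 2' is on (through the last bit of data).

I have tried a few things so far. I cannot get decent results with the re methods (findall, match, search, etc.); I cannot figure out how to grab the data before the match (even with a lookbehind), but more importantly, I cannot figure out how to make the capture stop at a phrase rather than at a single character.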

for match in re.findall('keyword1[keyword2]+|', showall.read()):

I also tried something like this:

start_capture = False
for current_line in fileName:
    if 'keyword1' in current_line:
        start_capture = True
    if start_capture:
        new_list.append(current_line)
    if 'keyword2' in current_line:
        return(new_list)

No matter what I tried, this returned an empty list.

Finally, I tried something like this:

def takewhile_plus_next(predicate, xs):
    for x in xs:
        if not predicate(x):
            break
        yield x
    yield x
with lastdb as f:
    lines = map(str.rstrip, f)
    skipped = dropwhile(lambda line: 'Warning: fatal assert' not in line, lines)
    lines_to_keep = takewhile_plus_next(lambda line: 'uptime:' not in line, skipped)

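This last attempt captured everything from keyword 1 to EOF, including nearly 100,000 lines of garbage data.

3 Answers:

Answer 0 (score: 1)

You can use a regular expression if you specify re.DOTALL and use the lazy quantifier .*? to match the start and the end: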

import re

regex = r"\n.*?(keyword 1).*?(keyword 2).*?$"

test_str = ("garbage I don't need - garbage I don't need\n"
    "timestamp - date - server info - 'keyword 1' - data\n"
    "more data more data more data more data\n"
    "more data more data more data more data\n"
    "more data more data 'keyword 2' - last bit of data\n"
    "garbage I don't need - garbage I don't need")

matches = re.finditer(regex, test_str, re.DOTALL | re.MULTILINE)

for match in matches:
    # The whole match (group 0) is the text you want.
    print(match.group())

Output:

timestamp - date - server info - 'keyword 1' - data 
more data more data more data more data
more data more data more data more data
more data more data 'keyword 2' - last bit of data

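You may need to strip('\n') the result, because the match begins with the newline that precedes the 'keyword 1' line.

You can try it here: https://regex101.com/r/HWIALZ/1 (the page also includes an explanation of the pattern). In short:

\n             newline
.*?            as few characters as possible (any character, including newlines, because of re.DOTALL)
(keyword 1)    literal text; the () are only needed if you want the group
.*?            as few characters as possible
(keyword 2)    literal text; again, the () are not needed
.*?            as few characters as possible
$              end of line (because of re.MULTILINE)

I included the () for clarity; if you do not evaluate the groups, remove them.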
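One caveat with the pattern above: because re.DOTALL lets `.` cross line breaks, the leading `\n.*?` can start at any newline that comes before 'keyword 1'. If several unwanted lines precede the keyword line, the match includes every line after the first one, and the pattern cannot match at all when 'keyword 1' sits on the very first line. The following is a minimal sketch of a variant that anchors on line boundaries instead. It assumes the same sample keywords; the name extract_blocks and the sample text are hypothetical and not part of the original answer.

```python
import re

# ^ and $ anchor at line boundaries (re.MULTILINE). [^\n]* stays on one line,
# while the lazy .*? in the middle may cross lines (re.DOTALL).
BLOCK_RE = re.compile(
    r"^[^\n]*keyword 1.*?keyword 2[^\n]*$",
    re.DOTALL | re.MULTILINE,
)


def extract_blocks(text):
    """Return each block from a 'keyword 1' line through the next 'keyword 2' line."""
    return [m.group() for m in BLOCK_RE.finditer(text)]


# Invented sample with two unwanted lines before the block.
sample = (
    "garbage line one\n"
    "garbage line two\n"
    "timestamp - 'keyword 1' - data\n"
    "more data\n"
    "more data 'keyword 2' - last bit of data\n"
    "garbage line three\n"
)
for block in extract_blocks(sample):
    print(block)
```

Like any approach that calls read() on the whole file, this keeps the full text in memory, so it is better suited to logs that fit comfortably in RAM.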

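Answer 1 (score: 1)

The following is fast for files of any size. It extracted nearly 2 million lines from a 2.5-million-line log file in 3 seconds. The extracted portion was at the end of the file.

If your file might not fit in available memory, I do not recommend using lists, regular expressions, or other in-memory techniques.

Test text file startstop_text: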

line 1 this should not appear in output
line 2 keyword1
line 3 appears in output
line 4 keyword2
line 5 this should not appear in output

Code:

from itertools import dropwhile


def keepuntil(contains_end_keyword, lines):
    for line in lines:
        yield line
        if contains_end_keyword(line):
            break


with open('startstop_text', 'r') as f:
    from_start_line = dropwhile(lambda line: 'keyword1' not in line, f)
    extracted = keepuntil(lambda line: 'keyword2' in line, from_start_line)
    for line in extracted:
        print(line.rstrip())


$ python startstop.py
line 2 keyword1
line 3 appears in output
line 4 keyword2
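The code above stops after the first keyword1 ... keyword2 block. If a log can contain several such blocks, the same line-by-line idea can be extended with a small state variable. This is a minimal sketch, assuming blocks do not nest; the name iter_blocks and the sample lines are hypothetical and not part of the original answer.

```python
def iter_blocks(lines, start_kw, end_kw):
    """Yield each block as a list of lines, from a line containing start_kw
    through the next line containing end_kw (both inclusive)."""
    block = None  # None means we are currently outside a block
    for line in lines:
        if block is None and start_kw in line:
            block = []  # found a start line: begin collecting
        if block is not None:
            block.append(line)
            if end_kw in line:
                yield block
                block = None  # wait for the next start line


# Invented sample; any iterable of lines works, including an open file.
sample = [
    "line 1 noise\n",
    "line 2 keyword1\n",
    "line 3 data\n",
    "line 4 keyword2\n",
    "line 5 noise\n",
    "line 6 keyword1 and keyword2\n",
]
for block in iter_blocks(sample, "keyword1", "keyword2"):
    print("".join(block), end="")
```

Passing an open file object as lines keeps memory use to one block at a time. In this sketch a block that never reaches the end keyword is silently discarded.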

Answer 2 (score: -1)

None of the other answers worked for me, but I was able to figure it out using a regular expression.

for match in re.findall(r".*keyword1[\s\S]*?keyword2:[\s\S]*?keyword3.*", log_file.read()):
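This pattern works without re.DOTALL because `.` never matches a newline by default, so the leading and trailing `.*` only extend the match to the ends of the first and last lines, while `[\s\S]*?` lazily matches any character, newlines included. The following is a minimal two-keyword sketch of the same idea; the answer does not explain its third keyword, so it is left out here, and the sample text is invented.

```python
import re

# Without re.DOTALL, '.' stops at line breaks, so the leading and trailing
# '.*' cover only the rest of the first and last lines of the block.
pattern = re.compile(r".*keyword1[\s\S]*?keyword2.*")

log_text = (
    "noise\n"
    "ts - keyword1 - data\n"
    "more data\n"
    "more keyword2 - tail\n"
    "noise\n"
)
matches = pattern.findall(log_text)
for match in matches:
    print(match)
```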