Include all lines between the first and last occurrence

Asked: 2018-07-23 14:37:50

Tags: python python-2.7

I have a txt file whose contents look like this:

    [2018-07-11 20:57:08] SYSTEM RESPONSE: "hello"
    [2018-07-11 20:57:19] USER INPUT (xvp_dev-0): "hi! how is it going?"
    [2018-07-11 20:57:19] SYSTEM RESPONSE: "It's going pretty good. 
     How about you?"
    [2018-07-11 14:05:20] USER INPUT (xvp_dev-0): I've been doing good too!

    Thank you.
    [2018-07-12 14:05:20] SYSTEM RESPONSE: "Hello!"
    How is your day going today?
    [2018-07-12 14:05:34] USER INPUT (xvp_dev-0): "Great! Can't complain"
    [2018-07-12 14:05:34] SYSTEM RESPONSE: "Okay. 
    That's good"

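Now I want every line from the first occurrence of [2018-07-11] to the last one, including all the lines in between. Currently I just find all lines that start with [2018-07-11 ... and display them, but as you may notice, the few lines that sit between them and carry no timestamp get lost.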

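x = ""
for line in file:
    if b in line:  # b = date string entered by the user
        x = x + "//" + line[11:]  # drop the leading "[YYYY-MM-DD" prefix
    # lines that do not contain the date (continuation lines) are skipped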

The desired output would look something like this. For date 2018-07-11:

20:57:08] SYSTEM RESPONSE: "hello"
20:57:19] USER INPUT (xvp_dev-0): "hi! how is it going?"
20:57:19] SYSTEM RESPONSE: "It's going pretty good. 
How about you?"
14:05:20] USER INPUT (xvp_dev-0): I've been doing good too!
Thank you.

For date 2018-07-12:

14:05:20] SYSTEM RESPONSE: "Hello!"
How is your day going today?
14:05:34] USER INPUT (xvp_dev-0): "Great! Can't complain"
14:05:34] SYSTEM RESPONSE: "Okay. 
That's good"

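Any ideas on how I can get the in-between lines as well? Everything is keyed on the date, so the same date cannot show up again later in the text.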

2 answers:

Answer 0 (score: 5)

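You can use a regular expression to parse the lines. I wrote a function find_lines_by_date() that takes a date string and returns a list of the entries with that date, continuation lines included: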

data = """
    [2018-07-11 20:57:08] SYSTEM RESPONSE: "hello"
    [2018-07-11 20:57:19] USER INPUT (xvp_dev-0): "hi! how is it going?"
    [2018-07-11 20:57:19] SYSTEM RESPONSE: "It's going pretty good.
     How about you?"
    [2018-07-11 14:05:20] USER INPUT (xvp_dev-0): I've been doing good too!

    Thank you.
    [2018-07-12 14:05:20] SYSTEM RESPONSE: "Hello!"
    How is your day going today?
    [2018-07-12 14:05:34] USER INPUT (xvp_dev-0): "Great! Can't complain"
    [2018-07-12 14:05:34] SYSTEM RESPONSE: "Okay.
    That's good"
"""

import re
import pprint

def find_lines_by_date(date='2018-07-11'):
    rv = []
    groups = re.findall(r'(\[(.*?)\s+.*?\][^\[]+)', data)
    for g in groups:
        if g[-1] == date:
            rv.append(g[0].strip())
    return rv


pprint.pprint(find_lines_by_date(date='2018-07-12'))

This prints:

['[2018-07-12 14:05:20] SYSTEM RESPONSE: "Hello!"\n'
 '    How is your day going today?',
 '[2018-07-12 14:05:34] USER INPUT (xvp_dev-0): "Great! Can\'t complain"',
 '[2018-07-12 14:05:34] SYSTEM RESPONSE: "Okay.\n    That\'s good"']

Edit:

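The regex (\[(.*?)\s+.*?\][^\[]+) matches each log entry and captures two groups per match: the first group holds the whole entry (the timestamped line plus any lines that follow it up to the next "["), which is what the function returns, and the second group holds the date, which is used for the comparison. Note that [^\[]+ stops at the next "[", so an entry is cut short if the message text itself contains a "[".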

A simple example on an external site explains it in more detail.
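To make the two-group behaviour concrete, here is a minimal sketch of what re.findall returns for that pattern. The two-entry `sample` string is made up for illustration and only mimics the log format from the question.

```python
import re

# Hypothetical two-entry sample in the same format as the log above.
sample = (
    '[2018-07-11 20:57:08] SYSTEM RESPONSE: "hello"\n'
    '[2018-07-12 14:05:20] SYSTEM RESPONSE: "Hello!"\n'
    'How is your day going today?\n'
)

pattern = r'(\[(.*?)\s+.*?\][^\[]+)'

# With two capture groups, re.findall returns one (entry, date) tuple per match.
matches = re.findall(pattern, sample)

for entry, date in matches:
    print(date, repr(entry.strip()))
```

The second tuple carries the continuation line inside its first element, which is why find_lines_by_date() can compare on the date and still return the full multi-line entry.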

Answer 1 (score: 0)

You can parse the data with re.findall and then group it with itertools.groupby:

import itertools
import re

# content is assumed to hold the whole log file as a single string
dates = re.findall(r'\[.*?\]', content)
content = [re.findall(r'(?<=:)[\w\W]+', i) for i in re.sub(r'\[.*?\]', '*', content).split('*')]
final_content = [re.sub(r'\n+|\s{2,}', '', ''.join(i)) for i in content if i]
d = list(zip(dates, final_content))
new_d = [[a, list(b)] for a, b in itertools.groupby(sorted(d, key=lambda x:re.findall(r'\d+\-\d+\-\d+', x[0])[0]), key=lambda x:re.findall(r'\d+\-\d+\-\d+', x[0])[0])]
final_result = {a:[c for _, c in b] for a, b in new_d}

Output:

{'2018-07-12': [' "Hello!"How is your day going today?', 
                ' "Great! Can\'t complain"', 
                ' "Okay.That\'s good"'], 
 '2018-07-11': [' "hello"', 
                ' "hi! how is it going?"', 
                ' "It\'s going pretty good.How about you?"', 
                " I've been doing good too!Thank you."]}

All the responses found for each date are now collected in a list, stored as a value in the dictionary with the date itself as the key.
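A similar date-keyed dictionary can also be built in a single pass over the lines, without a multi-line regex. This is not what either answer does; it is a rough sketch of a different technique. It remembers the date of the most recent timestamped line and files every following line under that date until the next timestamp appears. The names `group_lines_by_date`, `STAMP`, and `sample_lines` are hypothetical, and the code assumes each entry starts with "[YYYY-MM-DD " as in the question.

```python
import re
from collections import OrderedDict

# Assumed line format: "[YYYY-MM-DD HH:MM:SS] ..." marks the start of an entry.
STAMP = re.compile(r'^\s*\[(\d{4}-\d{2}-\d{2}) ')

def group_lines_by_date(lines):
    """Map each date to every line of its entries, continuation lines included."""
    by_date = OrderedDict()
    current = None  # date of the most recent timestamped line
    for line in lines:
        match = STAMP.match(line)
        if match:
            current = match.group(1)
        # Blank lines are dropped; untimestamped lines inherit the current date.
        if current is not None and line.strip():
            by_date.setdefault(current, []).append(line.strip())
    return by_date

sample_lines = [
    '[2018-07-11 20:57:19] SYSTEM RESPONSE: "It\'s going pretty good.',
    ' How about you?"',
    '[2018-07-12 14:05:20] SYSTEM RESPONSE: "Hello!"',
    'How is your day going today?',
]
result = group_lines_by_date(sample_lines)
```

Because the function only iterates over its argument, an open file object can be passed in place of the list. Unlike the regex approach, a "[" inside a message does not end the entry early here, as long as it is not followed by something that looks like a date at the start of a line.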