I have a txt file whose contents look like this:
[2018-07-11 20:57:08] SYSTEM RESPONSE: "hello"
[2018-07-11 20:57:19] USER INPUT (xvp_dev-0): "hi! how is it going?"
[2018-07-11 20:57:19] SYSTEM RESPONSE: "It's going pretty good.
How about you?"
[2018-07-11 14:05:20] USER INPUT (xvp_dev-0): I've been doing good too!
Thank you.
[2018-07-12 14:05:20] SYSTEM RESPONSE: "Hello!"
How is your day going today?
[2018-07-12 14:05:34] USER INPUT (xvp_dev-0): "Great! Can't complain"
[2018-07-12 14:05:34] SYSTEM RESPONSE: "Okay.
That's good"
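Now I want every line from the first occurrence of [2018-07-11] to the last one, including all the lines in between. At the moment I just find all lines that start with [2018-07-11 ... and display them, but as you can see, the few lines in between that carry no timestamp get lost.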
for line in file:
    if b in line:  # b = date entered by the user
        x = x + "//" + line[11:]
    else:
        x = x
The sample output would look like this. For the date 2018-07-11:
20:57:08] SYSTEM RESPONSE: "hello"
20:57:19] USER INPUT (xvp_dev-0): "hi! how is it going?"
20:57:19] SYSTEM RESPONSE: "It's going pretty good.
How about you?"
14:05:20] USER INPUT (xvp_dev-0): I've been doing good too!
Thank you.
For the date 2018-07-12:
14:05:20] SYSTEM RESPONSE: "Hello!"
How is your day going today?
14:05:34] USER INPUT (xvp_dev-0): "Great! Can't complain"
14:05:34] SYSTEM RESPONSE: "Okay.
That's good"
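Any ideas on how I could also get the lines in between? The grouping depends entirely on the date, and a date cannot show up again later in the text.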
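Answer 0 (score: 5)
You can use a regular expression to parse the lines. I wrote a function, find_lines_by_date(), that takes a date string and returns a list of the entries with that date: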
data = """
[2018-07-11 20:57:08] SYSTEM RESPONSE: "hello"
[2018-07-11 20:57:19] USER INPUT (xvp_dev-0): "hi! how is it going?"
[2018-07-11 20:57:19] SYSTEM RESPONSE: "It's going pretty good.
How about you?"
[2018-07-11 14:05:20] USER INPUT (xvp_dev-0): I've been doing good too!
Thank you.
[2018-07-12 14:05:20] SYSTEM RESPONSE: "Hello!"
How is your day going today?
[2018-07-12 14:05:34] USER INPUT (xvp_dev-0): "Great! Can't complain"
[2018-07-12 14:05:34] SYSTEM RESPONSE: "Okay.
That's good"
"""
import re
import pprint

def find_lines_by_date(date='2018-07-11'):
    rv = []
    # Each match is a tuple: (whole entry including continuation lines, date)
    groups = re.findall(r'(\[(.*?)\s+.*?\][^\[]+)', data)
    for g in groups:
        if g[-1] == date:
            rv.append(g[0].strip())
    return rv

pprint.pprint(find_lines_by_date(date='2018-07-12'))
This prints:
['[2018-07-12 14:05:20] SYSTEM RESPONSE: "Hello!"\n'
' How is your day going today?',
'[2018-07-12 14:05:34] USER INPUT (xvp_dev-0): "Great! Can\'t complain"',
'[2018-07-12 14:05:34] SYSTEM RESPONSE: "Okay.\n That\'s good"']
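Edit:
The regular expression (\[(.*?)\s+.*?\][^\[]+) matches every entry and yields two groups per match: the first group holds the whole entry (all the lines to return), and the second group holds the date used for the comparison.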
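To make the two groups concrete, here is a minimal sketch that runs the same pattern over a small sample. The `sample` string and the variable names are made up for illustration; only the regular expression comes from the answer.

```python
import re

# Hypothetical two-entry sample in the same format as the log above.
sample = (
    '[2018-07-11 20:57:19] SYSTEM RESPONSE: "It\'s going pretty good.\n'
    'How about you?"\n'
    '[2018-07-12 14:05:20] SYSTEM RESPONSE: "Hello!"\n'
)

# Same pattern as in the answer: group 1 is the whole entry, group 2 the date.
pattern = re.compile(r'(\[(.*?)\s+.*?\][^\[]+)')

# With two capture groups, findall returns one (entry, date) tuple per match.
matches = pattern.findall(sample)
for entry, date in matches:
    print(date, repr(entry))
```

Because `[^\[]+` stops at the next `[`, the continuation line stays attached to its entry. The same property means a message that itself contains a `[` would be cut short at that character, so this pattern is probably only safe for logs where brackets appear nowhere but in the timestamps.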
Answer 1 (score: 0)
You can use re.findall to parse the data and then group the results with itertools.groupby:
import itertools
import re

# content is assumed to hold the full text of the log file
dates = re.findall(r'\[.*?\]', content)
content = [re.findall(r'(?<=:)[\w\W]+', i) for i in re.sub(r'\[.*?\]', '*', content).split('*')]
final_content = [re.sub(r'\n+|\s{2,}', '', ''.join(i)) for i in content if i]
d = list(zip(dates, final_content))
# Extract the YYYY-MM-DD part of a timestamp; used as the sort and group key
get_date = lambda x: re.findall(r'\d+\-\d+\-\d+', x[0])[0]
new_d = [[a, list(b)] for a, b in itertools.groupby(sorted(d, key=get_date), key=get_date)]
final_result = {a: [c for _, c in b] for a, b in new_d}
Output:
{'2018-07-12': [' "Hello!"How is your day going today?',
' "Great! Can\'t complain"',
' "Okay.That\'s good"'],
'2018-07-11': [' "hello"',
' "hi! how is it going?"',
' "It\'s going pretty good.How about you?"',
" I've been doing good too!Thank you."]}
Now all the responses found for each date are collected in a list, stored as a value in a dictionary keyed by the date itself.
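Both answers work on text that is already in memory. As a sketch of a line-by-line variant closer to the loop in the question, the code below remembers the date of the most recent timestamped line and files every untimestamped line under it. The names `ENTRY_START`, `group_lines_by_date`, and `log_text` are hypothetical, and it assumes every new entry starts with a `[YYYY-MM-DD ` prefix.

```python
import re
from collections import defaultdict

# Matches the "[YYYY-MM-DD " prefix that is assumed to start every new entry.
ENTRY_START = re.compile(r'\[(\d{4}-\d{2}-\d{2}) ')

def group_lines_by_date(lines):
    """Map each date to its lines, keeping continuation lines with their entry."""
    grouped = defaultdict(list)
    current_date = None
    for line in lines:
        line = line.rstrip('\n')
        m = ENTRY_START.match(line)
        if m:
            current_date = m.group(1)
            # Drop the "[YYYY-MM-DD " prefix, as in the question's sample output.
            grouped[current_date].append(line[m.end():])
        elif current_date is not None and line:
            # No timestamp: the line belongs to the most recent entry.
            grouped[current_date].append(line)
    return dict(grouped)

# Hypothetical sample in the same format as the log above.
log_text = '''[2018-07-11 20:57:19] SYSTEM RESPONSE: "It's going pretty good.
How about you?"
[2018-07-12 14:05:20] SYSTEM RESPONSE: "Hello!"
How is your day going today?
'''

result = group_lines_by_date(log_text.splitlines())
for date, entries in result.items():
    print(date)
    for entry in entries:
        print('   ', entry)
```

Since the function only iterates over its argument, an open file object can be passed in place of `log_text.splitlines()`. Unlike the whitespace-stripping approach above, this keeps each continuation line as its own list item.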