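I have a log file in which each entry starts with a date/time, followed by a varying number of lines before the next date/time.
For example:
Time/date
18/07/2 13:55:00.983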
import pyparsing

# one to three digits of milliseconds
msecVal = pyparsing.Word(pyparsing.nums, max=3)
# exactly two digits
numPair = pyparsing.Word(pyparsing.nums, exact=2)
dateStr = pyparsing.Combine(numPair + '/' + numPair + '/' + numPair)
timeString = pyparsing.Combine(numPair + ':' + numPair + ':' + numPair
                               + '.' + msecVal)
The log file would look like this:
time:date: line of text
possible 2nd line of text
possible 3rd line of text...
time:date: line of text
time:date: line of text
possible 2nd line of text
possible 3rd line of text...
possible <n> line of text...
time:date: line of text
The input will be a large text log file in the format above. I want to produce a list of grouped elements:
[[time],[all text until next time]],[[time],[all text until next time]...
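I can do this when every time/date entry is a single line. What I am having trouble with is an entry that spans a random number of lines before the next time/date entry.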
Answer 0 (score: 0)
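Here is how I interpret your definition of a log entry:
"A date-time at the start of a line, followed by a colon, followed by everything up to the next date-time at the start of a line, even if there might be date-times embedded in the lines."
There are two pyparsing features you need to handle this:
LineStart - to distinguish date-times at the start of a line from those in the body of a line
SkipTo - a quick way to skip over unstructured text until a matching expression is found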
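As a small illustration of SkipTo on its own, here is a minimal sketch. The `END` marker, the input string, and the names `end_marker` and `expr` are made up for this example and are not part of the log format:

```python
import pyparsing as pp

# Hypothetical marker that terminates the unstructured text.
end_marker = pp.Literal("END")

# SkipTo collects everything before the first position where end_marker matches.
expr = pp.SkipTo(end_marker)("skipped") + end_marker

result = expr.parseString("free-form text, 12:34 and more END")
# The skipped text may carry trailing whitespace, so strip it before use.
print(repr(result["skipped"].strip()))
```

The skipped text is returned as a single string token, which is what makes SkipTo convenient for multi-line bodies of unknown length.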
I added these expressions to your code (I import pyparsing as "pp" because I am a lazy typist):
import pyparsing as pp

dateTime = dateStr + timeString

# log entry date-time keys only match if they are at the start of the line
dateTimeKey = pp.LineStart() + dateTime

# define a log entry as a date-time key, followed by everything up to the next
# date-time key, or to the end of the input string
# (use results names to make it easy to get at the parts of the log entry)
logEntry = pp.Group(dateTimeKey("time") + ':' + pp.Empty()
                    + pp.SkipTo(dateTimeKey | pp.StringEnd())("body"))
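The results names "time" and "body" let you read each part of an entry by name. Here is a minimal sketch of that mechanism on a simplified, single-line grammar. The name `lineEntry` is hypothetical, `pp.restOfLine` stands in for the SkipTo expression, and `numPair` uses `max=2` so that a single-digit field is accepted:

```python
import pyparsing as pp

# Date/time pieces; max=2 so that a one-digit field such as "2" is accepted.
numPair = pp.Word(pp.nums, max=2)
msecVal = pp.Word(pp.nums, max=3)
dateStr = pp.Combine(numPair + '/' + numPair + '/' + numPair)
timeString = pp.Combine(numPair + ':' + numPair + ':' + numPair + '.' + msecVal)
dateTime = dateStr + timeString

# Simplified single-line stand-in for logEntry (illustration only).
lineEntry = pp.Group(dateTime("time") + ':' + pp.restOfLine("body"))

entry = lineEntry.parseString("2/07/18 13:55:00.983: line of text")[0]
print(list(entry["time"]))     # the date token and the time token
print(entry["body"].strip())   # the text after the colon
```

This only covers one-line entries; the multi-line case is what the SkipTo-based logEntry is for.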
I converted your sample to use different date-times for testing, and we get:
sample = """\
2/07/18 13:55:00.983: line of text
possible 2nd line of text
possible 3rd line of text...
2/07/19 13:55:00.983: line of text
2/07/20 13:55:00.983: line of text
possible 2nd line of text
possible 3rd line of text...
possible <n> line of text...
2/07/21 13:55:00.983: line of text
"""
print(pp.OneOrMore(logEntry).parseString(sample).dump())
Gives:
[['2/07/18', '13:55:00.983', ':', 'line of text\n possible 2nd line of text\n possible 3rd line of text...\n 2/07/19 13:55:00.983: line of text'], ['2/07/20', '13:55:00.983', ':', 'line of text\n possible 2nd line of text\n possible 3rd line of text...\n possible <n> line of text...'], ['2/07/21', '13:55:00.983', ':', 'line of text']]
[0]:
['2/07/18', '13:55:00.983', ':', 'line of text\n possible 2nd line of text\n possible 3rd line of text...\n 2/07/19 13:55:00.983: line of text']
- body: 'line of text\n possible 2nd line of text\n possible 3rd line of text...\n 2/07/19 13:55:00.983: line of text'
- time: ['2/07/18', '13:55:00.983']
[1]:
['2/07/20', '13:55:00.983', ':', 'line of text\n possible 2nd line of text\n possible 3rd line of text...\n possible <n> line of text...']
- body: 'line of text\n possible 2nd line of text\n possible 3rd line of text...\n possible <n> line of text...'
- time: ['2/07/20', '13:55:00.983']
[2]:
['2/07/21', '13:55:00.983', ':', 'line of text']
- body: 'line of text'
- time: ['2/07/21', '13:55:00.983']
I also had to change your numPair to:
numPair = pp.Word(pp.nums, max=2)
Otherwise, it would not match the leading single digit "2" in your sample date.
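The difference between `exact=2` and `max=2` can be checked directly. This is a minimal sketch; the names `exactPair` and `flexPair` and the helper `first_token` are made up for the comparison:

```python
import pyparsing as pp

exactPair = pp.Word(pp.nums, exact=2)  # requires exactly two digits
flexPair = pp.Word(pp.nums, max=2)     # accepts one or two digits


def first_token(expr, text):
    """Return the first token if expr matches at the start of text, else None."""
    try:
        return expr.parseString(text)[0]
    except pp.ParseException:
        return None


print(first_token(exactPair, "2/07/18"))  # None: only one digit before the '/'
print(first_token(flexPair, "2/07/18"))   # 2
print(first_token(flexPair, "18/07/2"))   # 18
```

If your real log always pads fields to two digits, `exact=2` remains the stricter choice.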