I have a text file that I need to analyze. Each line in the file has the following form:
7:06:32 (slbfd) IN: "lq_viz_server" aqeela@nabltas1
7:08:21 (slbfd) UNSUPPORTED: "Slb_Internal_vlsodc" (PORT_AT_HOST_PLUS ) Albahraj@nabwmps3 (License server system does not support this feature. (-18,327))
7:08:21 (slbfd) OUT: "OFM32" Albahraj@nabwmps3
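I need to skip the timestamp and the (slbfd) field, and keep a count of only the IN and OUT lines. In addition, based on the name in quotes, I need to increment a separate counter for each name if the line is an OUT line, and decrement it otherwise. How would I do this in Python?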
Answer 0 (score: 5)
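The other answers, which use regular expressions or split the line, will get the job done, but if you want a fully maintainable solution that can grow with your needs, you should build a grammar. I love pyparsing for this: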
S ='''
7:06:32 (slbfd) IN: "lq_viz_server" aqeela@nabltas1
7:08:21 (slbfd) UNSUPPORTED: "Slb_Internal_vlsodc" (PORT_AT_HOST_PLUS ) Albahraj@nabwmps3 (License server system does not support this feature. (-18,327))
7:08:21 (slbfd) OUT: "OFM32" Albahraj@nabwmps3'''
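from collections import defaultdict
# Import only the names the grammar uses instead of a wildcard import
from pyparsing import (Group, Literal, OneOrMore, QuotedString, Word,
                       alphas, nums, restOfLine)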
# Define the grammar
num = Word(nums)
marker = Literal(":").suppress()
timestamp = Group(num + marker + num + marker + num)
label = Literal("(slbfd)")
flag = Word(alphas)("flag") + marker
name = QuotedString(quoteChar='"')("name")
line = timestamp + label + flag + name + restOfLine
grammar = OneOrMore(Group(line))
# Now parsing is a piece of cake!
P = grammar.parseString(S)
counts = defaultdict(int)
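# Sign convention here: IN adds, OUT subtracts.
# The question asks for the reverse; swap the signs if needed.
for x in P:
    if x.flag == "IN":
        counts[x.name] += 1
    elif x.flag == "OUT":
        counts[x.name] -= 1
for key in counts:
    print(key, counts[key])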
This gives the output:
lq_viz_server 1
OFM32 -1
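This would be more impressive if your sample log file were longer. The advantage of the pyparsing solution is that it can adapt to more complex queries in the future (e.g. grabbing and parsing the timestamp, pulling out the email address, parsing the error codes...). The idea is that you write the grammar independently of the query: you simply convert the raw text into a computer-friendly format, keeping the parsing implementation separate from how the parsed data is used.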
Answer 1 (score: 1)
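You have two options:
1. Use the .split() method of the string (as noted in the comments).
2. Use the re module for regular expressions.
I would suggest using the re module and creating a pattern with named groups.
Recipe:
1. Create a pattern containing named groups with re.compile().
2. Use a for loop to go through the lines.
3. Apply .match() to each line.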
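The recipe above could be sketched roughly as follows. This is only a minimal illustration, not the answerer's own code: the names LINE_RE, count_checkouts and sample_lines are made up for the example, the pattern assumes every relevant line carries the fixed "(slbfd)" label, and the sign convention follows the question (OUT adds one, IN subtracts one).

```python
import re
from collections import Counter

# Named groups: "flag" is the event type, "name" is the quoted feature name.
# Assumption: the label is always "(slbfd)", as in the sample lines.
LINE_RE = re.compile(
    r'^\s*\d+:\d+:\d+\s+\(slbfd\)\s+(?P<flag>IN|OUT):\s+"(?P<name>[^"]+)"'
)

def count_checkouts(lines):
    """Return a Counter with +1 per OUT line and -1 per IN line, keyed by name."""
    counts = Counter()
    for line in lines:
        m = LINE_RE.match(line)
        if m is None:
            continue  # UNSUPPORTED and any other line types are skipped
        counts[m.group("name")] += 1 if m.group("flag") == "OUT" else -1
    return counts

# Hypothetical usage with the lines from the question
sample_lines = [
    '7:06:32 (slbfd) IN: "lq_viz_server" aqeela@nabltas1',
    '7:08:21 (slbfd) UNSUPPORTED: "Slb_Internal_vlsodc" (PORT_AT_HOST_PLUS ) '
    'Albahraj@nabwmps3 (License server system does not support this feature. (-18,327))',
    '7:08:21 (slbfd) OUT: "OFM32" Albahraj@nabwmps3',
]
print(count_checkouts(sample_lines))
```

Restricting the flag group to IN|OUT means other event types never match, so no separate filtering step is needed.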
Answer 2 (score: 1)
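Assuming the file is divided into lines (I don't know whether that is true), you have to apply the split() function to each line. You will get this:
["7:06:32", "(slbfd)", "IN:", "\"lq_viz_server\"", "aqeela@nabltas1"]
Note that split() leaves the double quotes around the name, so strip them before using it. After that, you should be able to apply whatever logic you need to compare the values.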
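One possible way to turn the split() idea into a counter is sketched below. It is an illustration under assumptions, not a definitive implementation: the function name tally and the variable log_lines are invented for the example, quoted names are assumed to contain no spaces, and the sign convention follows the question (OUT adds one, IN subtracts one).

```python
from collections import defaultdict

def tally(lines):
    """Count per name using whitespace splitting: OUT adds 1, IN subtracts 1."""
    counts = defaultdict(int)
    for line in lines:
        tokens = line.split()
        # Expected layout: [timestamp, "(slbfd)", "IN:" or "OUT:", '"name"', ...]
        if len(tokens) < 4 or tokens[2] not in ("IN:", "OUT:"):
            continue  # skip UNSUPPORTED, blank, or malformed lines
        name = tokens[3].strip('"')  # split() keeps the quotes, so remove them
        counts[name] += 1 if tokens[2] == "OUT:" else -1
    return dict(counts)

# Hypothetical usage with two lines from the question
log_lines = [
    '7:06:32 (slbfd) IN: "lq_viz_server" aqeela@nabltas1',
    '7:08:21 (slbfd) OUT: "OFM32" Albahraj@nabwmps3',
]
print(tally(log_lines))
```

This is shorter than a regular expression but more fragile: a name containing a space would be cut at the first space.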
Answer 3 (score: 1)
I made some wild assumptions about your specification; here is some sample code to get you started:
objects = {}
with open("data.txt") as data:
    for line in data:
        if "IN:" in line or "OUT:" in line:
            try:
                name = line.split("\"")[1]
            except IndexError:
                print("No double quoted name on line: {}".format(line))
                name = "PARSING_ERRORS"
            if "OUT:" in line:
                diff = 1
            else:
                diff = -1
            try:
                objects[name] += diff
            except KeyError:
                objects[name] = diff
print(objects)  # for debug only, not advisable to print huge number of names
Answer 4 (score: 0)
In the spirit of just getting it done with the standard library, this works:
import re
from collections import Counter
# "data.txt" is a placeholder; open your own log file as inF
count = Counter()
with open("data.txt") as inF:
    for line in inF:
        match = re.match(r'\d+:\d+:\d+ \(slbfd\) (\w+): "(\w+)"', line)
        if match:
            if match.group(1) == 'IN':
                count[match.group(2)] += 1
            elif match.group(1) == 'OUT':
                count[match.group(2)] -= 1
print(count)
Prints:
Counter({'lq_viz_server': 1, 'OFM32': -1})