我正在解析许多大文件,并希望确保尽可能高效地完成它。我正在解析的其中一行看起来像这样(Windows安全事件日志4624):
Security/Microsoft-Windows-Security-Auditing ID [4624] :EventData/Data -> SubjectUserSid = S-1-0-0 SubjectUserName = - SubjectDomainName = - SubjectLogonId = 0x0000000000000000 TargetUserSid = S-1-1-11-1111111111-1111111111-1111111111-1111 TargetUserName = johndoe TargetDomainName = TestDomain TargetLogonId = 0x0000000001111111 LogonType = 3 LogonProcessName = NtLmSsp AuthenticationPackageName = NTLM WorkstationName = TestWorkstation LogonGuid = {00000000-0000-0000-0000-000000000000} TransmittedServices = - LmPackageName = NTLM V2 KeyLength = 128 ProcessId = 0x0000000000000000 ProcessName = - IpAddress = 1.1.1.1 IpPort = 11111
我想知道的是,从该行中提取多个字段的最有效方法是什么?我可以反复分割线,直到我到达我感兴趣的每个区域,但我觉得反复循环线是浪费时间/资源。
是否有一种智能的方式来查看该行只有一次但是拉出来,例如,以下字段:
LogonType = 3
TargetUserName = johndoe
TargetUserSid = S-1-1-11-1111111111-1111111111-1111111111-1111
举个例子,我能做的是重复以下过程:
part = line.partition('TargetUserName = ')[2]
username = part.partition(' ')[0]
获取我想要的每个字段(上面的示例仅为我提供用户名),但对我来说再次感觉效率低下。
有没有更好的方法来处理它?</ p>
答案 0 :(得分:2)
每个字段名称都是一组大写和小写字符。它们的值与=
分开。每个值都是一组非空白字符。您可以将re.findall
与匹配的组一起使用来查找所有&#34;字母=非空白&#34;实例。这将为您提供list
tuple
个,您可以保存或迭代并传递给格式字符串:
>>> s = '''Security/Microsoft-Windows-Security-Auditing ID [4624] :EventData/Data -> SubjectUserSid = S-1-0-0 SubjectUserName = - SubjectDomainName = - SubjectLogonId = 0x0000000000000000 TargetUserSid = S-1-1-11-1111111111-1111111111-1111111111-1111 TargetUserName = johndoe TargetDomainName = TestDomain TargetLogonId = 0x0000000001111111 LogonType = 3 LogonProcessName = NtLmSsp AuthenticationPackageName = NTLM WorkstationName = TestWorkstation LogonGuid = {00000000-0000-0000-0000-000000000000} TransmittedServices = - LmPackageName = NTLM V2 KeyLength = 128 ProcessId = 0x0000000000000000 ProcessName = - IpAddress = 1.1.1.1 IpPort = 11111 '''
>>> import re
>>> for item in re.findall(r'([A-Za-z]+) = (\S+)', s):
... print('{} = {}'.format(*item))
...
SubjectUserSid = S-1-0-0
SubjectUserName = -
SubjectDomainName = -
SubjectLogonId = 0x0000000000000000
TargetUserSid = S-1-1-11-1111111111-1111111111-1111111111-1111
TargetUserName = johndoe
TargetDomainName = TestDomain
TargetLogonId = 0x0000000001111111
LogonType = 3
LogonProcessName = NtLmSsp
AuthenticationPackageName = NTLM
WorkstationName = TestWorkstation
LogonGuid = {00000000-0000-0000-0000-000000000000}
TransmittedServices = -
LmPackageName = NTLM
KeyLength = 128
ProcessId = 0x0000000000000000
ProcessName = -
IpAddress = 1.1.1.1
IpPort = 11111
您还可以将其转换为字典以便于访问:
>>> d = dict(re.findall(r'([A-Za-z]+) = (\S+)', s))
>>> d['LogonType']
'3'
答案 1 :(得分:1)
st = 'Security/Microsoft-Windows-Security-Auditing ID [4624] :EventData/Data -> SubjectUserSid = S-1-0-0 SubjectUserName = - SubjectDomainName = - SubjectLogonId = 0x0000000000000000 TargetUserSid = S-1-1-11-1111111111-1111111111-1111111111-1111 TargetUserName = johndoe TargetDomainName = TestDomain TargetLogonId = 0x0000000001111111 LogonType = 3 LogonProcessName = NtLmSsp AuthenticationPackageName = NTLM WorkstationName = TestWorkstation LogonGuid = {00000000-0000-0000-0000-000000000000} TransmittedServices = - LmPackageName = NTLM V2 KeyLength = 128 ProcessId = 0x0000000000000000 ProcessName = - IpAddress = 1.1.1.1 IpPort = 11111';
using re module and re.findall you can I think get want you want
import re
li = re.findall(r'LogonType\s*=\s*\d+|TargetUserName\s*=\s*\w+|TargetUserSid\s*=\s*\w-.*?\s',st,re.MULTILINE| re.DOTALL)
>>>li
['TargetUserSid = S-1-1-11-1111111111-1111111111-1111111111-1111 ', 'TargetUserName = johndoe', 'LogonType = 3']