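I am developing a tool that extracts field-value pairs from certain text files. Until now I had been working on a Windows machine. When I tested the tool on Linux, it returned more fields than it did on Windows.
Regular expression: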
([^,;\n\v{}<>\t=:\[\]\"\']+?)[=:][ \t]*(?:\"((?:[^\"\\]|\\.)*)\"|([^\t =:\"\n\v\t\{\}\[\]<>](?!(?!,)\S+[:=])(?:[^\n\v\t\{\}\[\]=:<>](?!(?!,)\S+[:=]))*))
Sample file:
05/02/2011 03:47:12 PM
LogName=Security
SourceName= ##Source_Name##
EventCode=4624
EventType=##Event_Type##
Type=##Type##
ComputerName=##Computer_Name##
TaskCategory=##Task_Category##
OpCode=##OpCode##
RecordNumber=##Record_Number##
Keywords=##Keyword_Success##
Message=An account was successfully logged on.
January 05-11 03:47:12 PM
Subject:
Security ID: ##Domain##\SYSTEM
Account Name: ##Computer_Name##
Account Domain: ##Domain##
Logon ID: 0x##System_Logon_Id##
Jan 27 03:47:12 PM
Logon Information:
Logon Type: ##Logon_Type##
Restricted Admin Mode: ##Restricted_Admin_Mode##
Virtual Account: ##Virtual_Account##
Elevated Token: ##Elevated_Token##
Impersonation Level: ##Impersonation_Level##
New Logon:
Security ID: ##Domain##\##User_Name##
Account Name: ##User_Name##
Account Domain: ##Domain##
Logon ID: 0x##Logon_Id##
Linked Logon ID: ##Linked_Logon_Id##
Network Account Name: ##User_Name2##
Network Account Domain: ##Domain2##
Logon ##GUID##: ##Logon_Guid##
Process Information:
Process ID: 0x##Process_Id##
Process Name: ##Process_Name##
Network Information:
Workstation Name: ##Computer_Name##
Source Network Address: ##Network_Ip##
Source Port: ##Network_Port##
Detailed Authentication Information:
Logon Process: ##Logon_Process##
Authentication Package: ##Authentication_Package##
Transited Services: ##Transited_Services##
Package Name (NTLM only): ##Package_Name##
Key Length: ##Key_Length##
Output of re.findall() with Python 2.7.14 on Windows:
[('time_field', '05/02/2011 03:47:12 PM'), ('time_field', 'January 05-11 03:47:12 PM'), ('time_field', 'Jan 27 03:47:12 PM'), ('LogName', 'Security'), ('SourceName', '##Source_Name##'), ('EventCode', '4624'), ('EventType', '##Event_Type##'), ('Type', '##Type##'), ('ComputerName', '##Computer_Name##'), ('TaskCategory', '##Task_Category##'), ('OpCode', '##OpCode##'), ('RecordNumber', '##Record_Number##'), ('Keywords', '##Keyword_Success##'), ('Message', 'An account was successfully logged on.'), ('Security ID', '##Domain##\\SYSTEM'), ('Account Name', '##Computer_Name##'), ('Account Domain', '##Domain##'), ('Logon ID', '0x##System_Logon_Id##'), ('Logon Type', '##Logon_Type##'), ('Restricted Admin Mode', '##Restricted_Admin_Mode##'), ('Virtual Account', '##Virtual_Account##'), ('Elevated Token', '##Elevated_Token##'), ('Impersonation Level', '##Impersonation_Level##'), ('Security ID', '##Domain##\\##User_Name##'), ('Account Name', '##User_Name##'), ('Account Domain', '##Domain##'), ('Logon ID', '0x##Logon_Id##'), ('Linked Logon ID', '##Linked_Logon_Id##'), ('Network Account Name', '##User_Name2##'), ('Network Account Domain', '##Domain2##'), ('Logon ##GUID##', '##Logon_Guid##'), ('Process ID', '0x##Process_Id##'), ('Process Name', '##Process_Name##'), ('Workstation Name', '##Computer_Name##'), ('Source Network Address', '##Network_Ip##'), ('Source Port', '##Network_Port##'), ('Logon Process', '##Logon_Process##'), ('Authentication Package', '##Authentication_Package##'), ('Transited Services', '##Transited_Services##'), ('Package Name (NTLM only)', '##Package_Name##'), ('Key Length', '##Key_Length##')]
Output with Python 2.7.6 on Linux:
[('time_field', '05/02/2011 03:47:12 PM'), ('time_field', 'January 05-11 03:47:12 PM'), ('time_field', 'Jan 27 03:47:12 PM'), ('LogName', 'Security'), ('SourceName', '##Source_Name##'), ('EventCode', '4624'), ('EventType', '##Event_Type##'), ('Type', '##Type##'), ('ComputerName', '##Computer_Name##'), ('TaskCategory', '##Task_Category##'), ('OpCode', '##OpCode##'), ('RecordNumber', '##Record_Number##'), ('Keywords', '##Keyword_Success##'), ('Message', 'An account was successfully logged on.'), ('Subject', ''), ('Security ID', '##Domain##\\SYSTEM'), ('Account Name', '##Computer_Name##'), ('Account Domain', '##Domain##'), ('Logon ID', '0x##System_Logon_Id##'), ('Logon Information', ''), ('Logon Type', '##Logon_Type##'), ('Restricted Admin Mode', '##Restricted_Admin_Mode##'), ('Virtual Account', '##Virtual_Account##'), ('Elevated Token', '##Elevated_Token##'), ('Impersonation Level', '##Impersonation_Level##'), ('New Logon', ''), ('Security ID', '##Domain##\\##User_Name##'), ('Account Name', '##User_Name##'), ('Account Domain', '##Domain##'), ('Logon ID', '0x##Logon_Id##'), ('Linked Logon ID', '##Linked_Logon_Id##'), ('Network Account Name', '##User_Name2##'), ('Network Account Domain', '##Domain2##'), ('Logon ##GUID##', '##Logon_Guid##'), ('Process Information', ''), ('Process ID', '0x##Process_Id##'), ('Process Name', '##Process_Name##'), ('Network Information', ''), ('Workstation Name', '##Computer_Name##'), ('Source Network Address', '##Network_Ip##'), ('Source Port', '##Network_Port##'), ('Detailed Authentication Information', ''), ('Logon Process', '##Logon_Process##'), ('Authentication Package', '##Authentication_Package##'), ('Transited Services', '##Transited_Services##'), ('Package Name (NTLM only)', '##Package_Name##'), ('Key Length', '##Key_Length##')]
Extra fields produced on Linux but not on Windows:
('Subject', '')
('Logon Information', '')
('New Logon', '')
('Process Information', '')
('Network Information', '')
('Detailed Authentication Information', '')
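This is where I am confused.
Please note: my question is not whether the regex is right or wrong. Debugging the regex could take a lot of time, and I know it is not especially optimized or tidy. I only want to know the reason for the difference and what I should keep in mind.
Update: https://regex101.com/r/rDkBxN/1 gives the same result as Windows.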
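Answer 0 (score: 0)
The problem is not in the code or the regex, but in how line endings differ between text files on Windows and on Linux.
On Windows, the conventional line separator in a text file is CR followed by LF ('\r\n').
On Unix and Linux, the conventional line separator in a text file is LF alone ('\n').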
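To check which convention a given file actually uses, the raw bytes can be inspected. The following is a minimal sketch; the helper name `count_line_endings` and the two sample byte strings are made up for illustration and are not part of the original tool.

```python
def count_line_endings(data):
    """Count CRLF and bare-LF line endings in a bytes object."""
    crlf = data.count(b"\r\n")
    # Every CRLF also contains an LF, so subtract those to get bare LFs.
    bare_lf = data.count(b"\n") - crlf
    return {"crlf": crlf, "lf": bare_lf}


# Hypothetical samples: the same two lines saved with each convention.
windows_sample = b"LogName=Security\r\nSubject:\r\n"
unix_sample = b"LogName=Security\nSubject:\n"

windows_counts = count_line_endings(windows_sample)
unix_counts = count_line_endings(unix_sample)
print(windows_counts, unix_counts)
```

For a real file, the bytes would come from opening it in binary mode ("rb"), so that no newline translation happens before counting.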
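My code first reads the whole file and then extracts the pairs with the given regex, so the problem was reading a Windows file on Linux. Python 2 on Windows translates '\r\n' to '\n' when a file is opened in text mode; on Linux it does no translation, so a '\r' is left at the end of every line. The value part of the regex does not exclude '\r', so a header line such as "Subject:" can presumably match with the stray '\r' as its value.
Replacing '\r\n' with '\n' solved the problem.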
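The effect can be reproduced with a much smaller pattern. The sketch below uses a simplified stand-in regex (an assumption for illustration, not the pattern from the question) whose value part, like the original, requires at least one character and does not exclude '\r'. The names `SIMPLE_PAIR` and `extract_pairs` are hypothetical.

```python
import re

# Simplified stand-in for the question's pattern (NOT the original regex):
# key, then ':' or '=', optional spaces/tabs, then a non-empty value.
SIMPLE_PAIR = re.compile(r"([^\n:=]+)[:=][ \t]*([^\n]+)")


def extract_pairs(text):
    """Return (key, value) pairs with surrounding whitespace stripped."""
    return [(k.strip(), v.strip()) for k, v in SIMPLE_PAIR.findall(text)]


# Two lines as they look when a CRLF file is read without translation.
crlf_text = "Subject:\r\nSecurity ID: x\r\n"
# The same text after normalizing the line endings.
lf_text = crlf_text.replace("\r\n", "\n")

pairs_from_crlf = extract_pairs(crlf_text)
pairs_from_lf = extract_pairs(lf_text)
print(pairs_from_crlf)
print(pairs_from_lf)
```

With the CRLF text, the lone '\r' after "Subject:" satisfies the value group, so "Subject" shows up as a field whose value is empty once stripped. With LF-only text, nothing follows the colon on that line and the header is skipped.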
Code:
# Read the whole file, then normalize Windows CRLF line endings to LF.
with open("sample.txt", "r") as f:
    file_txt = f.read().replace('\r\n', '\n')
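An alternative to replacing the line endings by hand is to let Python normalize them while reading. The sketch below uses `io.open` with `newline=None`, which enables universal newlines on both Python 2.7 and Python 3. The helper name `read_text_normalized`, the UTF-8 encoding, and the temporary demo file are assumptions for illustration.

```python
import io
import os
import tempfile


def read_text_normalized(path, encoding="utf-8"):
    """Read a text file, translating '\\r\\n' and '\\r' to '\\n'."""
    # newline=None turns on universal-newline translation for reading.
    with io.open(path, "r", encoding=encoding, newline=None) as handle:
        return handle.read()


# Write a small CRLF file in binary mode so the bytes are stored as-is.
fd, demo_path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "wb") as handle:
    handle.write(b"Subject:\r\nSecurity ID: x\r\n")

demo_text = read_text_normalized(demo_path)
os.remove(demo_path)
print(repr(demo_text))
```

Note that `io.open` returns unicode strings on Python 2, so the regex and any downstream code would need to accept unicode input.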