Change every IP and MAC address as fast as possible (huge log file)

Time: 2018-10-19 08:21:54

Tags: python regex replace

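I have collected authentication events in a log file of about 5 GB. I now want to change every part of the data that would allow someone to identify where the original data came from, because it will be used as training data for machine learning (and may be published).

Because the logic within the data has to be preserved, I thought of changing the IP and MAC addresses with the modulo operator. But I don't know how to replace them (quickly) with Python (re).

My first attempt used re.search, split the IP it found into 4 parts, and changed each part with a different modulo operation. The problems: - it is ugly - it is slow - it only handles the first match

Does anyone know a good way to solve this?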

______ EDIT _____

Example log:

RID:"700011"; RL:"1"; RG:"windows,authentication_failures"; RC:"Kerberos authentication ticket requested: failure."; USER:"(no user)"; SRCIP:"None"; HOSTNAME:"(boatyMcBoatface) 10.19.18.1->WinEvtLog"; LOCATION:"(boatyMcBoatface) 10.19.18.1->WinEvtLog"; EVENT:"[INIT] 2018 Aug 01 01:59:40 WinEvtLog: Security: AUDIT_FAILURE(4768): Microsoft-Windows-Security-Auditing: (no user): no domain: boatyMcBoatface.haven.ssh: Kerberos authentication ticket (Account Information: Account Name: BackupNow Supplied Realm Name: haven.ssh User ID: S-1-0-0 Service Information: Service Name: krbtgt/haven.ssh Service ID: S-1-0-0 Network Information: Client Address: ::ffff:10.15.16.166 Client Port: 53680 Additional Information: Ticket Options: 0x40810010 Result Code: 0x17 Ticket Encryption Type: 0xffffffff Pre-Authentication Type: - Certificate Information: Certificate Issuer Name: Certificate Serial Number: Certificate Thumbprint: Certificate information is only provided if a certificate was used for pre-authentication. Pre-authentication types, ticket options, encryption types and result codes are defined in RFC 4120.[END]"; 'plugin_sid='700011' proto='6' ctx='192222c3-2222-22222222-422222226754' src_host='' dst_host='' src_net='19111112c3-2222-22222222-422222226754' dst_net='333333a8-f526-1356-bbbe-005022285e074' username='BackupNow' userdata1='1' userdata2='windows,authentication_failures,' userdata3='Kerberos authentication ticket requested: failure.' userdata4='krbtgt/haven.ssh' userdata5='0x17' userdata6='0xffffffff' userdata7='-' userdata9='haven.ssh' device='10.19.18.1'/> ost_dst='boatyMcBoatface' idm_mac_src='12:E4:B1:2B:B3:BB' idm_mac_dst='12:E4:B1:2B:B3:BB' device='10.19.19.23'/>

RID:"700003"; RL:"5"; RG:"windows"; RC:"Windows Network Logon"; USER:"evservice"; SRCIP:"10.3.3.39"; HOSTNAME:"(boatyMcBoatface) 10.19.19.23->WinEvtLog"; LOCATION:"(boatyMcBoatface) 10.19.19.23->WinEvtLog"; EVENT:"[INIT] 2018 Aug 01 01:59:37 WinEvtLog: Security: AUDIT_SUCCESS(4624): Microsoft-Windows-Security-Auditing: evservice: SSI-LOG: boatyMcBoatface.haven.ssh: An account was successfully logged on. Subject: Security ID: S-1-0-0 Account Name: - Account Domain: - Logon ID: 0x0 Logon Type: 3 New Logon: Security ID: S-1-5-21-88886292-694438636-1307214239-9687 Account Name: myservice Account Domain: MY-LOG Logon ID: 0x226aa299c6 Logon GUID: {0354E718-498F-039C-83C2-725752D013BE} Process Information: Process ID: 0x0 Process Name: - Network Information: Workstation Name: Source Network Address: 10.3.3.39 Source Port: 61266 Detailed Authentication Information: Logon Process: Kerberos Authentication Package: Kerberos Transited Services: - Package Name (NTLM only): - Key Length: 0 This event is generated when a logon session is created. It is generated on the computer. [END]"; 'plugin_sid='700003' proto='6' ctx='584a8883-a333-22a6-adde-000000876224' src_host='' dst_host='aaaaaaa-2ebf-e2ea-eee-e053079999ed' src_net='555555-f226-11e6-bbbb-005056876974' dst_net='666666de-2be4-8242-1d75-45b6aaaaaaaa' username='myservice' userdata1='5' userdata2='windows,' userdata3='Windows Network Logon' userdata4='4624' userdata5='3' userdata6='MY-LOG' userdata7='0x226cb22322' userdata8='-' idm_host_dst='boatyMcBoatface' idm_mac_dst='A1:15:14:AB:1C:1D' device='10.19.19.23'/>

RID:"700014"; RL:"1"; RG:"windows,authentication_failures"; RC:"Kerberos user pre-authentication failed."; USER:"(no user)"; SRCIP:"None"; HOSTNAME:"((my-dc02) 22.22.65.6->WinEvtLog"; LOCATION:"((my-dc02) 22.22.65.6->WinEvtLog"; EVENT:"[INIT] 2018 Aug 01 09:04:50 WinEvtLog: Security: AUDIT_FAILURE(4771): Microsoft-Windows-Security-Auditing: (no user): no domain: my-dc02.my.ssh: Kerberos pre-Account Information: Security ID: S-1-5-21-1993962763-602162358-1801674531-2146 Account Name: sys-dobackup Service Information: Service Name: krbtgt/gb Network Information: Client Address: ::ffff:22.22.1.1 Client Port: 61391 Additional Information: Ticket Options: 0x40810010 Failure Code: 0x18 Pre-Authentication Type: 2 Certificate Information: Certificate Issuer Name: Certificate Serial Number: Certificate Thumbprint: Certificate information is only provided if a certificate was used for pre-authentication. Pre-authentication types, ticket options and failure codes are defined in RFC 4120. If the ticket was malformed or damaged during transit and could not be decrypted, then many fields in this event might not be present.[END]"; 'plugin_sid='700014' proto='6' ctx='aaaaaaa-e2cf-12a9-9c1f-288888a5c27' src_host='aaaaaa3-ff38-22e6-b718-01544442f94' dst_host='55555ec3-ff20-5515-8059-0011111a2b4' src_net='a6d1111d-7111-811d-f35-f4ea131269107' dst_net='44449bea-960c-4446-6f444-d4444f159b8' username='sys-dobackup' userdata1='1' userdata2='windows,authentication_failures,' userdata3='Kerberos user pre-authentication failed.' userdata4='4771' userdata5='2' userdata6='krbtgt/gb' userdata7='0x18' idm_host_src='do-dc01' idm_host_dst='my-dc02' idm_mac_src='11:30:22:37:33:63' idm_mac_dst='22:21:56:44:14:21' device='22.22.65.6'/>

____ EDIT_2 ___

Example:

_____ BEFORE ____

1 date time src_ip=192.168.1.1 dst_ip=192.168.1.2 msg
2 date time src_ip=192.168.1.1 dst_ip=192.168.1.3 msg
3 date time src_ip=192.168.1.9 dst_ip=192.168.1.2 msg

_____ AFTER _____

1 date time src_ip=1.168.1.2 dst_ip=1.168.1.3 msg
2 date time src_ip=1.168.1.2 dst_ip=1.168.1.4 msg
3 date time src_ip=1.168.1.10 dst_ip=1.168.1.3 msg
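One way to get from the BEFORE lines to the AFTER lines is to pass a function to `re.sub`, which is then called once for every match on the line rather than only the first. The following is a minimal sketch, not a definitive implementation: the pattern is a loose IPv4 matcher, and the two rules (first octet modulo 191, last octet plus one) are assumptions picked only because they reproduce the example above. The names `IPV4`, `anonymize_ip` and `anonymize_line` are hypothetical.

```python
import re

# Loose IPv4 matcher: four dot-separated groups of 1-3 digits (no range check).
IPV4 = re.compile(r"\b(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})\b")


def anonymize_ip(match):
    """Return the replacement text for one matched address.

    Assumed rules, chosen only to reproduce the example:
    first octet modulo 191, last octet shifted by one (wrapping at 256).
    """
    a, b, c, d = (int(part) for part in match.groups())
    return "{}.{}.{}.{}".format(a % 191, b, c, (d + 1) % 256)


def anonymize_line(line):
    # re.sub calls anonymize_ip for every match and splices the result
    # back into the line at the position of the original address.
    return IPV4.sub(anonymize_ip, line)


before = [
    "1 date time src_ip=192.168.1.1 dst_ip=192.168.1.2 msg",
    "2 date time src_ip=192.168.1.1 dst_ip=192.168.1.3 msg",
    "3 date time src_ip=192.168.1.9 dst_ip=192.168.1.2 msg",
]
after = [anonymize_line(line) for line in before]
for line in after:
    print(line)
```

Note that `a % 191` is not one-to-one (1 and 192 both map to 1), so a rule like this can merge addresses that were distinct in the original data; a constant shift such as the one used for the last octet does not have that problem.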

My garbage code:

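import re

# Raw strings, so the backslashes in the Windows paths are not read as escapes
file = r"C:\Users\Hank\Desktop\Huge.log"
file2 = r"C:\Users\Hank\Desktop\Huge2.log"

searchstring = "some_regex_magic"  # placeholder for the real IP/MAC pattern
with open(file) as f:
    for line in f:
        result = re.findall(searchstring, line)

        if result:
            # Pseudocode: old_ip and anonymize_em_all are not defined yet
            ip = old_ip + anonymize_em_all
            # No idea how to add them back into the string at the correct position
            # Replace them directly maybe, without writing a new file?
            res2 = ip + "\n"
            # The with block closes the file, so no explicit close() is needed
            with open(file2, "a") as myfile:
                myfile.write(res2)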
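The two open points in the comments above (putting the changed address back at the right position, and writing the result) can be sketched as follows. This is only a sketch under assumptions: `mask_ip` is a hypothetical stand-in for the real anonymizer, `anonymize_file` is a made-up helper name, and the demo runs on a small temporary file rather than the real log. It writes a second file instead of editing in place, and opens the output once instead of re-opening it in append mode for every line.

```python
import os
import re
import tempfile

# Loose IPv4 matcher (no range check on the octets).
IPV4 = re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b")


def mask_ip(match):
    # Placeholder transformation (assumption): plug in the real anonymizer here.
    return "x.x.x.x"


def anonymize_file(src_path, dst_path):
    """Stream src_path line by line and write the rewritten lines to dst_path."""
    with open(src_path) as src, open(dst_path, "w") as dst:
        for line in src:
            # sub() returns the whole line with every match replaced in place,
            # so lines without an address are copied through unchanged.
            dst.write(IPV4.sub(mask_ip, line))


# Demo on a small temporary file instead of the real 5 GB log.
workdir = tempfile.mkdtemp()
src_path = os.path.join(workdir, "Huge.log")
dst_path = os.path.join(workdir, "Huge2.log")
with open(src_path, "w") as f:
    f.write("1 date time src_ip=192.168.1.1 dst_ip=192.168.1.2 msg\n")
    f.write("2 date time no address here\n")

anonymize_file(src_path, dst_path)
with open(dst_path) as f:
    result = f.read()
print(result)
```

Because only one line is held in memory at a time, the size of the input file should not matter much for memory use; compiling the pattern once also avoids looking it up again for every line.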

Best regards

1 Answer:

Answer 0 (score: 1)

Try the code below; it is rough around the edges, but it does the replacement.

import re

lines = ["1 date time src_ip=192.168.1.1 dst_ip=192.168.1.2 msg",
         "2 date time src_ip=192.168.1.1 dst_ip=192.168.1.3 msg",
         "3 date time src_ip=192.168.1.9 dst_ip=192.168.1.2 msg"]

# Replace every dotted-quad IPv4 address on each line with a fixed placeholder
for line in lines:
    print(re.sub(r"\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}", "x.x.x.x", line))

Sample output:

1 date time src_ip=x.x.x.x dst_ip=x.x.x.x msg
2 date time src_ip=x.x.x.x dst_ip=x.x.x.x msg
3 date time src_ip=x.x.x.x dst_ip=x.x.x.x msg
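A fixed placeholder hides which lines share an address. If that relationship has to survive, `re.sub` also accepts a function as the replacement, and the same idea carries over to MAC addresses. Below is a minimal sketch for the MAC case, assuming colon-separated addresses like the ones in the example log; `MAC`, `OFFSET` and `anonymize_mac` are hypothetical names, and the shift value is arbitrary.

```python
import re

# Six colon-separated hex byte pairs, e.g. 12:E4:B1:2B:B3:BB.
MAC = re.compile(r"\b[0-9A-Fa-f]{2}(?::[0-9A-Fa-f]{2}){5}\b")

OFFSET = 7  # arbitrary shift (assumption); any constant works the same way


def anonymize_mac(match):
    # Shift every byte by OFFSET modulo 256. This mapping is one-to-one,
    # so two different addresses never end up as the same output.
    octets = [int(part, 16) for part in match.group(0).split(":")]
    return ":".join("{:02X}".format((b + OFFSET) % 256) for b in octets)


line = "idm_mac_src='11:30:22:37:33:63' idm_mac_dst='22:21:56:44:14:21'"
masked = MAC.sub(anonymize_mac, line)
print(masked)
# idm_mac_src='18:37:29:3E:3A:6A' idm_mac_dst='29:28:5D:4B:1B:28'
```

A constant shift is easy to undo for anyone who guesses the offset, so this keeps the structure of the data but is weak as anonymization on its own.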

Hope this helps! Cheers!