我需要解析包含平面文本的文件,并提取有效的IP地址和混淆的IP地址。
(即192.168.1 [。] 1或192.168.1(。)1或192.168.1 [dot] 1或192.168.1(dot)1或192.168.1.1)
提取数据后,我需要将它们全部转换为有效格式并删除重复数据。
我当前的代码将ip地址放入一个字符串,该字符串应该是一个字典?我知道我需要使用某种递归来设置键值,但我觉得有一种更有效和模块化的方法来完成任务。
import json, ordereddict, re
# define the pattern of valid and obfuscated ips
pattern = r"((([01]?[0-9]?[0-9]|2[0-4][0-9]|25[0-5])[ (\[]?(\.|dot)[ )\]]?){3}([01]?[0-9]?[0-9]|2[0-4][0-9]|25[0-5]))"
# open data file that contains ip addresses and other text
with open ("sample.txt", "r") as myfile:
text=myfile.read().replace('\n', '')
# put non normalized ip addresses in a dictionary
ips = {"data": [{"key1": match[0] for match in re.findall(pattern, text) }]}
# normalized ip addresses
for name, datalist in ips.iteritems():
for datadict in datalist:
for key, value in datadict.items():
if value == "(dot)":
datadict[key] = "."
if value == "[dot]":
datadict[key] = "."
if value == " . ":
datadict[key] = "."
if value == " .":
datadict[key] = "."
if value == ". ":
datadict[key] = "."
# write valid ip address to json file
with open('test.json', 'w') as outfile:
json.dump(ips, outfile)
示例数据文件
These are valid ip addresses 192.168.1.1, 8.8.8.8
These are obfuscated 192.168.2[.]1 or 192.168.3(.)1 or 192.168.1[dot]1
192.168.1[dot]1 or 192.168.1(dot)1 or 192 .168 .1 .1 or 192. 168. 1. 1. or 192 . 168 . 1 . 1
This is what an invalid ip address looks like, they should be excluded 256.1.1.1 or 500.1.500.1 or 192.168.4.0
预期结果
192.168.1.1, 192.168.2.1, 192.168.3.1 , 8.8.8.8