Question

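I need to parse files containing plain text and extract both valid and obfuscated IP addresses.

(i.e. 192.168.1[.]1 or 192.168.1(.)1 or 192.168.1[dot]1 or 192.168.1(dot)1 or 192.168.1.1)

After extracting the data, I need to convert all of it to a valid format and remove the duplicates.

My current code puts the IP addresses into a string; should that be a dictionary? I know I need to use some kind of recursion to set the key values, but I feel there is a more efficient and modular way to accomplish the task.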

import json, ordereddict, re

# define the pattern of valid and obfuscated ips
pattern = r"((([01]?[0-9]?[0-9]|2[0-4][0-9]|25[0-5])[ (\[]?(\.|dot)[ )\]]?){3}([01]?[0-9]?[0-9]|2[0-4][0-9]|25[0-5]))"

# open data file that contains ip addresses and other text
with open ("sample.txt", "r") as myfile:
    text=myfile.read().replace('\n', '')

# put non normalized ip addresses in a dictionary
ips = {"data": [{"key1": match[0] for match in re.findall(pattern, text) }]}

# normalized ip addresses
for name, datalist in ips.iteritems():
    for datadict in datalist:
        for key, value in datadict.items():
            if value == "(dot)":
                datadict[key] = "."
            if value == "[dot]":
                datadict[key] = "."
            if value == " . ":
                datadict[key] = "."
            if value == " .":
                datadict[key] = "."
            if value == ". ":
                datadict[key] = "."

# write valid ip address to json file
with open('test.json', 'w') as outfile:
    json.dump(ips, outfile)
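
For what it's worth, the loop above compares each whole matched value against a bare separator such as "(dot)", so no value is ever rewritten, and the dict comprehension reuses the single key "key1", so only the last match survives. A minimal sketch of a flatter approach follows, assuming Python 3 (the standard `ipaddress` module) and using hypothetical helper names `normalize` and `is_valid_ipv4`: rewrite every separator with one `re.sub`, validate, and let a set handle the duplicates. No dictionary or recursion is needed for this step.

```python
import ipaddress
import re

# Separator forms taken from the question: ".", "[.]", "(.)", "[dot]", "(dot)",
# each optionally padded with spaces or tabs. The names here are illustrative.
SEPARATOR = re.compile(r"[ \t]*[\[(]?[ \t]*(?:\.|dot)[ \t]*[\])]?[ \t]*", re.IGNORECASE)


def normalize(candidate):
    """Rewrite every obfuscated separator in one candidate as a plain dot."""
    return SEPARATOR.sub(".", candidate.strip())


def is_valid_ipv4(address):
    """Return True if the standard library accepts the string as IPv4."""
    try:
        ipaddress.IPv4Address(address)
    except ValueError:
        return False
    return True


candidates = ["192.168.1[dot]1", "192.168.1(.)1", "192 . 168 . 1 . 1", "256.1.1.1"]
unique = {normalize(c) for c in candidates}              # the set removes duplicates
valid = sorted(ip for ip in unique if is_valid_ipv4(ip))
print(valid)
```

Keeping normalization and validation as separate small functions makes each one easy to test on its own.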

Sample data file

These are valid ip addresses 192.168.1.1, 8.8.8.8
These are obfuscated 192.168.2[.]1 or 192.168.3(.)1 or 192.168.1[dot]1
192.168.1[dot]1 or 192.168.1(dot)1 or 192 .168 .1 .1 or 192. 168. 1. 1. or 192 . 168 . 1 . 1
This is what an invalid ip address looks like, they should be excluded 256.1.1.1 or 500.1.500.1 or 192.168.4.0

Expected result

192.168.1.1, 192.168.2.1, 192.168.3.1, 8.8.8.8
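
One possible way to get from the sample data to this result is sketched below, assuming Python 3. The names `extract_ips` and `to_json` are hypothetical, and two rules are assumptions read off the example rather than stated requirements: a number glued to extra digits (as in 256.1.1.1) is rejected outright instead of being matched partially, and an address whose last octet is 0 is dropped because the sample lists 192.168.4.0 as invalid.

```python
import json
import re

# One octet in the range 0-255, without leading zeros.
OCTET = r"(?:25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9]?[0-9])"
# ".", "[.]", "(.)", "[dot]", "(dot)", optionally padded with spaces or tabs.
SEP = r"[ \t]*[\[(]?[ \t]*(?:\.|dot)[ \t]*[\])]?[ \t]*"

# The lookbehind and lookahead stop a match from starting or ending
# in the middle of a longer number.
IP_RE = re.compile(
    r"(?<![\d.])" + OCTET + r"(?:" + SEP + OCTET + r"){3}(?!\d)",
    re.IGNORECASE,
)
SEP_RE = re.compile(SEP, re.IGNORECASE)


def extract_ips(text):
    """Return normalized, de-duplicated addresses in first-seen order."""
    found = []
    for match in IP_RE.finditer(text):
        ip = SEP_RE.sub(".", match.group(0))
        # Assumption: the question treats 192.168.4.0 as invalid.
        if ip.endswith(".0"):
            continue
        if ip not in found:
            found.append(ip)
    return found


def to_json(text):
    """Serialize the result under a "data" key, as in the original attempt."""
    return json.dumps({"data": extract_ips(text)})
```

The lookarounds only inspect the single character next to the match, so they guard against adjacent digits and plain dots but not against every obfuscated prefix. Reading the file without stripping the newlines also keeps addresses at the end of one line from fusing with the start of the next.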

De-obfuscating IP addresses in a Python dictionary

0 answers: