Question

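I have a file with roughly 40,000 lines of information, and I want to extract the IP address of a particular system using Python 3.4. The file is divided into blocks, each starting with "lease" and ending with "}". I want to search for "SYSTEM123456789" and extract the IP address "10.0.0.2". How can I do this, and what is the preferred approach?

1) Read the file in, break it up into a list, and then search?
2) Copy the file and then search within that copy?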

Tax       Description             ML   Total    Link  Link      Link     Link      Link     Link
Code                                   Rate     Code  Rate               Rate      Code     Rate

SC001     Abbeville County, SC    Y    0.0700   SC    0.0600    SCLO1    0.0100
SC002     Aiken County, SC        Y    0.0800   SC    0.0600    SCCP1    0.0100    SCEC1    0.0100

Answer 1

You can group the lines with groupby, using the lease lines as the delimiter:

from itertools import groupby

def find_ip(s, f):
    with open(f) as f:
        grouped = groupby(f, key=lambda x: x.startswith("lease "))
        for k, v in grouped:
            if k: # v is the lease line
                # get ip from lease line
                ip = next(v).rstrip().split()[1]
                # call next to get next element from our groupby object 
                # which is each section after lease 
                val = list(next(grouped)[1])[-2]
                # check for substring
                if val.find(s) != -1:
                    return ip.rstrip("{")
    return "No match"

Using the input file:

In [5]: find_ip('"SYSTEM123456789"',"in.txt")
Out[5]: '10.0.0.2'

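Using x.startswith("lease ") as the key to groupby splits the file into sections. When k is True we have a lease line, so we extract the IP and then check the second-to-last line of that lease block; if the substring is found there, we return the IP.

The file is split into sections that look like this: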

['  starts 1 2015/06/29 07:22:01;\r\n', '  ends 2 2015/06/30 07:22:01;\r\n', '  tstp 2 2015/06/30 07:22:01;\r\n', '  cltt 1 2015/06/29 07:22:01;\r\n', '  binding state active; \r\n', '  next binding state free;\r\n', '  hardware ethernet 08:2e:5f:f0:8b:a1;\r\n', '}\r\n']
['  starts 1 2015/06/29 07:31:20;\r\n', '  ends 2 2015/06/30 07:31:20;\r\n', '  tstp 2 2015/06/30 07:31:20;\r\n', '  cltt 1 2015/06/29 07:31:20;\r\n', '  binding state active; \r\n', '  next binding state free;\r\n', '  hardware ethernet ec:b1:d7:87:6f:7a;\r\n', '  uid "\\001\\354\\261\\327\\207oz";\r\n', '  client-hostname "SYSTEM123456789";\r\n', '}']

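You can see that the second-to-last element is client-hostname, so each time we pull that element out and search it for the substring.

If the substring can appear anywhere in the block, you can use any and check every line: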
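To make the grouping easier to picture, here is a minimal sketch that runs the same groupby key over a few in-memory lines. The sample lines and the helper name show_groups are made up for illustration, modelled on the lease blocks described in the question.

```python
from itertools import groupby

# Hypothetical sample lines, modelled on the lease blocks in the question.
sample = [
    "lease 10.0.0.1 {\n",
    "  binding state active;\n",
    "}\n",
    "lease 10.0.0.2 {\n",
    "  binding state active;\n",
    '  client-hostname "SYSTEM123456789";\n',
    "}\n",
]


def show_groups(lines):
    """Return (key, lines) pairs in the order groupby produces them."""
    key = lambda x: x.startswith("lease ")
    return [(k, list(v)) for k, v in groupby(lines, key=key)]


groups = show_groups(sample)
for is_lease, chunk in groups:
    # True groups hold a lease header; False groups hold the block body.
    print(is_lease, chunk)
```

Each True group is followed by a False group containing the body of that lease, which is why the answer calls next(grouped) right after reading the lease line.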

def find_ip(s, f):
    with open(f) as f:
        grouped = groupby(f, key=lambda x: x.startswith("lease "))
        for k, v in grouped:
            if k: # v is the lease line
                # get ip from lease line
                ip = next(v).rstrip().split()[1]
                # call next to get next element from our groupby object
                # which is each section after lease
                val = next(grouped)[1]
                # check for substring
                if any(sub.find(s) != -1 for sub in val):
                    return ip.rstrip("{")
    return "No match"

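You can apply the same logic by simply iterating over the file object with an outer and an inner loop. When you find a line that starts with "lease ", start the inner loop and run it until you either find the substring and return the IP, or hit a }, which signals the end of the section, and break out of the inner loop.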

def find_ip(s, f):
    with open(f) as f:
        for line in f:
            if line.startswith("lease "):
                ip = line.rstrip().split()[1]
                for n_line in f:
                    if n_line.find(s) != -1:
                        return ip.rstrip("{")
                    if n_line.startswith("}"):
                        break
    return "No match"

Output:

In [9]: find_ip('"SYSTEM123456789"',"in.txt")
Out[9]: '10.0.0.2'

Neither approach ever holds more than one section of lines in memory at a time.
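If several systems need to be looked up, the same line-by-line idea could be wrapped in a generator so the file is still read lazily. This is only a sketch: the function name iter_leases and the sample text are assumptions, and it presumes each block has at most one client-hostname line and ends with a line starting with }.

```python
import io


def iter_leases(lines):
    """Yield (ip, hostname) per lease block; hostname is None if absent."""
    ip = None
    hostname = None
    for line in lines:
        stripped = line.strip()
        if line.startswith("lease "):
            # Header looks like: lease 10.0.0.2 {
            ip = line.split()[1].rstrip("{")
            hostname = None
        elif ip is not None and stripped.startswith("client-hostname"):
            # Line looks like: client-hostname "SYSTEM123456789";
            value = stripped.partition(" ")[2]
            hostname = value.strip().rstrip(";").strip('"')
        elif ip is not None and stripped.startswith("}"):
            # End of the block: emit what was collected and reset.
            yield ip, hostname
            ip = None


# Hypothetical sample standing in for the real file object.
sample = io.StringIO(
    "lease 10.0.0.1 {\n"
    "  hardware ethernet 08:2e:5f:f0:8b:a1;\n"
    "}\n"
    "lease 10.0.0.2 {\n"
    "  hardware ethernet ec:b1:d7:87:6f:7a;\n"
    '  client-hostname "SYSTEM123456789";\n'
    "}\n"
)
matches = [ip for ip, host in iter_leases(sample) if host == "SYSTEM123456789"]
print(matches)
```

Because iter_leases accepts any iterable of lines, an open file object can be passed in place of the StringIO sample.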

Answer 2

Regarding what @Ijk mentioned, I came up with this.

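import re

path = "in.txt"  # assumed file name, same as the one used in Answer 1
ip = None        # IP of the most recent lease header seen so far

with open(path) as f:
    for line in f:
        # Remember the IP address on each lease header line
        # (dots are escaped so they only match a literal ".").
        mat = re.match(r'lease ([0-9]+\.[0-9]+\.[0-9]+\.[0-9]+)', line)
        if mat:
            ip = mat.group(1)
        # Print the remembered IP only when the system name shows up.
        mat = re.search(r'"SYSTEM123456789"', line)
        if mat and ip is not None:
            print(ip)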

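The OP asked for a preferred method; this is mine, although I'm not the best at regular expressions. Still, I think this is what the OP was looking for.

I changed the regex for the IP address so that it can match arbitrary IPs, and the IP is printed only when the SYSTEM name is found.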
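A pattern built only from digits and dots still accepts strings such as 999.0.0.2, which is not a valid address. One possible refinement, sketched here under the assumption that the header has the form lease ADDRESS {, is to let the standard-library ipaddress module do the validation. The names LEASE_RE and parse_lease_ip are made up for this example.

```python
import ipaddress
import re

# Capture whatever token sits between "lease" and the opening brace.
LEASE_RE = re.compile(r"lease\s+(\S+)\s*\{")


def parse_lease_ip(line):
    """Return the IPv4 address on a lease header line, or None."""
    mat = LEASE_RE.match(line)
    if not mat:
        return None
    try:
        # IPv4Address raises AddressValueError for malformed input.
        return str(ipaddress.IPv4Address(mat.group(1)))
    except ipaddress.AddressValueError:
        return None


print(parse_lease_ip("lease 10.0.0.2 {\n"))
print(parse_lease_ip("lease 999.0.0.2 {\n"))
```

The first call prints the address and the second prints None, since 999 is out of range for an octet.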

Extracting information from a file
