Question

I have some text files in the following format (network traffic collected by tcpdump):

1505372009.023944 00:1e:4c:72:b8:ae > 00:23:f8:93:c1:af, ethertype IPv4 (0x0800), length 97: (tos 0x0, ttl 64, id 5134, offset 0, flags [DF], proto TCP (6), length 83)
    192.168.1.53.36062 > 74.125.143.139.443: Flags [P.], cksum 0x67fd (correct), seq 1255996541:1255996572, ack 1577943820, win 384, options [nop,nop,TS val 356377 ecr 746170020], length 31
    0x0000:  0023 f893 c1af 001e 4c72 b8ae 0800 4500  .#......Lr....E.
    0x0010:  0053 140e 4000 4006 8ab1 c0a8 0135 4a7d  .S..@.@......5J}
    0x0020:  8f8b 8cde 01bb 4adc fc7d 5e0d 830c 8018  ......J..}^.....
    0x0030:  0180 67fd 0000 0101 080a 0005 7019 2c79  ..g.........p.,y
    0x0040:  a6a4 1503 0300 1a00 0000 0000 0000 04d1  ................
    0x0050:  c300 9119 6946 698c 67ac 47a9 368a 1748  ....iFi.g.G.6..H
    0x0060:  1c                                       .

and I would like to change them to:

1505372009.023944 
    000000:  00 23 f8 93 c1 af 00 1e 4c 72 b8 ae 08 00 45 00  .#......Lr....E.
    000010:  00 53 14 0e 40 00 40 06 8a b1 c0 a8 01 35 4a 7d  .S..@.@......5J}
    000020:  8f 8b 8c de 01 bb 4a dc fc 7d 5e 0d 83 0c 80 18  ......J..}^.....
    000030:  01 80 67 fd 00 00 01 01 08 0a 00 05 70 19 2c 79  ..g.........p.,y
    000040:  a6 a4 15 03 03 00 1a 00 00 00 00 00 00 00 04 d1  ................
    000050:  c3 00 91 19 69 46 69 8c 67 ac 47 a9 36 8a 17 48  ....iFi.g.G.6..H
    000060:  1c                                               .

This is what I did:

import re
regexp_time = re.compile(r"\d\d\d\d\d\d\d\d\d\d\.\d\d\d\d\d\d+")
regexp_hex = re.compile(r"(\t0x[0-9a-f]+:\s+)([0-9a-f ]+)+  ")

with open ('../Traffic/traffic1.txt') as input,open ('../Traffic/txt2.txt','w') as output:
    for line in input:
        if regexp_time.match(line):
            output.write ("%s\n" % (line.split()[0]))
        elif regexp_hex.match(line):
            words = re.split(r'\s{2,}', line)
            bytes=""
            for byte in words[1].split():
                if len(byte) == 4:
                    bytes += "%s%s %s%s "%(byte[0],byte[1],byte[2],byte[3])
                elif len(byte) == 2:
                    bytes += "%s%s "%(byte[0],byte[1])
            output.write ("%s  %s %s \n" % (words[0].replace("0x","00"),"{:<47}".format (bytes),words[2].replace("\n","")))

Can anyone help me speed this up?

Edit

Here is a new version of the code, based on @Austin's answer. It really did speed things up.

with open ('../Traffic/traffic1.txt') as input,open ('../Traffic/txt1.txt','w') as output:
    for line in input:
        if line[0].isdigit():
            output.write (line[:16])
            output.write ('\n')
        elif line.startswith("\t0x"):  # some lines are neither hex lines nor timestamp lines, so check this too
            offset = line[:10]  # "\t0x0000:  "
            words = line[10:51]  # "0023 f893 c1af 001e 4c72 b8ae 0800 4500  "
            chars = line[51:]  # ".#......Lr....E."
            line = [offset.replace('x', '0', 1)]
            for a,b,c,d,space in zip (words[0::5],words[1::5],words[2::5],words[3::5],words[4::5]):
                line.append(a)
                line.append(b)
                line.append(space)
                line.append(c)
                line.append(d)
                line.append(space)
            line.append (chars)
            output.write (''.join (line))

The result is as follows:

1505372009.02394
000000:  00 23 f8 93 c1 af 00 1e 4c 72 b8 ae 08 00 45 00 .#......Lr....E.
000010:  00 53 14 0e 40 00 40 06 8a b1 c0 a8 01 35 4a 7d .S..@.@......5J}
000020:  8f 8b 8c de 01 bb 4a dc fc 7d 5e 0d 83 0c 80 18 ......J..}^.....
000030:  01 80 67 fd 00 00 01 01 08 0a 00 05 70 19 2c 79 ..g.........p.,y
000040:  a6 a4 15 03 03 00 1a 00 00 00 00 00 00 00 04 d1 ................
000050:  c3 00 91 19 69 46 69 8c 67 ac 47 a9 36 8a 17 48 ....iFi.g.G.6..H
000060:  1c                                              .

Answer 1

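You haven't specified anything else about the file format, including whether any lines appear between the blocks of packet data. So I'll assume you only have paragraphs like the one you showed, stuck together.

The best way to speed up something like this is to cut out extra operations. You have a bunch! For example: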

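You use a regular expression to match the "start" line.
You use a split to extract the timestamp from the start line.
You use the %-format operator to write out the timestamp.
You use a different regular expression to match the "hex" line.
You use more than one split to parse the hex line.
You use various formatting operators to output the hex line.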

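If you are going to use regular expression matching, then I think you should do only one match. Create an alternation pattern (like a|b) that describes both kinds of line, and use match.lastgroup or .lastindex to determine which alternative matched.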
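For what it's worth, here is a minimal sketch of that single-match idea. The pattern, the group names `time` and `hex`, and the helper `classify` are my own illustrative choices rather than anything from the question, and the pattern assumes a hex line starts with whitespace followed by `0x`.

```python
import re

# Hypothetical combined pattern: one named group per kind of line.
LINE_RE = re.compile(
    r"(?P<time>\d{10}\.\d{6})"       # timestamp at the start of a packet header line
    r"|(?P<hex>\s+0x[0-9a-f]+:)"     # indented offset at the start of a hex dump line
)

def classify(line):
    """Return 'time', 'hex', or None, depending on which alternative matched."""
    m = LINE_RE.match(line)
    return m.lastgroup if m else None
```

Only one of the two groups can take part in a match, so `lastgroup` names the alternative that matched and the line is scanned once instead of twice.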

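But your lines are so different that I don't think you need a regular expression at all. Basically, you can decide which kind of line you have by looking at the first character: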

if line[0].isdigit():
    # This is a timestamp line
    pass
else:
    # This is a hex line
    pass
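One caveat: the sample in the question also contains an indented summary line (the one starting with the IP addresses) that is neither a timestamp line nor a hex line. A hedged sketch of a three-way check follows. The helper name `line_kind` is hypothetical, and it assumes hex lines are the only ones whose first non-blank characters are `0x`.

```python
def line_kind(line):
    """Classify a tcpdump text line without using a regular expression."""
    if line[:1].isdigit():
        return "time"    # starts with the epoch timestamp
    if line.lstrip().startswith("0x"):
        return "hex"     # indented hex dump line such as "    0x0000:  ..."
    return "other"       # e.g. the indented IP/TCP summary line, or a blank line
```

Slicing with `line[:1]` instead of indexing with `line[0]` keeps the check safe on an empty string.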

For timestamp processing, all you want to do is print the first 17 characters of the line: 10 digits, a dot, and 6 more digits. So do that:

if line[0].isdigit():
    output.write(line[:17] + '\n')

For hex line processing, you want to make two kinds of changes. First, you want to replace the 'x' in the hex offset with a zero. That's easy:

    hexline = line.replace('x', '0', 1)   # Note: 1 replacement only!

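Then you want to insert spaces between the groups of 4 hex digits, and pad short lines so that the character display appears in the same column.

This is a place where a regular expression substitution might help you. There are only a limited number of occurrences, but it may be that the overhead of the CPython interpreter is higher than the setup and teardown cost of a regex substitution. You should probably do some profiling on this.

That said, you can split the line into three parts. It's important, though, to capture the trailing space of the middle part: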

offset = line[:13]   # "    0x0000:  "
words  = line[13:53] # "0023 f893 c1af 001e 4c72 b8ae 0800 4500 "
chars  = line[53:]   # " .#......Lr....E." plus the newline

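You already know how to replace the 'x' in offset, and there's nothing to be done to the chars part of the line, so we'll leave those alone. The remaining task is to spread out the characters in the words string. You can do that in various ways, but it seems easy to process the characters in chunks of 5 (4 hex digits plus a trailing space).

We can do this because we captured the trailing space of the words part. If we hadn't, you might have to use itertools.zip_longest(..., fillvalue=''), but it's probably easier just to grab one more character.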
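To illustrate why the trailing space matters, here is a small sketch of the alternative, assuming the slice stopped one character short. The helper name `split_groups` is made up for this example.

```python
from itertools import zip_longest

# Same hex words as in the question, but WITHOUT the trailing space (39 characters).
words = "0023 f893 c1af 001e 4c72 b8ae 0800 4500"

def split_groups(words):
    """Split words into 5-character chunks, tolerating a missing final space."""
    columns = [words[i::5] for i in range(5)]
    # The last column is one item short; fillvalue='' pads it so no group is lost.
    return ["".join(group) for group in zip_longest(*columns, fillvalue="")]
```

With plain `zip` the shortest column wins, so the final group `4500` would be silently dropped. `zip_longest` keeps it, at the cost of an extra import and a little more work per line.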

Once that's done, you can:

for a,b,c,d,space in zip(words[0::5], words[1::5], words[2::5], words[3::5], words[4::5]):
    output.write(a + b + space + c + d + space)

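Or, instead of making all those calls, you could accumulate the characters in a buffer and then write the buffer once. Something like:

    line = [offset]
    for a,b,c,d,space in zip(words[0::5], words[1::5], words[2::5], words[3::5], words[4::5]):
        line.extend((a, b, space, c, d, space))
    line.append(chars)   # chars still ends with the line's own newline
    output.write(''.join(line))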
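Putting the pieces together, the whole per-line conversion might look roughly like the sketch below. The generator name `convert` is hypothetical. It strips the leading whitespace before slicing, so it should not matter whether the dump is indented with a tab or with four spaces, and it skips any line that is neither a timestamp nor a hex line. Both of those choices are assumptions on my part.

```python
def convert(lines):
    """Yield reformatted lines for an iterable of tcpdump text lines."""
    for line in lines:
        if line[:1].isdigit():
            yield line[:17] + "\n"        # 10 digits, a dot, 6 digits
            continue
        stripped = line.lstrip()
        if not stripped.startswith("0x"):
            continue                      # e.g. the IP/TCP summary line
        indent = line[:len(line) - len(stripped)]
        offset = stripped[:9]             # "0x0000:  "
        words = stripped[9:49]            # 39 hex characters plus one trailing space
        chars = stripped[49:]             # one space, the ASCII column, the newline
        parts = [indent, offset.replace("x", "0", 1)]
        for a, b, c, d, space in zip(words[0::5], words[1::5], words[2::5],
                                     words[3::5], words[4::5]):
            parts.extend((a, b, space, c, d, space))
        parts.append(chars)
        yield "".join(parts)
```

Inside the `with` block from the question, this could be used as `output.writelines(convert(input))`.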

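This is fairly straightforward, but like I said, it may not perform as well as a regular expression substitution. That would be because the regex code runs as C rather than as Python bytecode. So you should compare it against a pattern substitution like:

words = re.sub(r'(..)(..) ', r'\1 \2 ', words)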

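Note that I don't require hex digits in the pattern, so that any trailing "padding" spaces on the last line of a paragraph get expanded proportionally. Again, please check the performance against the zip version above!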
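A rough way to run that comparison is with timeit. This is only a sketch: the function names `expand_zip` and `expand_sub` and the repeat count are my own, and the timings will depend on your machine and Python version, so no results are claimed here.

```python
import re
import timeit

SAMPLE = "0023 f893 c1af 001e 4c72 b8ae 0800 4500 "  # 8 groups, each with a trailing space

def expand_zip(words):
    """Spread 'abcd ' groups into 'ab cd ' using slicing and zip."""
    parts = []
    for a, b, c, d, space in zip(words[0::5], words[1::5], words[2::5],
                                 words[3::5], words[4::5]):
        parts.extend((a, b, space, c, d, space))
    return "".join(parts)

PAIR_RE = re.compile(r"(..)(..) ")  # compiled once, outside the timed call

def expand_sub(words):
    """Same transformation using a single regex substitution."""
    return PAIR_RE.sub(r"\1 \2 ", words)

if __name__ == "__main__":
    assert expand_zip(SAMPLE) == expand_sub(SAMPLE)
    for fn in (expand_zip, expand_sub):
        seconds = timeit.timeit(lambda: fn(SAMPLE), number=100_000)
        print(f"{fn.__name__}: {seconds:.3f}s")
```

Timing on a single line only gives a first impression. Profiling the full script on one of your real capture files would be the more trustworthy test.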
