Processing/extraction in Python

Time: 2015-10-30 01:59:17

Tags: python text concatenation extraction

I have a text file containing data that looks like this:

ACK
DATA1   < >
ACK
DATA1   < >
NAK
ACK
DATA1   < >
DATA0   < 20 >
ACK
DATA1   < 01 01 01 00 >
ACK
ACK
DATA1   < >
DATA1   < 20 >
ACK
DATA1   < >
ACK
ACK
ACK
ACK
ACK
ACK
ACK

ACK
DATA0   < 00 00 00 00 ff ff ff ff 00 00 00 01 ff ff ff fe 00 00 00 02 ff ff ff fd 00 00 00 03 ff ff ff fc
      00 00 00 08 ff ff ff f7 00 00 00 09 ff ff ff f6 00 00 00 0a ff ff ff f5 00 00 00 0b ff ff ff f4
      00 00 00 10 ff ff ff ef 00 00 00 11 ff ff ff ee 00 00 00 12 ff ff ff ed 00 00 00 13 ff ff ff ec
      00 00 00 18 ff ff ff e7 00 00 00 19 ff ff ff e6 00 00 00 1a ff ff ff e5 00 00 00 1b ff ff ff e4
      00 00 00 20 ff ff ff df 00 00 00 21 ff ff ff de 00 00 00 22 ff ff ff dd 00 00 00 23 ff ff ff dc
      00 00 00 28 ff ff ff d7 00 00 00 29 ff ff ff d6 00 00 00 2a ff ff ff d5 00 00 00 2b ff ff ff d4
      00 00 00 30 ff ff ff cf 00 00 00 31 ff ff ff ce 00 00 00 32 ff ff ff cd 00 00 00 33 ff ff ff cc
      00 00 00 38 ff ff ff c7 00 00 00 39 ff ff ff c6 00 00 00 3a ff ff ff c5 00 00 00 3b ff ff ff c4
      00 00 00 40 ff ff ff bf 00 00 00 41 ff ff ff be 00 00 00 42 ff ff ff bd 00 00 00 43 ff ff ff bc
      00 00 00 48 ff ff ff b7 00 00 00 49 ff ff ff b6 00 00 00 4a ff ff ff b5 00 00 00 4b ff ff ff b4
      00 00 00 50 ff ff ff af 00 00 00 51 ff ff ff ae 00 00 00 52 ff ff ff ad 00 00 00 53 ff ff ff ac
      00 00 00 58 ff ff ff a7 00 00 00 59 ff ff ff a6 00 00 00 5a ff ff ff a5 00 00 00 5b ff ff ff a4
      00 00 00 60 ff ff ff 9f 00 00 00 61 ff ff ff 9e 00 00 00 62 ff ff ff 9d 00 00 00 63 ff ff ff 9c
      00 00 00 68 ff ff ff 97 00 00 00 69 ff ff ff 96 00 00 00 6a ff ff ff 95 00 00 00 6b ff ff ff 94
      00 00 00 70 ff ff ff 8f 00 00 00 71 ff ff ff 8e 00 00 00 72 ff ff ff 8d 00 00 00 73 ff ff ff 8c
      00 00 00 78 ff ff ff 87 00 00 00 79 ff ff ff 86 00 00 00 7a ff ff ff 85 00 00 00 7b ff ff ff 84 >
DATA1   < 01 01 01 01 fe fe fe fe 00 00 01 00 ff ff fe ff 00 00 02 00 ff ff fd ff 00 00 03 00 ff ff fc ff
      00 00 08 00 ff ff f7 ff 00 00 09 00 ff ff f6 ff 00 00 0a 00 ff ff f5 ff 00 00 0b 00 ff ff f4 ff
      00 00 10 00 ff ff ef ff 00 00 11 00 ff ff ee ff 00 00 12 00 ff ff ed ff 00 00 13 00 ff ff ec ff
      00 00 18 00 ff ff e7 ff 00 00 19 00 ff ff e6 ff 00 00 1a 00 ff ff e5 ff 00 00 1b 00 ff ff e4 ff
      00 00 20 00 ff ff df ff 00 00 21 00 ff ff de ff 00 00 22 00 ff ff dd ff 00 00 23 00 ff ff dc ff
      00 00 28 00 ff ff d7 ff 00 00 29 00 ff ff d6 ff 00 00 2a 00 ff ff d5 ff 00 00 2b 00 ff ff d4 ff
      00 00 30 00 ff ff cf ff 00 00 31 00 ff ff ce ff 00 00 32 00 ff ff cd ff 00 00 33 00 ff ff cc ff
      00 00 38 00 ff ff c7 ff 00 00 39 00 ff ff c6 ff 00 00 3a 00 ff ff c5 ff 00 00 3b 00 ff ff c4 ff
      00 00 40 00 ff ff bf ff 00 00 41 00 ff ff be ff 00 00 42 00 ff ff bd ff 00 00 43 00 ff ff bc ff
      00 00 48 00 ff ff b7 ff 00 00 49 00 ff ff b6 ff 00 00 4a 00 ff ff b5 ff 00 00 4b 00 ff ff b4 ff
      00 00 50 00 ff ff af ff 00 00 51 00 ff ff ae ff 00 00 52 00 ff ff ad ff 00 00 53 00 ff ff ac ff
      00 00 58 00 ff ff a7 ff 00 00 59 00 ff ff a6 ff 00 00 5a 00 ff ff a5 ff 00 00 5b 00 ff ff a4 ff
      00 00 60 00 ff ff 9f ff 00 00 61 00 ff ff 9e ff 00 00 62 00 ff ff 9d ff 00 00 63 00 ff ff 9c ff
      00 00 68 00 ff ff 97 ff 00 00 69 00 ff ff 96 ff 00 00 6a 00 ff ff 95 ff 00 00 6b 00 ff ff 94 ff
      00 00 70 00 ff ff 8f ff 00 00 71 00 ff ff 8e ff 00 00 72 00 ff ff 8d ff 00 00 73 00 ff ff 8c ff
      00 00 78 00 ff ff 87 ff 00 00 79 00 ff ff 86 ff 00 00 7a 00 ff ff 85 ff 00 00 7b 00 ff ff 84 ff >

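This data is part of a USB traffic log and serves as the golden standard against which data generated on the fly by a C program is compared. Unfortunately, the golden standard changes, and I would like the flexibility to generate new structures from the traffic log.

In other words, I want to use Python to generate the structures that I will use in my C program. I need to convert this data into structures that contain the packet names converted to their equivalent hex values (ACK = 0xD2, DATA1 = 0x4B, etc.) and the data (< 01 01 01 >).

The part I am struggling with most is when the data spans multiple lines, for example: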

DATA0 < 00 00 00 00...ff ff ff fc 
        00 00 00 00...ff ff ff f4
        ....
        00 00 00 00...ff ff ff 84 > 

I have not found a way to join these lines and put them on a single line of their own, like this:

DATA0 < 00 00 00 00...ff ff ff 84 >

Once the data is on one line, I know I can use the split() method to extract the parts of interest.

3 answers:

Answer 0: (score: 1)

There may be a slicker way, but this will do it, assuming your data is in 'data.txt'.
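with open('data.txt', 'rt') as fobj:
    lines = []
    in_data_line = False
    for line in fobj:
        # Strip the newline and any trailing whitespace so endswith('>') is reliable.
        line = line.rstrip()
        lines.append(line)
        # A DATA record that does not close on its own line continues below.
        if not in_data_line and line.startswith('DATA') and not line.endswith('>'):
            in_data_line = True
        # The closing '>' ends a multi-line record.
        if in_data_line and line.endswith('>'):
            in_data_line = False
        # Only emit a newline once the current record is complete.
        if not in_data_line:
            lines.append('\n')
# lines now has DATA lines joined
print(''.join(lines))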
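
Once each record sits on one line, the split() step mentioned in the question could look roughly like this. It is only a sketch: parse_line is a hypothetical helper name, and it assumes the angle brackets are separated from the hex bytes by whitespace, as in the sample log. Blank lines should be skipped before calling it.

```python
def parse_line(line):
    """Split one joined log line into (packet_name, payload_bytes)."""
    tokens = line.split()
    name = tokens[0]
    # Everything between '<' and '>' is a hex byte; ACK/NAK lines have no brackets.
    if '<' in tokens:
        start = tokens.index('<') + 1
        end = tokens.index('>')
        payload = [int(tok, 16) for tok in tokens[start:end]]
    else:
        payload = []
    return name, payload


print(parse_line('DATA1   < 01 01 01 00 >'))  # ('DATA1', [1, 1, 1, 0])
```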

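Answer 1: (score: 0)

If I were you, this is what I would do. After reading in the multi-line data, replace the double tab at the start of each continuation line with a space. Then concatenate (join) all of them.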
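
A minimal sketch of that idea, assuming every continuation line starts with whitespace and every new record starts in column 0. The function name join_continuations is made up for illustration, and the pattern accepts tabs or spaces because the sample in the question appears to be indented with spaces.

```python
import re


def join_continuations(text):
    """Fold lines that start with whitespace into the line above them."""
    # A newline followed by tabs/spaces marks a continuation line; replace
    # that whole run with a single space so the record ends up on one line.
    return re.sub(r'\n[ \t]+', ' ', text)


sample = "ACK\nDATA0   < 00 01\n\t\t02 03 >\nNAK\n"
print(join_continuations(sample))
```

The whole file is read as one string here (for example with fobj.read()), which should be fine for logs of this size.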

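Answer 2: (score: 0)

Just a skeleton. It does not join the lines; it splits the whole text into words and then rebuilds the data lists between the angle brackets. I hope the resulting data is easy to process.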

def lex(file):
    in_data = False
    with open(file) as infile:
        for line in infile:
            for word in line.split():
                if not in_data:
                    if word == '<':
                        data_list = []
                        in_data = True
                    else:
                        # process ACK, NAK, DATA, ....
                        yield word
                else:
                    if word == '>':
                        in_data = False
                        yield data_list
                    else:
                        data_list.append(int(word, 16))

print(list(lex('data.txt')))

Output (shortened):

  

['ACK', 'DATA1', [], 'ACK', 'DATA1', [], 'NAK', 'ACK', 'DATA1', [],
 'DATA0', [32], 'ACK', 'DATA1', [1, 1, 1, 0], 'ACK', 'ACK', 'DATA1',
 [], 'DATA1', [32], 'ACK', 'DATA1', [], 'ACK', 'ACK', 'ACK', 'ACK',
 'ACK', 'ACK', 'ACK', 'ACK', 'DATA0', [0, 0, 0, 0, 255, 255, 255, 255,
 0, 0, 0, 1, 255, 255, 255, 254, 0, 0, 0, 2, 255, 255, 255, 253, 0, 0,
 0, 3, 255, 255, 255, 252, 0, 0, 0, 8, 255, 255, 255, 247, 0, 0, 0, 9,
 255, 255, 255, 246, 0, 0, 0, 10, 255, 255, 255, 245, 0, 0, 0, 11, 255,
 255, 255, 244, 0, 0, 0, 16, 255, ...... 255]]
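
To get from a token list like this to something a C program can include, one option is to pair each packet name with its payload and print rows of an array initializer. The following is a rough sketch, not a finished generator. The ACK and DATA1 byte values are the ones given in the question; the NAK and DATA0 values are my assumption based on the USB 2.0 PID encoding and should be checked against the analyzer output. The row layout { pid, length, { bytes } } is a hypothetical C struct, so adapt it to whatever the C side actually declares.

```python
# PID byte values. ACK and DATA1 come from the question; NAK and DATA0 are
# assumptions (USB 2.0 PID encoding) and should be verified before use.
PID = {"ACK": 0xD2, "NAK": 0x5A, "DATA0": 0xC3, "DATA1": 0x4B}


def to_packets(tokens):
    """Pair each packet name with the payload list that follows it, if any."""
    packets = []
    for tok in tokens:
        if isinstance(tok, list):
            # A payload list belongs to the packet name just before it.
            name, _ = packets[-1]
            packets[-1] = (name, tok)
        else:
            packets.append((tok, []))
    return packets


def to_c_initializers(packets):
    """Render packets as rows of a hypothetical C array initializer."""
    rows = []
    for name, data in packets:
        # Use a single 0 for empty payloads so the braces are never empty.
        payload = ", ".join("0x%02x" % b for b in data) or "0"
        rows.append("    { 0x%02X, %d, { %s } }," % (PID[name], len(data), payload))
    return "\n".join(rows)


tokens = ['ACK', 'DATA1', [], 'NAK', 'DATA0', [32]]
print(to_c_initializers(to_packets(tokens)))
```

An unknown packet name raises KeyError, which is probably what you want while the mapping is still incomplete.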