How to extract parts of a string from a text file using prefixes

Time: 2017-08-23 12:37:58

Tags: python string prefix

I have a text file that contains strings like the following in certain places.

20170818_141903   Test ! Vdd 3.000000; P: 20.000000;T 20.282000;Part: 0; Baud Rate: 9620.009620; Message: MMS1111111100011101000000000001001000000000000000000000000001000100000000000000000000010000000000100000000000010000001000000000100000110000000000000000000000000000000000000000000000000000000000000000000000000000000000000110010010011100010100010000000001110110110010101100000000000000100000011011000000000000000000000110111110100001001111010000000001111111100001111100101100000000100010011011100001010000000000001100100100000000000000000000000000000000000010000000000000000010000000000100000010000000000000000000000000000000000000000001000100000000000001010100000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000101001011111110101110000001101000000001010001100001000101010100110100000000000001000100011000000001100000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000SS 

Unfortunately, it is neither comma- nor tab-delimited; each line is one large string.

I have read in the entire file and am trying to extract all of the binary data.

That is, I want everything between the following markers:

MMS ...... SS

I would also like to extract, for example, the values that follow P: or Vdd in regions like this:

Vdd 3.000000; P: 20.000000...........................etc

What I have tried so far:

import re

match = re.search(r'P: (\w+)', LONG_STRING)
if match:
    print(match.group(1))

However, this does not extract the complete floating-point number; it drops the decimal places.

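1 Answer:

Answer 0: (score: 1)

Answer v2.0. Overall, this code is quite rigid and not the clearest, but at the moment I cannot offer a better solution for the sample you provided.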

>>> import re

>>> that_long_row = "20170818_141903   Test ! Vdd 3.000$000; P: 20.000000;T 20.282000;Part: 0; Baud Rate: 9620.009620; Message: MMS111111110001110100000000000100100000000000000000000000000100010000000000000000000001000000000010000000000001000000100000000010000011000000000000000000000000000000000000000000000000000000000000000000000000000000000000011001001001110001010001000000000111011011001010110000000000000010000001101100000000000000000000011011111010000100111101000000000111111110000111110010110000000010001001101110000101000000000000110010010000000000000000000000000000000000001000000000000000001000000000010000001000000000000000000000000000000000000000000100010000000000000101010000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000010100101111111010111000000110100000000101000110000100010101010011010000000000000100010001100000000110000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000SS "

>>> regex = (r'^'                       # start of the string
         r'.+'                          # skip any characters
         r'Vdd '                        # until "Vdd " is reached
         r'(?P<Vdd>[0-9\.]+)'           # capture the run of digits and dots that follows into the group "Vdd"
         r'.+'                          # again, skip some more characters
         r'P: '                         # find the "P: " label
         r'(?P<P>[0-9\.]+)'             # capture the run of digits and dots into the group "P"
         r'.+'                          # skip ahead to the binary "Message" between "MMS" and "SS"
         r'MMS'
         r'(?P<Message>[0-1]+)'         # same idea, except that it only matches 0 and 1
         r'SS'
         r'.*'                          # as @Evan mentioned, this is needed to skip any trailing characters
         r'$'                           # end of the string
         )

# the same pattern in compact form:
>>> regex = r'^.+Vdd (?P<Vdd>[0-9\.]+).+P: (?P<P>[0-9\.]+).+MMS(?P<Message>[0-1]+)SS.*$'

>>> match = re.match(regex, that_long_row)

# matching will form a groupdict that is like a normal dict
# and you can access any matched group value by its name

>>> match.groupdict()
{'Vdd': '3.000', 'P': '20.000000', 'Message': ...
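As a side note on the original attempt: `\w` matches letters, digits and the underscore, but not `.`, which is why the capture stopped before the decimal places. The sketch below illustrates this and also shows a less rigid variant that searches for each field independently. The `row` string is a shortened, made-up line with the same layout as the sample, and the variable names are only illustrative.

```python
import re

# Shortened, hypothetical row with the same layout as the sample above.
row = ("20170818_141903   Test ! Vdd 3.000000; P: 20.000000;T 20.282000;"
       "Part: 0; Baud Rate: 9620.009620; Message: MMS110100SS ")

# \w does not match ".", so the capture ends at the decimal point.
truncated = re.search(r'P: (\w+)', row).group(1)

# Allowing dots in the character class captures the whole number.
full = re.search(r'P: ([0-9.]+)', row).group(1)

# Each field is searched on its own, so the pattern does not depend
# on everything else in the line being present in a fixed order.
vdd = float(re.search(r'Vdd ([0-9.]+)', row).group(1))
p = float(re.search(r'\bP: ([0-9.]+)', row).group(1))
bits = re.search(r'MMS([01]+)SS', row).group(1)

print(truncated, full, vdd, p, bits)
```

Searching field by field costs a few extra passes over the line, but a missing or reordered field then only affects its own lookup.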

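Next, if you want to parse a file this way, I would create a simple class to hold all the data and take care of type conversion, validation, and so on.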

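class Message:
    """Holds one parsed row and converts the numeric fields to float."""
    def __init__(self, Vdd, P, Message):
        self.vdd = float(Vdd)   # raises ValueError on malformed input such as "3.0.0"
        self.p = float(P)
        self.text = Message     # raw bit string between "MMS" and "SS"

data = []

with open('yourfile', 'r') as f:
    for line in f:
        match = re.match(regex, line)
        if match is None:
            # the line does not have the expected layout
            data.append('CORRUPTED')
            continue
        try:
            data.append(Message(**match.groupdict()))
        except ValueError:
            # a captured field could not be converted to float
            data.append('CORRUPTED')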

And so on.
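For reference, here is a self-contained sketch of the same parsing idea that can be run without a file. It is only one possible arrangement: `ROW_RE`, `parse_row` and `sample_lines` are names invented for this example, the sample row is shortened, and the pattern is written with `re.VERBOSE` so the comments can live inside it.

```python
import re

# Verbose pattern: whitespace inside it is ignored, so \s is used for real spaces.
ROW_RE = re.compile(r"""
    Vdd\s+(?P<Vdd>[0-9.]+)       # supply voltage after "Vdd"
    .*?\bP:\s*(?P<P>[0-9.]+)     # value after "P:"
    .*?MMS(?P<Message>[01]+)SS   # bit string between "MMS" and "SS"
""", re.VERBOSE)


def parse_row(line):
    """Return a dict with float Vdd and P plus the bit string, or None if the line is malformed."""
    match = ROW_RE.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    try:
        return {
            "Vdd": float(fields["Vdd"]),
            "P": float(fields["P"]),
            "Message": fields["Message"],
        }
    except ValueError:
        # e.g. "3.0.0" matches [0-9.]+ but is not a valid float
        return None


# Hypothetical in-memory input standing in for the lines of the real file.
sample_lines = [
    "20170818_141903   Test ! Vdd 3.000000; P: 20.000000;T 20.282000;"
    "Part: 0; Baud Rate: 9620.009620; Message: MMS0110SS \n",
    "garbage line\n",
]
rows = [parse_row(line) for line in sample_lines]
print(rows)
```

Returning `None` for a bad line keeps the "corrupted" case distinguishable from valid rows without mixing strings and objects in the same list.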