Question

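So I'm looking at parsing a code using regex, and I was wondering whether there is a simpler way than what I have so far. I'll start with an example of a string I need to parse:

T16F161A286161990200040000\r (this is data transmitted over a serial device)

First I need to check the confirmation code, which is the first 9 characters of the code. They need to be exactly T16F161A2. If those 9 characters match exactly, I need to check the next 3 characters, which must be either 861 or 37F.

If those 3 characters are 37F, I have it do something I still need to code, so we won't worry about that result.

However, if those 3 characters are 861, I need to check the 2 characters after them and see what they are. They can be 11, 14, 60, 61, F0, F1, or F2. Each of these does something different with the preceding data.

Finally, I need to loop through the remaining characters, pairing them up two at a time.

As an example of how this works, here is the code I threw together to parse the example string I posted above: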

import re

test_string = "T16F161A286161990200040000\r"

if re.match('^T16F161A2.*', test_string):
    print("Match: ", test_string)
    test_string = re.sub('^T16F161A2', '', test_string)
    if re.match('^861.*', test_string):
        print("Found '861': ", test_string)
        test_string = re.sub('^861', '', test_string)
        if re.match('^61.*', test_string):
            print("Found '61' : ", test_string)
            test_string = re.sub('^61', '', test_string)
            for i in range(6):
                if re.match('^[0-9A-F]{2}', test_string):
                    temp = re.match('^[0-9A-F]{2}', test_string).group()
                    print("Found Code: ", temp)
                test_string = re.sub('^[0-9A-F]{2}', '', test_string)

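As you can see in this code, after each step I use re.sub() to remove the part of the string I was just looking for. With that in mind, my questions are:

Is there a way to parse the string and find the data I need while keeping the string intact? Would that be more or less efficient than what I have now?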

Answer 1

You don't need regex for this task; you can use if/else blocks and some string slicing:

test_string = "T16F161A286161990200040000\r"

def process(token):
  # does stuff with 11, 14, 60, 61, F0, F1, or F2
  return

def stringToArray(token):
  # split the string into 2-character chunks
  return [token[i:i+2] for i in range(0, len(token), 2)]

if not test_string.startswith('T16F161A2'):
  print ("Does not match")
  quit()
else:
  print ("Does match")

# drop the 9-character confirmation code and the trailing "\r"
tempToken = test_string[9:].rstrip('\r')

if tempToken.startswith('861'):
  process(tempToken) # does stuff with 11, 14, 60, 61, F0, F1, or F2
  tempToken = tempToken[5:] # skip '861' plus the 2-character code

  print (stringToArray(tempToken))
else:
  pass


Answer 2

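Since you know the size of the string, I would suggest starting with:

comparing test_string[:9] == 'T16F161A2'

I would do the same for the second stage (test_string[9:12]). This comparison is actually much faster than a regex.

When the sizes are known, you can index into your string the way I did above. Unlike what you are doing now, this does not modify the string. If you still want a regex, you can run it on a slice, i.e. re.search(pattern, test_string[9:12]).

Hope this helps. :)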
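
For what it's worth, here is a rough sketch of that idea which never copies or cuts the string. It relies on two standard features: str.startswith() takes an optional start offset, and a compiled pattern's match() takes an optional pos argument. The function name check_offsets and the SUBTYPE pattern are my own names for illustration, not something from the question.

```python
import re

test_string = "T16F161A286161990200040000\r"

# The 2-character codes listed in the question (SUBTYPE is a made-up name).
SUBTYPE = re.compile(r"11|14|60|61|F0|F1|F2")

def check_offsets(s):
    """Return (kind, subtype) without modifying s.

    subtype is None for the 37F case or when no known code follows 861;
    kind is None when the confirmation code or the 3-character field is wrong.
    """
    if not s.startswith("T16F161A2"):
        return None, None
    if s.startswith("37F", 9):          # compare starting at index 9
        return "37F", None
    if s.startswith("861", 9):
        m = SUBTYPE.match(s, 12)        # start matching at index 12
        return "861", (m.group() if m else None)
    return None, None

print(check_offsets(test_string))       # ('861', '61')
```

Note that the pattern has no ^ in it: match() already anchors at the position you give it, and ^ would only match at the real start of the string.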

Answer 3

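Assuming the string is the same length every time and the data sits at the same indices, you can use string slicing with []. To get the first 9 characters, use test_string[:9]. You can assign the slices to variables to make them easier to check:

confirmation_code = test_string[:9]   # first 9 characters
nextThree = test_string[9:12]         # the 3 characters after them
#check values

Slicing is built into Python, so it is reasonable to expect it to be quite efficient.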
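
If you want to check the efficiency claim on your own machine rather than take it on faith, a quick timeit comparison is one way to do it. This is only a sketch: the helper names by_slice and by_regex are mine, and the actual numbers will depend on your Python version and hardware, so I am not quoting any.

```python
import re
import timeit

test_string = "T16F161A286161990200040000\r"
prefix_re = re.compile(r"T16F161A2")

def by_slice():
    # Plain slice comparison of the first 9 characters.
    return test_string[:9] == "T16F161A2"

def by_regex():
    # Same check with a precompiled pattern.
    return prefix_re.match(test_string) is not None

# Both checks should agree before it makes sense to time them.
assert by_slice() == by_regex()

slice_time = timeit.timeit(by_slice, number=100_000)
regex_time = timeit.timeit(by_regex, number=100_000)
print(f"slice: {slice_time:.4f}s  regex: {regex_time:.4f}s")
```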

Answer 4

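If you want to stick with regex, you can do it like this:

import re

pattern = re.compile(r'^T16F161A2((861)|37F)(?(2)(11|14|60|61|F0|F1|F2)|[0-9A-F]{2})([0-9A-F]{12})$')
# strip the trailing "\r" first, since $ does not match in front of it
match_result = pattern.match(test_string.strip())

In this case, you can check whether match_result is a valid match object (if it is None, the pattern did not match). The match object will contain 4 groups:
- the first 3 characters (861 or 37F)
- useless data (ignore this one)
- the 2-character code when the first group is 861 (None otherwise)
- the last 12 digits

Splitting the last 12 digits into pairs is a one-liner:

last_12_digits = match_result.group(4)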
last_digits = [last_12_digits[i:i+2] for i in range(0, len(last_12_digits), 2)]
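
To make the group numbering concrete, here is a small self-contained run of that conditional pattern on the sample string and on a 37F string. The 37F input is made up by me purely to exercise the other branch; the question does not show one. The trailing "\r" is stripped first because $ only tolerates a trailing "\n".

```python
import re

pattern = re.compile(
    r'^T16F161A2((861)|37F)(?(2)(11|14|60|61|F0|F1|F2)|[0-9A-F]{2})([0-9A-F]{12})$'
)

def split_groups(line):
    # Returns the 4 captured groups, or None when the line does not match.
    m = pattern.match(line.strip())
    return m.groups() if m else None

sample = split_groups("T16F161A286161990200040000\r")
print(sample)   # ('861', '861', '61', '990200040000')

# Hypothetical 37F input: groups 2 and 3 did not take part in the match.
other = split_groups("T16F161A237FAB112233445566\r")
print(other)    # ('37F', None, None, '112233445566')

# The last group (number 4) holds the 12 digits to pair up.
pairs = [sample[3][i:i+2] for i in range(0, len(sample[3]), 2)]
print(pairs)    # ['99', '02', '00', '04', '00', '00']
```

The tuple from groups() is 0-indexed, so sample[3] there is the same thing as group(4) on the match object.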

Answer 5

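You really don't need regex here. Since you know exactly what you are looking for and where in the string it should be, you can use slicing and a few if/elif/else statements. Something like this: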

s = test_string.strip()
code, x, y, rest = s[:9], s[9:12], s[12:14], [s[i:i+2] for i in range(14, len(s), 2)]
# T16F161A2, 861, 61, ['99', '02', '00', '04', '00', '00']

if code == "T16F161A2":
    if x == "37F":
        ...  # handle the 37F case
    elif x == "861":
        if y == "11":
            ...
        elif y == "61":
            ...  # do stuff with rest
    else:
        ...  # invalid
else:
    ...  # invalid
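
If the chain of checks on y gets long (there are seven possible codes), one alternative is a dictionary that maps each code to a handler function. This is just a sketch of that design choice, not what the answer above does: HANDLERS, handle_61, handle_default and dispatch are all names I invented, and the handler bodies are placeholders because the question does not say what each code should do.

```python
def handle_61(pairs):
    # Placeholder behaviour: just report what was received.
    return "61 handler got " + ",".join(pairs)

def handle_default(pairs):
    # Fallback for codes that have no handler yet.
    return "no handler yet"

# Hypothetical dispatch table: 2-character code -> handler function.
HANDLERS = {"61": handle_61}

def dispatch(line):
    s = line.strip()
    code, x, y = s[:9], s[9:12], s[12:14]
    rest = [s[i:i+2] for i in range(14, len(s), 2)]
    if code != "T16F161A2" or x != "861":
        return None  # invalid, or the 37F case, which is out of scope here
    return HANDLERS.get(y, handle_default)(rest)

print(dispatch("T16F161A286161990200040000\r"))  # 61 handler got 99,02,00,04,00,00
```

Adding support for another code is then a matter of writing one function and adding one entry to the table.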

Answer 6

Maybe something like this:

import re

regex = r'^T16F161A2(861|37F)(11|14|60|61|F0|F1|F2)(.{2})(.{2})(.{2})(.{2})(.{2})(.{2})$'
string = 'T16F161A286161990200040000'

print(re.match(regex, string).groups())

This uses capture groups, which avoids having to create a bunch of new strings.
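
A possible variation on the same idea uses named groups, so the fields come back as a dictionary instead of by position. Treat it as a sketch: the group names kind, subtype and payload and the function parse_frame are my own, and I am assuming the payload is always exactly six hex pairs as in the sample string.

```python
import re

FRAME = re.compile(
    r"T16F161A2"
    r"(?P<kind>861|37F)"
    r"(?P<subtype>11|14|60|61|F0|F1|F2)"
    r"(?P<payload>(?:[0-9A-F]{2}){6})"   # assumption: six 2-character pairs
)

def parse_frame(line):
    # fullmatch() requires the whole (stripped) line to fit the pattern.
    m = FRAME.fullmatch(line.rstrip("\r\n"))
    if m is None:
        return None
    fields = m.groupdict()
    # Split the 12-character payload into 2-character pairs.
    fields["payload"] = re.findall(r"[0-9A-F]{2}", fields["payload"])
    return fields

print(parse_frame("T16F161A286161990200040000\r"))
# {'kind': '861', 'subtype': '61', 'payload': ['99', '02', '00', '04', '00', '00']}
```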

Answer 7

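The re module is less efficient than direct substring access, but it can save you from writing (and maintaining) some lines of code. If you do want to use it, though, you should match the whole string: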

import re

test_string = "T16F161A286161990200040000\r"

rx = re.compile(r'T16F161A2(?:(?:(37F)(.*))|(?:(861)(11|14|60|61|F0|F1|F2)(.*)))\r')
m = rx.match(test_string)      # => 5 groups, first 2 if 37F, last 3 if 861

if m is None:                  # string does not match:
    ...
elif m.group(1) is None:       # 861 type
    subtype = m.group(4)       # extract subtype
    # and group remaining characters by pairs
    elts = [ m.group(5)[i:i+2] for i in range(0, len(m.group(5)), 2) ]
    ...                        # process that
else:                          # 37F type
    ...

Python, complex regex parser

7 answers: