How to correctly parse elements of a complex string using regular expressions

Time: 2019-04-02 18:09:51

Tags: python regex

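I have data in certain formats that I can't parse correctly. Initially I used re.split to split on periods and conditionally re-joined certain elements, but that created other problems. I think this can be solved with a regular expression, but I don't know how to write it correctly.

The data can come in the following formats: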

STATICFIELD1.STATICFIELD2.VARIABLE1.STATICFIELD3/VARIABLE2
STATICFIELD1.STATICFIELD2.VARIABLE1.VARIABLE2.STATICFIELD3/VARIABLE3
STATICFIELD1.STATICFIELD2..VARIABLE1.STATICFIELD3/VARIABLE2
STATICFIELD1.STATICFIELD2.VARIABLE1/VARIABLE2
STATICFIELD1.STATICFIELD2..VARIABLE1/VARIABLE2

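The problem is that splitting on periods and slashes with a regular expression drops the period when a variable is preceded by one. If a variable has a leading period, I want to keep it in the string, e.g. var = ".VARIABLE1", and I also need to handle var = "VARIABLE1.VARIABLE2". I don't need to store the static fields; I only need to extract the variable fields, regardless of whether there is one variable, two variables, or one variable immediately preceded by a period.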

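I tried re.search, but could only get the first static field. I also tried re.split('\.|/', line), but then I could not parse variables with a leading period (a variable that should come out as ".car" rather than "car"), or I had to join two-part variables manually with x[2:4] = ['.'.join(x[2:4])], which I don't want to do because the total number of fields varies.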
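To make the split behavior described above concrete, here is a minimal sketch. The pattern `r"\.|/"` is assumed to be what the split on periods and slashes looks like, and `parts` is just an illustrative name.

```python
import re

line = "STATICFIELD1.STATICFIELD2..VARIABLE1/VARIABLE2"

# Splitting on every period and slash consumes the delimiters, so the
# period that belongs to ".VARIABLE1" only survives as an empty string.
parts = re.split(r"\.|/", line)
print(parts)
# ['STATICFIELD1', 'STATICFIELD2', '', 'VARIABLE1', 'VARIABLE2']
```

The empty string at index 2 is the only trace of the leading period, which is why index-based joins such as x[2:4] become fragile once the number of fields varies.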

For the given examples, the output I want is two separate variables holding the variable parts of the input:

x = VARIABLE1 y = VARIABLE2
x = VARIABLE1.VARIABLE2 y = VARIABLE3
x = .VARIABLE1 y = VARIABLE2
x = VARIABLE1 y = VARIABLE2
x = .VARIABLE1 y = VARIABLE2

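    # First attempt: split the line, then conditionally re-join elements
    x = re.split('\/', r)
    numElements = len(x)
    if x[numElements - 2] == "STATICFIELD2":
        y[x[2]] = 1
    else:
        x[2:4] = ['.'.join(x[2:4])]
        y[x[2]] = 1

    # Second attempt: re.search, which only returned the first static field
    x = re.search(r'(\bSTATICFIELD1.STATICFIELD2.\b+)(\b.STATICFIELD3/\b)', line)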

2 answers:

Answer 0: (score: 0)

You can remove the STATICFIELD patterns from the string and then do a simple split on the slash:

import re

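def splitXY(s):
    # Strip each STATICFIELD<n> plus at most one period on either side,
    # then split what is left on the slash.
    return re.sub(r"\.?STATICFIELD\d+\.?", "", s).split("/")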

x,y = splitXY("STATICFIELD1.STATICFIELD2.VARIABLE1.STATICFIELD3/VARIABLE2")
print(x,y)  # VARIABLE1 VARIABLE2
x,y = splitXY("STATICFIELD1.STATICFIELD2.VARIABLE1.VARIABLE2.STATICFIELD3/VARIABLE3")
print(x,y)  # VARIABLE1.VARIABLE2 VARIABLE3
x,y = splitXY("STATICFIELD1.STATICFIELD2..VARIABLE1.STATICFIELD3/VARIABLE2")
print(x,y)  # .VARIABLE1 VARIABLE2
x,y = splitXY("STATICFIELD1.STATICFIELD2.VARIABLE1/VARIABLE2")
print(x,y)  # VARIABLE1 VARIABLE2
x,y = splitXY("STATICFIELD1.STATICFIELD2..VARIABLE1/VARIABLE2")
print(x,y)  # .VARIABLE1 VARIABLE2
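If you would rather match the whole line with one anchored expression, the following is a rough sketch and not part of the approach above. It assumes the static fields are literally STATICFIELD1, STATICFIELD2 and an optional STATICFIELD3 in exactly these positions; `PATTERN` and `split_xy_anchored` are hypothetical names chosen for illustration.

```python
import re

# Assumption: the layout is STATICFIELD1.STATICFIELD2.<x>[.STATICFIELD3]/<y>
PATTERN = re.compile(
    r"STATICFIELD1\.STATICFIELD2\."  # fixed prefix, including its one separator
    r"(?P<x>.+?)"                    # variable part, keeps any leading period
    r"(?:\.STATICFIELD3)?"           # optional trailing static field
    r"/(?P<y>.+)"                    # everything after the slash
)

def split_xy_anchored(s):
    """Return (x, y) for one line, or None if the line does not fit the layout."""
    m = PATTERN.fullmatch(s)
    return (m.group("x"), m.group("y")) if m else None

print(split_xy_anchored("STATICFIELD1.STATICFIELD2..VARIABLE1.STATICFIELD3/VARIABLE2"))
# ('.VARIABLE1', 'VARIABLE2')
```

Because the x group is lazy, it stops before an optional ".STATICFIELD3" when one is present, and returning None makes lines that do not fit the assumed layout easy to spot.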

[Update]

If you have some logic that distinguishes STATICFIELD names from VARIABLE names, you can parse the string with split and join:

def isStatic(name): # this would be whatever logic distinguishes the names
    return name != "" and name.startswith("STATICFIELD")

def splitXY(s) :
    x,y = s.split("/")
    x =  ".".join(name for name in x.split(".") if not isStatic(name))
    return x,y

x,y = splitXY("STATICFIELD1.STATICFIELD2.VARIABLE1.STATICFIELD3/VARIABLE2")
print(x,y)  # VARIABLE1 VARIABLE2
x,y = splitXY("STATICFIELD1.STATICFIELD2.VARIABLE1.VARIABLE2.STATICFIELD3/VARIABLE3")
print(x,y)  # VARIABLE1.VARIABLE2 VARIABLE3
x,y = splitXY("STATICFIELD1.STATICFIELD2..VARIABLE1.STATICFIELD3/VARIABLE2")
print(x,y)  # .VARIABLE1 VARIABLE2
x,y = splitXY("STATICFIELD1.STATICFIELD2.VARIABLE1/VARIABLE2")
print(x,y)  # VARIABLE1 VARIABLE2
x,y = splitXY("STATICFIELD1.STATICFIELD2..VARIABLE1/VARIABLE2")
print(x,y)  # .VARIABLE1 VARIABLE2

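Make sure isStatic() returns False for empty names: the empty string that split(".") produces between two consecutive periods has to be kept so that the leading period of ".VARIABLE1" survives the join.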
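A small sketch of why the empty name matters. Here `is_static` is a stand-in for whatever logic distinguishes the names, and the other variable names are illustrative only.

```python
def is_static(name):
    # Stand-in rule: returns False for "" because "" does not start with the prefix
    return name.startswith("STATICFIELD")

parts = "STATICFIELD1.STATICFIELD2..VARIABLE1".split(".")
print(parts)  # ['STATICFIELD1', 'STATICFIELD2', '', 'VARIABLE1']

# Keeping the empty name preserves the leading period after the join.
with_empty = ".".join(p for p in parts if not is_static(p))
print(with_empty)  # .VARIABLE1

# Dropping the empty name (as a rule that treats "" as static would) loses it.
without_empty = ".".join(p for p in parts if p and not is_static(p))
print(without_empty)  # VARIABLE1
```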

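Answer 1: (score: 0)

Going by the question as asked, I assume you mean VARIABLE and STATICFIELD literally; if so, you may want to consider using findall instead.

If that is what you need, the following should work, and you can then process the result further.

Edit: Option 1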

>>> string = '''STATICFIELD1.STATICFIELD2.VARIABLE1.STATICFIELD3/VARIABLE2
STATICFIELD1.STATICFIELD2.VARIABLE1.VARIABLE2.STATICFIELD3/VARIABLE3
STATICFIELD1.STATICFIELD2..VARIABLE1.STATICFIELD3/VARIABLE2
STATICFIELD1.STATICFIELD2.VARIABLE1/VARIABLE2
STATICFIELD1.STATICFIELD2..VARIABLE1/VARIABLE2'''



>>> def isolate_variables(string):
        import re
        result = []
        for line in string.split('\n'):
            # Raw string avoids invalid-escape warnings; expects exactly two matches per line
            x,y = re.findall(r'(?i)(?:(?<=\s|\.|\/)|(?<=^))(VARIABLE[\d]+?[\.]+(?:VARIABLE[\d]*)+|(?:(?<=\s|\.|\/)|(?<=^))[\.]*VARIABLE[\d]+?)(?=[\.\/\n\ ]|$)', line)
            result.append((x,y))
        print(result)
        return result



>>> isolate_variables(string)



#OUTPUT
[('VARIABLE1', 'VARIABLE2'), ('VARIABLE1.VARIABLE2', 'VARIABLE3'), ('.VARIABLE1', 'VARIABLE2'), ('VARIABLE1', 'VARIABLE2'), ('.VARIABLE1', 'VARIABLE2')]

Option 2 - you just need to post-process the result afterwards

>>> import re


>>> string = '''STATICFIELD1.STATICFIELD2.VARIABLE1.STATICFIELD3/VARIABLE2
STATICFIELD1.STATICFIELD2.VARIABLE1.VARIABLE2.STATICFIELD3/VARIABLE3
STATICFIELD1.STATICFIELD2..VARIABLE1.STATICFIELD3/VARIABLE2
STATICFIELD1.STATICFIELD2.VARIABLE1/VARIABLE2
STATICFIELD1.STATICFIELD2..VARIABLE1/VARIABLE2'''


>>> re.findall(r'(?i)(?:(?<=\s|\.|\/)|(?<=^))(VARIABLE[\d]+?[\.]+(?:VARIABLE[\d]*)+|(?:(?<=\s|\.|\/)|(?<=^))[\.]*VARIABLE[\d]+?)(?=[\.\/\n\ ]|$)', string)



#OUTPUT
['VARIABLE1', 'VARIABLE2', 'VARIABLE1.VARIABLE2', 'VARIABLE3', '.VARIABLE1', 'VARIABLE2', 'VARIABLE1', 'VARIABLE2', '.VARIABLE1', 'VARIABLE2']
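One possible way to do that post-processing is to pair up consecutive matches. This sketch assumes every input line contributes exactly two matches, in x, y order; `flat` and `pairs` are illustrative names.

```python
# The flat list returned by findall above
flat = ['VARIABLE1', 'VARIABLE2', 'VARIABLE1.VARIABLE2', 'VARIABLE3',
        '.VARIABLE1', 'VARIABLE2', 'VARIABLE1', 'VARIABLE2',
        '.VARIABLE1', 'VARIABLE2']

# Pair each even-indexed match (x) with the odd-indexed match after it (y).
pairs = list(zip(flat[0::2], flat[1::2]))

for x, y in pairs:
    print(x, y)
```

If a line could yield more or fewer than two matches, this pairing would drift out of alignment, so matching line by line as in Option 1 is safer in that case.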