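I have data in a handful of formats that I can't parse correctly. Initially I used re.split to split on periods and conditionally rejoined certain elements, but that created other problems. I think a regular expression could solve this, but I don't know how to write it correctly.
The data can take any of the following formats: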
STATICFIELD1.STATICFIELD2.VARIABLE1.STATICFIELD3/VARIABLE2
STATICFIELD1.STATICFIELD2.VARIABLE1.VARIABLE2.STATICFIELD3/VARIABLE3
STATICFIELD1.STATICFIELD2..VARIABLE1.STATICFIELD3/VARIABLE2
STATICFIELD1.STATICFIELD2.VARIABLE1/VARIABLE2
STATICFIELD1.STATICFIELD2..VARIABLE1/VARIABLE2
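The problem is that splitting with a regex on periods and slashes drops the period when a variable is preceded by one. If a variable is preceded by a period, I want to keep it in the string, e.g. var = ".VARIABLE1", and likewise var = "VARIABLE1.VARIABLE2". I don't need to store the static fields; I only need to extract the variable fields, whether there is one variable, two variables, or one with a leading period.
I tried re.search, but it only captured the first static field. I tried re.split('\.|/', line), but then I either could not recover a variable with a leading period (I got "car" instead of ".car"), or I had to rejoin a two-part variable manually with ['.'.join(x[2:4])], which I don't want to do because the total number of fields varies.
For the given examples, my desired output is two separate variables holding the variable fields from the input: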
x = VARIABLE1 y = VARIABLE2
x = VARIABLE1.VARIABLE2 y = VARIABLE3
x = .VARIABLE1 y = VARIABLE2
x = VARIABLE1 y = VARIABLE2
x = .VARIABLE1 y = VARIABLE2
x = re.split('\/', r)
numElements = len(x)
if(x[(numElements - 2)] == "STATICFIELD2"):
    y[x[2]] = 1
else:
    x[2:4] = ['.'.join(x[2:4])]
    y[x[2]] = 1
x = re.search(r'(\bSTATICFIELD1.STATICFIELD2.\b+)(\b.STATICFIELD3/\b)',line)
Answer 0 (score: 0)
You can remove the STATICFIELD patterns from the string and then do a simple split on the slash:
import re
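def splitXY(s):
    # Remove each STATICFIELD<n> plus at most one adjacent period on either side,
    # then split what is left on the slash.
    return re.sub(r"\.?STATICFIELD\d+\.?", "", s).split("/")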
x,y = splitXY("STATICFIELD1.STATICFIELD2.VARIABLE1.STATICFIELD3/VARIABLE2")
print(x,y) # VARIABLE1 VARIABLE2
x,y = splitXY("STATICFIELD1.STATICFIELD2.VARIABLE1.VARIABLE2.STATICFIELD3/VARIABLE3")
print(x,y) # VARIABLE1.VARIABLE2 VARIABLE3
x,y = splitXY("STATICFIELD1.STATICFIELD2..VARIABLE1.STATICFIELD3/VARIABLE2")
print(x,y) # .VARIABLE1 VARIABLE2
x,y = splitXY("STATICFIELD1.STATICFIELD2.VARIABLE1/VARIABLE2")
print(x,y) # VARIABLE1 VARIABLE2
x,y = splitXY("STATICFIELD1.STATICFIELD2..VARIABLE1/VARIABLE2")
print(x,y) # .VARIABLE1 VARIABLE2
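The same result can also be reached with a single anchored pattern rather than a substitution, which is closer to the re.search attempt in the question. The following is only a sketch and not part of the answer above: it assumes the static fields are literally STATICFIELD1, STATICFIELD2 and an optional STATICFIELD3 directly before the slash, and the names LINE_RE and extract_xy are made up for illustration.

```python
import re

# Assumption: every line starts with "STATICFIELD1.STATICFIELD2." and
# STATICFIELD3, when present, sits directly before the slash.
LINE_RE = re.compile(
    r"STATICFIELD1\.STATICFIELD2\."  # fixed prefix, including its trailing period
    r"(?P<x>.+?)"                    # first variable part; may start with a period
    r"(?:\.STATICFIELD3)?"           # optional static field before the slash
    r"/(?P<y>.+)"                    # everything after the slash
)

def extract_xy(line):
    """Return (x, y) for one line, or None if the line does not fit the pattern."""
    match = LINE_RE.fullmatch(line)
    if match is None:
        return None
    return match.group("x"), match.group("y")

samples = [
    "STATICFIELD1.STATICFIELD2.VARIABLE1.STATICFIELD3/VARIABLE2",
    "STATICFIELD1.STATICFIELD2.VARIABLE1.VARIABLE2.STATICFIELD3/VARIABLE3",
    "STATICFIELD1.STATICFIELD2..VARIABLE1.STATICFIELD3/VARIABLE2",
    "STATICFIELD1.STATICFIELD2.VARIABLE1/VARIABLE2",
    "STATICFIELD1.STATICFIELD2..VARIABLE1/VARIABLE2",
]
for line in samples:
    print(extract_xy(line))
```

Because the x group is lazy, it stops at the first position where the optional STATICFIELD3 and the slash can follow, so a leading period or an inner period in the variable part is kept. A line that does not match returns None instead of raising on unpacking.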
[Update]
If you have some logic that distinguishes STATICFIELD names from VARIABLE names, you can parse the string with split and join:
def isStatic(name):  # this would be whatever logic distinguishes the names
    return name != "" and name.startswith("STATICFIELD")

def splitXY(s):
    x, y = s.split("/")
    # Keep only the non-static pieces; an empty piece (from "..") survives,
    # so the join restores the leading period.
    x = ".".join(name for name in x.split(".") if not isStatic(name))
    return x, y
x,y = splitXY("STATICFIELD1.STATICFIELD2.VARIABLE1.STATICFIELD3/VARIABLE2")
print(x,y) # VARIABLE1 VARIABLE2
x,y = splitXY("STATICFIELD1.STATICFIELD2.VARIABLE1.VARIABLE2.STATICFIELD3/VARIABLE3")
print(x,y) # VARIABLE1.VARIABLE2 VARIABLE3
x,y = splitXY("STATICFIELD1.STATICFIELD2..VARIABLE1.STATICFIELD3/VARIABLE2")
print(x,y) # .VARIABLE1 VARIABLE2
x,y = splitXY("STATICFIELD1.STATICFIELD2.VARIABLE1/VARIABLE2")
print(x,y) # VARIABLE1 VARIABLE2
x,y = splitXY("STATICFIELD1.STATICFIELD2..VARIABLE1/VARIABLE2")
print(x,y) # .VARIABLE1 VARIABLE2
Make sure isStatic() returns False for empty names.
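One way to write such a predicate is a lookup against a set of known static names. This is a sketch under the assumption that the static names are known up front; STATIC_NAMES, is_static and split_xy are hypothetical names used only for illustration. The empty string is not in the set, so the empty piece produced by a double period passes the filter and the join puts the leading period back.

```python
# Assumption: the complete set of static field names is known in advance.
STATIC_NAMES = {"STATICFIELD1", "STATICFIELD2", "STATICFIELD3"}

def is_static(name):
    # "" is never in STATIC_NAMES, so empty pieces are treated as non-static.
    return name in STATIC_NAMES

def split_xy(s):
    # Split once on the slash: the left side holds static fields and x, the right side is y.
    left, y = s.split("/", 1)
    x = ".".join(part for part in left.split(".") if not is_static(part))
    return x, y

print(split_xy("STATICFIELD1.STATICFIELD2..VARIABLE1/VARIABLE2"))
print(split_xy("STATICFIELD1.STATICFIELD2.VARIABLE1.VARIABLE2.STATICFIELD3/VARIABLE3"))
```

If the predicate returned True for the empty string, the first call would yield "VARIABLE1" rather than ".VARIABLE1", which is exactly the lost-period problem described in the question.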
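Answer 1 (score: 0)
Given the question as asked, I suspect you mean VARIABLE and STATICFIELD literally. If so, you may want to consider using findall instead.
If that is what you need, the code below should work, and you can post-process the result from there.
Edit: Option 1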
>>> string = '''STATICFIELD1.STATICFIELD2.VARIABLE1.STATICFIELD3/VARIABLE2
STATICFIELD1.STATICFIELD2.VARIABLE1.VARIABLE2.STATICFIELD3/VARIABLE3
STATICFIELD1.STATICFIELD2..VARIABLE1.STATICFIELD3/VARIABLE2
STATICFIELD1.STATICFIELD2.VARIABLE1/VARIABLE2
STATICFIELD1.STATICFIELD2..VARIABLE1/VARIABLE2'''
>>> import re
>>> def isolate_variables(string):
    result = []
    # Each line is expected to yield exactly two matches: x, then y.
    for line in string.split('\n'):
        x, y = re.findall(r'(?i)(?:(?<=\s|\.|\/)|(?<=^))(VARIABLE[\d]+?[\.]+(?:VARIABLE[\d]*)+|(?:(?<=\s|\.|\/)|(?<=^))[\.]*VARIABLE[\d]+?)(?=[\.\/\n\ ]|$)', line)
        result.append((x, y))
    return result
>>> isolate_variables(string)
#OUTPUT
[('VARIABLE1', 'VARIABLE2'), ('VARIABLE1.VARIABLE2', 'VARIABLE3'), ('.VARIABLE1', 'VARIABLE2'), ('VARIABLE1', 'VARIABLE2'), ('.VARIABLE1', 'VARIABLE2')]
Option 2: run findall over the whole string; you then only need to post-process the result
>>> import re
>>> string = '''STATICFIELD1.STATICFIELD2.VARIABLE1.STATICFIELD3/VARIABLE2
STATICFIELD1.STATICFIELD2.VARIABLE1.VARIABLE2.STATICFIELD3/VARIABLE3
STATICFIELD1.STATICFIELD2..VARIABLE1.STATICFIELD3/VARIABLE2
STATICFIELD1.STATICFIELD2.VARIABLE1/VARIABLE2
STATICFIELD1.STATICFIELD2..VARIABLE1/VARIABLE2'''
>>> re.findall(r'(?i)(?:(?<=\s|\.|\/)|(?<=^))(VARIABLE[\d]+?[\.]+(?:VARIABLE[\d]*)+|(?:(?<=\s|\.|\/)|(?<=^))[\.]*VARIABLE[\d]+?)(?=[\.\/\n\ ]|$)', string)
#OUTPUT
['VARIABLE1', 'VARIABLE2', 'VARIABLE1.VARIABLE2', 'VARIABLE3', '.VARIABLE1', 'VARIABLE2', 'VARIABLE1', 'VARIABLE2', '.VARIABLE1', 'VARIABLE2']
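The post-processing for Option 2 can be as small as regrouping the flat list into consecutive pairs. A minimal sketch, assuming every input line contributes exactly one x followed by one y, so the list has even length and is in line order; pair_up is a name introduced here for illustration.

```python
def pair_up(flat):
    """Group a flat [x1, y1, x2, y2, ...] list into [(x1, y1), (x2, y2), ...]."""
    if len(flat) % 2:
        # An odd count means some line did not produce exactly one x and one y.
        raise ValueError("expected an even number of matches")
    return list(zip(flat[0::2], flat[1::2]))

flat = ['VARIABLE1', 'VARIABLE2', 'VARIABLE1.VARIABLE2', 'VARIABLE3',
        '.VARIABLE1', 'VARIABLE2', 'VARIABLE1', 'VARIABLE2',
        '.VARIABLE1', 'VARIABLE2']
print(pair_up(flat))
```

For the list shown above this gives the same tuples as Option 1. The length check only catches an odd total; if one line yields three matches and another yields one, the pairs would be silently misaligned, so Option 1's per-line unpacking is the safer choice for untrusted input.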