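Splitting a string with a regular expression when the text contains the delimiter pattern

Asked: 2018-11-22 14:17:04

Tags: python regex token

I have many strings that need to be split on commas. Examples: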

myString = r'test,Test,NEAR(this,that,DISTANCE=4),test again,"another test"'
myString = r'test,Test,FOLLOWEDBY(this,that,DISTANCE=4),test again,"another test"'

The output I want is:

["test", "Test", "NEAR(this,that,DISTANCE=4)", "test again", '"another test"']  # list length = 5

I don't know how to keep the commas inside "this,that,DISTANCE=4" so that the whole function call stays in a single item. I tried:

l = re.compile(r',').split(myString)  # matches every comma, including the ones inside the parentheses
l = re.compile(r'(?<!\(),(?=\))').split(myString)  # negative lookbehind / positive lookahead: no matches at all

Any ideas? Assume the list of allowed "functions" is defined as:

f = ["NEAR","FOLLOWEDBY","AND","OR","MAX"]

2 Answers:

Answer 0 (score: 2)

You can use

(?:\([^()]*\)|[^,])+

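See the regex demo.

The (?:\([^()]*\)|[^,])+ pattern matches one or more occurrences of either a parenthesized substring that contains no ( or ) inside it, or any single character other than a comma. Because the parenthesized alternative is tried first, commas inside a function call are consumed as part of the item instead of ending it.

See the Python demo: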

import re
rx = r"(?:\([^()]*\)|[^,])+"
s = 'test,Test,NEAR(this,that,DISTANCE=4),test again,"another test"'
print(re.findall(rx, s))
# => ['test', 'Test', 'NEAR(this,that,DISTANCE=4)', 'test again', '"another test"']
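The pattern above assumes that parentheses are never nested and that no field is empty. If either can occur in your data, a small depth-counting loop is one possible alternative. This is only a sketch under those assumptions, not part of the original answer; the name split_top_level is hypothetical.

```python
def split_top_level(text, sep=","):
    """Split text on sep, ignoring separators inside parentheses.

    Hypothetical helper for illustration: it tracks parenthesis depth
    instead of using a regex, so nested calls stay in one piece.
    """
    parts, current, depth = [], [], 0
    for ch in text:
        if ch == "(":
            depth += 1
        elif ch == ")":
            # Clamp at zero so a stray ")" does not make the depth negative.
            depth = max(depth - 1, 0)
        if ch == sep and depth == 0:
            # A separator outside all parentheses ends the current item.
            parts.append("".join(current))
            current = []
        else:
            current.append(ch)
    parts.append("".join(current))
    return parts


s = 'test,Test,NEAR(this,that,DISTANCE=4),test again,"another test"'
print(split_top_level(s))
print(split_top_level("f(a,(b,c)),d"))
```

Unlike re.findall with the pattern above, this version keeps empty fields (for example, 'a,,b' yields three items), which may or may not be what you want.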

Answer 1 (score: 0)

If you explicitly want to specify which strings count as functions, you need to build the regex dynamically. Otherwise, use Wiktor's solution.

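>>> import re
>>> functions = ["NEAR","FOLLOWEDBY","AND","OR","MAX"]
>>> # one alternative per function name: NAME( ... ) with no closing parenthesis inside
>>> funcs = '|'.join(r'{}\([^)]+\)'.format(re.escape(f)) for f in functions)
>>> # the capturing group makes re.split keep matched function calls; a plain comma match leaves the group as None
>>> regex = '({})|,'.format(funcs)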
>>>
>>> myString1 = 'test,Test,NEAR(this,that,DISTANCE=4),test again,"another test"'
>>> list(filter(None, re.split(regex, myString1)))
['test', 'Test', 'NEAR(this,that,DISTANCE=4)', 'test again', '"another test"']
>>> myString2 = 'test,Test,FOLLOWEDBY(this,that,DISTANCE=4),test again,"another test"'
>>> list(filter(None, re.split(regex, myString2)))
['test',
 'Test',
 'FOLLOWEDBY(this,that,DISTANCE=4)',
 'test again',
 '"another test"']
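The same idea can also be combined with the re.findall approach from the other answer, which avoids filtering out the None and empty strings that re.split produces. The following is a minimal sketch, assuming function arguments never contain nested parentheses; the helper name split_keeping_functions is hypothetical.

```python
import re


def split_keeping_functions(text, functions):
    """Split text on commas, but keep NAME(...) together for listed names only."""
    # Escape each name so any regex metacharacters in it are treated literally.
    names = "|".join(re.escape(name) for name in functions)
    # Each step consumes either a whole listed call (no nested parentheses)
    # or one character that is not a comma.
    pattern = r"(?:(?:{})\([^()]*\)|[^,])+".format(names)
    return re.findall(pattern, text)


functions = ["NEAR", "FOLLOWEDBY", "AND", "OR", "MAX"]
s = 'test,Test,NEAR(this,that,DISTANCE=4),test again,"another test"'
result = split_keeping_functions(s, functions)
print(result)
```

With this sketch, a call whose name is not in the list, such as FOO(b,c), is still split at its inner comma, which is the behaviour an explicit whitelist implies.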