I want to split the following string on the word 'and', except where 'and' appears inside quotes:
string = "section_category_name = 'computer and equipment expense' and date >= 2015-01-01 and date <= 2015-03-31"
Desired result:
["section_category_name = 'computer and equipment expense'","date >= 2015-01-01","date <= 2015-03-31"]
I can't seem to find the right regex pattern to split the string so that 'computer and equipment expense' is not split.
Here is what I tried:
re.split('and',string)
Result:
[" section_category_name = 'computer "," equipment expense' ",' date >= 2015-01-01 ',' date <= 2015-03-31']
As you can see, the result has split 'computer and equipment expense' into separate items in the list.
I also tried the following from this question:
r = re.compile('(?! )[^[]+?(?= *\[)'
'|'
'\[.+?\]')
r.findall(s)
Result:
[]
I also tried the following from another question:
result = re.split(r"and+(?=[^()]*(?:\(|$))", string)
Result:
[" section_category_name = 'computer ",
" equipment expense' ",
' date >= 2015-01-01 ',
' date <= 2015-03-31']
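The challenge is that previous questions on this topic do not address how to split a string on a word while ignoring occurrences inside quotes; they only cover splitting on special characters or whitespace.
If I modify the string to use parentheses instead, I get the result I want:
string = " section_category_name = (computer and equipment expense) and date >= 2015-01-01 and date <= 2015-03-31"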
result = re.split(r"and+(?=[^()]*(?:\(|$))", string)
Result:
[' section_category_name = (computer and equipment expense) ',
' date >= 2015-01-01 ',
' date <= 2015-03-31']
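But what I need is to avoid splitting on 'and' inside apostrophes rather than inside parentheses.
Answer 0: (score: 1)
You can use re.findall to generate a list of 2-tuples, where the first element is either a quoted string or empty, and the second element is either a run of non-whitespace characters or empty.
You can then use itertools.groupby to partition on the word "and" (when it is not inside a quoted string), and then rejoin the populated elements in a list-comp, for example: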
import re
from itertools import groupby
text = "section_category_name = 'computer and equipment expense' and date >= 2015-01-01 and date <= 2015-03-31 and blah = 'ooops'"
items = [
    ' '.join(el[0] or el[1] for el in g)
    for k, g in groupby(re.findall(r"('.*?')|(\S+)", text), lambda L: L[1] == 'and')
    if not k
]
Gives you:
["section_category_name = 'computer and equipment expense'",
'date >= 2015-01-01',
'date <= 2015-03-31',
"blah = 'ooops'"]
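Note that whitespace outside the quoted strings is also normalized, which may or may not be desirable.
Also note that this gives you some flexibility in the grouping: if needed, you can change lambda L: L[1] == 'and' to lambda L: L[1] in ('and', 'or') to split on different words.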
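As a rough sketch of that variation, the same findall/groupby approach can be wrapped in a function that takes the set of separator words. The function name split_on_keywords, its keywords parameter, and the sample string are made up for illustration and are not part of the answer above.

```python
import re
from itertools import groupby


def split_on_keywords(text, keywords=('and', 'or')):
    """Split text on bare keyword tokens, leaving quoted strings intact.

    Hypothetical helper: the name and the keywords parameter are assumptions.
    """
    # Each findall item is a 2-tuple: (quoted string or '', bare token or '').
    tokens = re.findall(r"('.*?')|(\S+)", text)
    return [
        ' '.join(quoted or bare for quoted, bare in group)
        # A quoted token has an empty second element, so it is never a keyword.
        for is_keyword, group in groupby(tokens, lambda t: t[1] in keywords)
        if not is_keyword
    ]


text = "name = 'salt and pepper' or date >= 2015-01-01 and flag = 'this or that'"
print(split_on_keywords(text))
```

As with the original, runs of whitespace outside the quotes collapse to single spaces.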
Answer 1: (score: 0)
You can use the following regex with re.findall:
((?:(?!\band\b)[^'])*(?:'[^'\\]*(?:\\.[^'\\]*)*'(?:(?!\band\b)[^'])*)*)(?:and|$)
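See the regex demo.
The regex consists of an unrolled sequence of anything other than ' up to the first and (using the tempered greedy token (?:(?!\band\b)[^'])*), plus anything between and including single apostrophes, with support for escaped characters (using '[^'\\]*(?:\\.[^'\\]*)*', which is an unrolled version of ([^'\\]|\\.)*).
Python code demo: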
import re
p = re.compile(r'((?:(?!\band\b)[^\'])*(?:\'[^\'\\]*(?:\\.[^\'\\]*)*\'(?:(?!\band\b)[^\'])*)*)(?:and|$)')
s = "section_category_name = 'computer and equipment expense' and date >= 2015-01-01 and date <= 2015-03-31"
print([x for x in p.findall(s) if x])
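To illustrate the escaped-quote support mentioned above, here is a small sketch that runs the same pattern on a string containing backslash-escaped apostrophes. The helper name split_clauses, the sample string, and the added strip() of each piece are my own assumptions, not part of the answer.

```python
import re

# Same pattern as above, split across lines only for readability.
pattern = re.compile(
    r"((?:(?!\band\b)[^'])*"
    r"(?:'[^'\\]*(?:\\.[^'\\]*)*'(?:(?!\band\b)[^'])*)*)"
    r"(?:and|$)"
)


def split_clauses(text):
    # Hypothetical helper: drop empty matches and trim surrounding whitespace.
    return [part.strip() for part in pattern.findall(text) if part.strip()]


# Raw string, so the backslashes before the inner apostrophes are literal.
sample = r"title = 'rock \'n\' roll and blues' and year >= 2015"
print(split_clauses(sample))
```

The escaped apostrophes do not end the quoted section, so the 'and' before "blues" is left alone.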
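Answer 2: (score: 0)
If all of your strings follow the same pattern, you can use a regex that divides the match into 3 groups. The first group runs from the start of the string to the last '. The next group is everything between the first and last "and". The last group is the rest of the text.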
import re
string = "section_category_name = 'computer and equipment expense' and date >= 2015-01-01 and date <= 2015-03-31"
output = re.match(r"(^.+['].+['])\sand\s(.+)\sand\s(.+)", string).groups()
print(output)
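Each group is defined by a pair of parentheses in the regex. The square brackets define a specific character to match. This example only works when "section_category_name" is equal to something inside single quotes, and when exactly two further conditions follow: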
section_category_name = 'something here' and ...
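Because the pattern is tied to that exact shape, re.match returns None for any other shape, and calling .groups() on it would raise AttributeError. A minimal sketch of guarding against that follows; the names FIXED_SHAPE and split_fixed_shape are hypothetical.

```python
import re

# Same fixed-shape pattern as in the answer above.
FIXED_SHAPE = re.compile(r"(^.+['].+['])\sand\s(.+)\sand\s(.+)")


def split_fixed_shape(text):
    """Return the three clauses as a list, or None if the shape differs."""
    match = FIXED_SHAPE.match(text)
    return list(match.groups()) if match else None


three_clauses = ("section_category_name = 'computer and equipment expense' "
                 "and date >= 2015-01-01 and date <= 2015-03-31")
two_clauses = ("section_category_name = 'computer and equipment expense' "
               "and date >= 2015-01-01")

print(split_fixed_shape(three_clauses))
print(split_fixed_shape(two_clauses))  # no second unquoted "and", so no match
```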
Answer 3: (score: 0)
The following code works, and it doesn't need a crazy regex to do it.
import re
# We create a "lexer" using regex. This will match strings surrounded by single quotes,
# words without any whitespace in them, and the end of the string. We then use finditer()
# to grab all non-overlapping tokens.
lexer = re.compile(r"'[^']*'|[^ ]+|$")
string = "section_category_name = 'computer and equipment expense' and date >= 2015-01-01 and date <= 2015-03-31"
results = []
buff = []
# Iterate through all the tokens our lexer identified and parse accordingly
for match in lexer.finditer(string):
    token = match.group(0)  # group 0 is the entire matching string
    if token in ('and', ''):
        # Once we reach 'and' or the end of the string '' (matched by $),
        # we join all previous tokens with a space and add to our results.
        results.append(' '.join(buff))
        buff = []  # Reset for the next set of tokens
    else:
        buff.append(token)
print(results)
Edit: here is a more concise version that effectively replaces the for loop in the code above with itertools.groupby.
import re
from itertools import groupby
string = "section_category_name = 'computer and equipment expense' and date >= 2015-01-01 and date <= 2015-03-31"
lexer = re.compile(r"'[^']*'|[^\s']+")
grouping = groupby(lexer.findall(string), lambda x: x == 'and')
results = [ ' '.join(g) for k, g in grouping if not k ]
print(results)
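To see why the grouping step is safe, it may help to look at the intermediate token list. The sketch below only prints what the second lexer produces for the sample string; the variable name tokens is mine.

```python
import re

# Second lexer from the answer: quoted strings, or runs without whitespace/quotes.
lexer = re.compile(r"'[^']*'|[^\s']+")
string = ("section_category_name = 'computer and equipment expense' "
          "and date >= 2015-01-01 and date <= 2015-03-31")

tokens = lexer.findall(string)
print(tokens)
```

The quoted string arrives as a single token, so the only standalone 'and' tokens are the two real separators.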
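Answer 4: (score: 0)
I would just rely on the fact that re.split has this feature:
If capturing parentheses are used in pattern, then the text of all groups in the pattern are also returned as part of the resulting list.
Combining a capturing alternative (the quoted string) with a non-capturing one (the bare and) returns a list of strings separated by None.
This keeps the regex simple, although some post-merging is needed.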
>>> tokens = re.split(r"('[^']*')|and", string)
# ['section_category_name = ', "'computer and equipment expense'", ' ', None, ' date >= 2015-01-01 ', None, ' date <= 2015-03-31']
>>> ''.join([t if t else '\0' for t in tokens]).split('\0')
["section_category_name = 'computer and equipment expense' ", ' date >= 2015-01-01 ', ' date <= 2015-03-31']
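Note that the 0x00 char is used there as a temporary separator, so this won't work properly if you need to handle strings that contain null characters.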
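If null characters are a concern, one possible alternative is to merge the pieces in a loop and treat None itself as the separator, so no sentinel character is needed. This is a sketch rather than the answer's method; the function name split_outside_quotes is hypothetical.

```python
import re


def split_outside_quotes(text):
    """Split on 'and' outside single quotes, using None as the separator."""
    # A quoted match contributes its text; a bare 'and' contributes None.
    parts = re.split(r"('[^']*')|and", text)
    results, current = [], []
    for part in parts:
        if part is None:
            # Bare 'and' was matched: close the current clause.
            results.append(''.join(current))
            current = []
        else:
            current.append(part)
    results.append(''.join(current))
    return results


string = ("section_category_name = 'computer and equipment expense' "
          "and date >= 2015-01-01 and date <= 2015-03-31")
print(split_outside_quotes(string))
```

Like the original, this keeps the whitespace around each clause, and it still splits on 'and' inside longer words such as "band".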
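Answer 5: (score: 0)
I'm not sure what you want to do about the whitespace around and, or about repeated occurrences of and in the string. What should happen if your string is 'hello and and bye' or 'hello andand bye'?
I haven't tested all the corner cases, and I strip the whitespace around "and", which may or may not be what you want: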
string = "section_category_name = 'computer and equipment expense' and date >= 2015-01-01 and date <= 2015-03-31"
res = []
spl = 'and'
for idx, sub in enumerate(string.split("'")):
    if idx % 2 == 0:
        subsub = sub.split(spl)
        for jdx in range(1, len(subsub) - 1):
            subsub[jdx] = subsub[jdx].strip()
        if len(subsub) > 1:
            subsub[0] = subsub[0].rstrip()
            subsub[-1] = subsub[-1].lstrip()
        res += [i for i in subsub if i.strip()]
    else:
        quoted_str = "'" + sub + "'"
        if res:
            res[-1] += quoted_str
        else:
            res.append(quoted_str)
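A simpler solution, if you know that and will always have a space on either side, will not be repeated, and you don't want to strip extra whitespace: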
string = "section_category_name = 'computer and equipment expense' and date >= 2015-01-01 and date <= 2015-03-31"
spl = 'and'
res = []
spaced_spl = ' ' + spl + ' '
for idx, sub in enumerate(string.split("'")):
    if idx % 2 == 0:
        res += [i for i in sub.split(spaced_spl) if i.strip()]
    else:
        quoted_str = "'" + sub + "'"
        if res:
            res[-1] += quoted_str
        else:
            res.append(quoted_str)
Output:
["section_category_name = 'computer and equipment expense'", 'date >= 2015-01-01', 'date <= 2015-03-31']
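To make the corner cases raised at the start of this answer concrete, the simpler variant can be wrapped in a function and run on those two strings. The function name split_on_spaced_and and its word parameter are made up for this sketch.

```python
def split_on_spaced_and(text, word='and'):
    """Simpler variant from above, wrapped in a hypothetical helper."""
    res = []
    delimiter = ' ' + word + ' '
    # Even-indexed pieces are outside quotes, odd-indexed pieces are inside.
    for idx, sub in enumerate(text.split("'")):
        if idx % 2 == 0:
            res += [piece for piece in sub.split(delimiter) if piece.strip()]
        else:
            quoted = "'" + sub + "'"
            if res:
                res[-1] += quoted
            else:
                res.append(quoted)
    return res


print(split_on_spaced_and('hello and and bye'))  # ['hello', 'and bye']
print(split_on_spaced_and('hello andand bye'))   # ['hello andand bye']
```

A repeated and leaves a stray "and" in the second piece, and "andand" is not split at all, so this variant is only suitable when the input is known to be well formed.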