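How to split a string by another string, unless that string is inside quotes, in Python?

Asked: 2015-12-23 21:54:08

Tags: python regex string

I would like to split the following string on the word 'and', except when 'and' is inside quotes.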

string = "section_category_name = 'computer and equipment expense' and date >= 2015-01-01 and date <= 2015-03-31"

Desired result

["section_category_name = 'computer and equipment expense'","date >= 2015-01-01","date <= 2015-03-31"]

I can't seem to find the right regex pattern that splits the string so that 'computer and equipment expense' is not split.

Here is what I tried:

re.split('and',string)

Result

[" section_category_name = 'computer "," equipment expense' ",' date >= 2015-01-01 ',' date <= 2015-03-31']

As you can see, the result has split 'computer and equipment expense' into separate items in the list.

I also tried the following, taken from another question:

r = re.compile(r'(?! )[^[]+?(?= *\[)'
               r'|'
               r'\[.+?\]')
r.findall(string)

Result:

[]

I also tried the following, taken from another question:

result = re.split(r"and+(?=[^()]*(?:\(|$))", string)

Result:

[" section_category_name = 'computer ",
 " equipment expense' ",
 ' date >= 2015-01-01 ',
 ' date <= 2015-03-31']

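The challenge is that previous questions on this topic don't cover splitting a string by a word that may appear inside quotes; they cover splitting by a special character or by whitespace.

I can get the desired result if I modify the string as follows:
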
string = " section_category_name = (computer and equipment expense) and date >= 2015-01-01 and date <= 2015-03-31"
result = re.split(r"and+(?=[^()]*(?:\(|$))", string)

Desired result

[' section_category_name = (computer and equipment expense) ',
 ' date >= 2015-01-01 ',
 ' date <= 2015-03-31']

But what I need is to avoid splitting on 'and' inside apostrophes, not inside parentheses.

6 Answers:

Answer 0: (score: 1)

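You can use re.findall to produce a list of 2-tuples, where the first element is either a quoted string or empty, and the second element is either a run of non-whitespace characters or empty.

You can then use itertools.groupby to partition on the word "and" (when it is not inside a quoted string), and rejoin the populated elements inside a list-comp, e.g.: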

import re
from itertools import groupby

text = "section_category_name = 'computer and equipment expense'      and date >= 2015-01-01 and date <= 2015-03-31 and blah = 'ooops'"
items = [
    ' '.join(el[0] or el[1] for el in g)
    for k, g in groupby(re.findall(r"('.*?')|(\S+)", text), lambda L: L[1] == 'and')
    if not k
]

Gives you:

["section_category_name = 'computer and equipment expense'",
 'date >= 2015-01-01',
 'date <= 2015-03-31',
 "blah = 'ooops'"]

Note that whitespace outside the quoted strings is also normalised, which may or may not be desirable...

Also note that this gives you some flexibility in the grouping, so you could change lambda L: L[1] == 'and' to lambda L: L[1] in ('and', 'or') to split on different words if needed...
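As a rough sketch of that last remark, the same findall/groupby idea can be wrapped in a function that takes the separator words as a parameter. The function name split_outside_quotes and its keywords parameter are made up for illustration and are not part of the answer above.

```python
import re
from itertools import groupby

def split_outside_quotes(text, keywords=("and", "or")):
    """Split text on whole-word keywords that are not inside single quotes.

    Hypothetical helper: the name and the keywords parameter are assumptions.
    """
    # Each token is a 2-tuple: (quoted string or '', bare word or '').
    tokens = re.findall(r"('.*?')|(\S+)", text)
    return [
        " ".join(quoted or bare for quoted, bare in group)
        # A token is a separator only if its *bare* part is a keyword,
        # so a keyword inside a quoted string never splits.
        for is_separator, group in groupby(tokens, lambda t: t[1] in keywords)
        if not is_separator
    ]

parts = split_outside_quotes("name = 'salt and pepper' and qty >= 2 or qty = 0")
print(parts)
# ["name = 'salt and pepper'", 'qty >= 2', 'qty = 0']
```

As with the original snippet, whitespace outside quoted strings is collapsed to single spaces.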

Answer 1: (score: 0)

You can use the following regex with re.findall:

((?:(?!\band\b)[^'])*(?:'[^'\\]*(?:\\.[^'\\]*)*'(?:(?!\band\b)[^'])*)*)(?:and|$)

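See the regex demo.

The regex is an unrolled sequence of two parts: anything other than a ' up to the first and (using the tempered greedy token (?:(?!\band\b)[^'])*), and anything between and including single apostrophes, with support for escaped characters (using '[^'\\]*(?:\\.[^'\\]*)*', which is the unrolled version of '([^'\\]|\\.)*').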

Python code demo

import re
p = re.compile(r'((?:(?!\band\b)[^\'])*(?:\'[^\'\\]*(?:\\.[^\'\\]*)*\'(?:(?!\band\b)[^\'])*)*)(?:and|$)')
s = "section_category_name = 'computer and equipment expense' and date >= 2015-01-01 and date <= 2015-03-31"
print([x for x in p.findall(s) if x])
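To make the pieces of that pattern easier to see, here is the same regex written with re.VERBOSE and one comment per part. This is only a restatement for readability; the extra strip() at the end is my own addition and is not in the answer's code.

```python
import re

pattern = re.compile(r"""
    (                                    # group 1: one condition
        (?: (?!\band\b) [^'] )*          # tempered greedy token: unquoted text, stopping before a whole-word and
        (?:
            ' [^'\\]* (?: \\. [^'\\]* )* '   # a single-quoted string, backslash escapes allowed
            (?: (?!\band\b) [^'] )*          # more unquoted text after the quoted string
        )*
    )
    (?: and | $ )                        # the separator itself, or the end of the input
""", re.VERBOSE)

s = "section_category_name = 'computer and equipment expense' and date >= 2015-01-01 and date <= 2015-03-31"

# findall returns group 1 for every match; drop empty matches and trim whitespace.
parts = [m.strip() for m in pattern.findall(s) if m.strip()]
print(parts)
```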

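Answer 2: (score: 0)

If all your strings follow the same pattern, you can use a regex that divides the match into 3 groups. The first group runs from the start of the string to the closing '. The next group is everything between the first and the last "and". The last group is the rest of the text.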

import re

string = "section_category_name = 'computer and equipment expense' and date >= 2015-01-01 and date <= 2015-03-31"

output = re.match(r"(^.+['].+['])\sand\s(.+)\sand\s(.+)", string).groups()
print(output)

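Each group is defined by a pair of parentheses in the regex. The square brackets define a specific character to match. This example only works when "section_category_name" is equal to something inside single quotes.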

section_category_name = 'something here' and ...
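To illustrate the "same pattern" caveat, here is a small sketch that wraps the regex in a function and returns None when the input does not have exactly this shape. The name split_fixed is hypothetical.

```python
import re

# Same regex as above: a quoted first condition followed by exactly two more.
FIXED = re.compile(r"(^.+['].+['])\sand\s(.+)\sand\s(.+)")

def split_fixed(text):
    """Return the three conditions, or None if text has a different shape."""
    m = FIXED.match(text)
    return list(m.groups()) if m else None

print(split_fixed("section_category_name = 'computer and equipment expense' "
                  "and date >= 2015-01-01 and date <= 2015-03-31"))
# Only two conditions, so the pattern does not match:
print(split_fixed("section_category_name = 'computer and equipment expense' and date >= 2015-01-01"))
```

The second call should print None, which is why this approach is only suitable when the number and order of conditions are known in advance.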

Answer 3: (score: 0)

The following code works, and doesn't need a crazy regex to do it.

import re

# We create a "lexer" using regex. This will match strings surrounded by single quotes,
# words without any whitespace in them, and the end of the string. We then use finditer()
# to grab all non-overlapping tokens.
lexer = re.compile(r"'[^']*'|[^ ]+|$")

string = "section_category_name = 'computer and equipment expense' and date >= 2015-01-01 and date <= 2015-03-31"

results = []
buff = []

# Iterate through all the tokens our lexer identified and parse accordingly
for match in lexer.finditer(string):
    token = match.group(0) # group 0 is the entire matching string

    if token in ('and', ''):
        # Once we reach 'and' or the end of the string '' (matched by $)
        # We join all previous tokens with a space and add to our results.
        results.append(' '.join(buff))
        buff = [] # Reset for the next set of tokens
    else:
        buff.append(token)

print(results)

Demo

Edit: Here is a more concise version, which effectively replaces the for loop above with itertools.groupby.

import re
from itertools import groupby

string = "section_category_name = 'computer and equipment expense' and date >= 2015-01-01 and date <= 2015-03-31"

lexer = re.compile(r"'[^']*'|[^\s']+")
grouping = groupby(lexer.findall(string), lambda x: x == 'and')
results = [ ' '.join(g) for k, g in grouping if not k ]

print(results)

Demo

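Answer 4: (score: 0)

I would just use the fact that re.split has this feature:

If capturing parentheses are used in pattern, then the text of all groups in the pattern are also returned as part of the resulting list.

Combined with a capturing group around the quoted-string alternative, this returns a list of strings separated by None wherever a bare "and" was matched. This keeps the regex simple, although some post-merging is needed.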

>>> tokens = re.split(r"('[^']*')|and", string)
# ['section_category_name = ', "'computer and equipment expense'", ' ', None, ' date >= 2015-01-01 ', None, ' date <= 2015-03-31']    
>>> ''.join([t if t else '\0' for t in tokens]).split('\0')
["section_category_name = 'computer and equipment expense' ", ' date >= 2015-01-01 ', ' date <= 2015-03-31']

Note that the 0x00 char is used there as a temporary separator, so this won't work well if you need to handle strings that contain nulls.
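If the NUL separator is a concern, the merge step can also be done with a plain loop that starts a new piece whenever re.split yields None. This is a sketch, not the answer's code: the function name split_on_and is hypothetical, and I have assumed \band\b instead of a bare and so that words such as "brand" are left alone, and stripped the pieces at the end.

```python
import re

def split_on_and(text):
    """Split on the word and outside single quotes, without a sentinel character."""
    parts = [""]
    for token in re.split(r"('[^']*')|\band\b", text):
        if token is None:
            # Group 1 did not take part in the match, so the separator
            # was a bare and: start a new piece.
            parts.append("")
        else:
            # Ordinary text or a quoted string: glue it onto the current piece.
            parts[-1] += token
    return [p.strip() for p in parts if p.strip()]

print(split_on_and("section_category_name = 'computer and equipment expense' "
                   "and date >= 2015-01-01 and date <= 2015-03-31"))
```

Testing for None explicitly also avoids treating an empty string as a separator, which I believe the join/split version would do when the input starts with a quoted string.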

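Answer 5: (score: 0)

I'm not sure what you want to do about whitespace around the and, or about repeated ands in the string. What would you want if your string were 'hello and and bye' or 'hello andand bye'?

I haven't tested all the corner cases, and I strip the whitespace around the "and", which may or may not be what you want: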

string = "section_category_name = 'computer and equipment expense' and date >= 2015-01-01 and date <= 2015-03-31"
res = []
spl = 'and'
for idx, sub in enumerate(string.split("'")):
  if idx % 2 == 0:
    subsub = sub.split(spl)
    for jdx in range(1, len(subsub) - 1):
      subsub[jdx] = subsub[jdx].strip()
    if len(subsub) > 1:
      subsub[0] = subsub[0].rstrip()
      subsub[-1] = subsub[-1].lstrip()
    res += [i for i in subsub if i.strip()]
  else:
    quoted_str = "'" + sub + "'"
    if res:
      res[-1] += quoted_str
    else:
      res.append(quoted_str)

A simpler solution, if you know that and will always have a space on either side, won't be repeated, and you don't want extra whitespace removed:

string = "section_category_name = 'computer and equipment expense' and date >= 2015-01-01 and date <= 2015-03-31"
spl = 'and'
res = []
spaced_spl = ' ' + spl + ' '
for idx, sub in enumerate(string.split("'")):
  if idx % 2 == 0:
    res += [i for i in sub.split(spaced_spl) if i.strip()]
  else:
    quoted_str = "'" + sub + "'"
    if res:
      res[-1] += quoted_str
    else:
      res.append(quoted_str)

Output:

["section_category_name = 'computer and equipment expense'", 'date >= 2015-01-01', 'date <= 2015-03-31']