Question

I want to split the following string on the word 'and', except when 'and' appears inside quotes.

string = "section_category_name = 'computer and equipment expense' and date >= 2015-01-01 and date <= 2015-03-31"

Desired result

["section_category_name = 'computer and equipment expense'","date >= 2015-01-01","date <= 2015-03-31"]

I can't seem to find the right regex pattern to split the string so that 'computer and equipment expense' is not split.

Here is what I tried:

re.split('and',string)

Result

[" section_category_name = 'computer "," equipment expense' ",' date >= 2015-01-01 ',' date <= 2015-03-31']

As you can see, the result splits 'computer and equipment expense' into separate items in the list.

I also tried the following from this question:

r = re.compile('(?! )[^[]+?(?= *\[)'
               '|'
               '\[.+?\]')
r.findall(s)

Result:

[]

I also tried the following from another question:

result = re.split(r"and+(?=[^()]*(?:\(|$))", string)

Result:

[" section_category_name = 'computer ",
 " equipment expense' ",
 ' date >= 2015-01-01 ',
 ' date <= 2015-03-31']

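The challenge is that earlier questions on this topic do not address how to split a string on a word while ignoring occurrences of that word inside quotes; they address how to split on special characters or spaces.

I can get the desired result if I modify the string as follows: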

string = " section_category_name = (computer and equipment expense) and date >= 2015-01-01 and date <= 2015-03-31"
result = re.split(r"and+(?=[^()]*(?:\(|$))", string)

Desired result

[' section_category_name = (computer and equipment expense) ',
 ' date >= 2015-01-01 ',
 ' date <= 2015-03-31']

But what I need is to avoid splitting on 'and' inside apostrophes rather than inside parentheses.

Answer 1

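You can use re.findall to generate a list of 2-tuples, where the first element is either a quoted string or empty, and the second element is either a run of non-whitespace characters or empty.

You can then use itertools.groupby to partition on the word "and" (when it is not inside a quoted string), and then rejoin the populated elements inside a list-comp, e.g.: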

import re
from itertools import groupby

text = "section_category_name = 'computer and equipment expense'      and date >= 2015-01-01 and date <= 2015-03-31 and blah = 'ooops'"
items = [
    ' '.join(el[0] or el[1] for el in g)
    for k, g in groupby(re.findall(r"('.*?')|(\S+)", text), lambda L: L[1] == 'and')
    if not k
]

Gives you:

["section_category_name = 'computer and equipment expense'",
 'date >= 2015-01-01',
 'date <= 2015-03-31',
 "blah = 'ooops'"]

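Note that whitespace outside quoted strings is also normalised, which may or may not be desirable...

Also note that this gives you some flexibility in the grouping, so if needed you can change lambda L: L[1] == 'and' to lambda L: L[1] in ('and', 'or') to group on different words...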
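The variation mentioned above can be sketched as a small helper. The function name split_on_words and its words parameter are made up for this illustration; the tokenising pattern and the groupby step are the ones from the answer.

```python
import re
from itertools import groupby

# Each token is a 2-tuple: (quoted string or '', bare word or '').
TOKEN = re.compile(r"('.*?')|(\S+)")

def split_on_words(text, words=("and",)):
    """Split text on any bare word in `words`, ignoring quoted strings.

    Hypothetical helper; pass `words` as a tuple, not a plain string.
    """
    pairs = TOKEN.findall(text)
    return [
        " ".join(quoted or bare for quoted, bare in group)
        for is_sep, group in groupby(pairs, lambda pair: pair[1] in words)
        if not is_sep
    ]

print(split_on_words("a = 'x or y' or b = 1 and c = 2", words=("and", "or")))
```

A quoted token always has an empty second element, so it can never be mistaken for a separator, which is why 'x or y' stays in one piece.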

Answer 2

You can use the following regex with re.findall:

((?:(?!\band\b)[^'])*(?:'[^'\\]*(?:\\.[^'\\]*)*'(?:(?!\band\b)[^'])*)*)(?:and|$)

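See the regex demo.

The regex consists of an unrolled sequence of two parts: anything other than ' up to the first and (using the tempered greedy token (?:(?!\band\b)[^'])*), and any single-quoted substring, including its quotes and with support for escaped characters (using '[^'\\]*(?:\\.[^'\\]*)*', which is the unrolled version of '([^'\\]|\\.)*').

Python code demo: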

import re
p = re.compile(r'((?:(?!\band\b)[^\'])*(?:\'[^\'\\]*(?:\\.[^\'\\]*)*\'(?:(?!\band\b)[^\'])*)*)(?:and|$)')
s = "section_category_name = 'computer and equipment expense' and date >= 2015-01-01 and date <= 2015-03-31"
print([x for x in p.findall(s) if x])
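For readability, the same pattern can be written with re.VERBOSE so that each part carries a comment. This is only a sketch: the helper name split_conditions is made up here, and stripping the surrounding whitespace from each piece is an added assumption about what the caller wants (the demo above keeps it).

```python
import re

# Same pattern as above, spread over several lines with re.VERBOSE.
PATTERN = re.compile(r"""
    (                               # group 1: one condition
      (?:(?!\band\b)[^'])*          # unquoted text, stopping before the word and
      (?:
        '[^'\\]*(?:\\.[^'\\]*)*'    # single-quoted string, backslash escapes allowed
        (?:(?!\band\b)[^'])*        # more unquoted text
      )*
    )
    (?:and|$)                       # the separator, or the end of the string
""", re.VERBOSE)

def split_conditions(text):
    """Hypothetical helper: return the non-empty pieces, stripped."""
    return [piece.strip() for piece in PATTERN.findall(text) if piece.strip()]

s = "section_category_name = 'computer and equipment expense' and date >= 2015-01-01 and date <= 2015-03-31"
print(split_conditions(s))
```

Because the quoted part accepts backslash escapes, an input such as r"name = 'it\'s here and there' and x = 1" should also be split only at the and that sits outside the quotes.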

Answer 3

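If all of your strings follow the same pattern, you can use a regex that divides the match into 3 groups. The first group runs from the beginning to the last '. The next group is everything between the first and the last "and" outside the quotes. The last group is the rest of the text.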

import re

string = "section_category_name = 'computer and equipment expense' and date >= 2015-01-01 and date <= 2015-03-31"

output = re.match(r"(^.+['].+['])\sand\s(.+)\sand\s(.+)", string).groups()
print(output)

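Each group is defined by a pair of parentheses in the regex. The square brackets define a specific character to match. This example only works if "section_category_name" is equal to something inside single quotes: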

section_category_name = 'something here' and ...
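Since the pattern assumes exactly this shape (a quoted value followed by two more conditions), re.match returns None for anything else, and calling .groups() on that raises an AttributeError. A minimal guarded sketch, with the helper name split_fixed_shape made up for illustration:

```python
import re

# Same pattern as above: quoted first condition, then exactly two more.
SHAPE = re.compile(r"(^.+['].+['])\sand\s(.+)\sand\s(.+)")

def split_fixed_shape(text):
    """Hypothetical helper: return the three groups, or None if the shape differs."""
    match = SHAPE.match(text)
    return list(match.groups()) if match else None

ok = "section_category_name = 'computer and equipment expense' and date >= 2015-01-01 and date <= 2015-03-31"
print(split_fixed_shape(ok))
print(split_fixed_shape("a = 'b and c' and d = 1"))  # only one outer "and": no match
```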

Answer 4

The following code will work, and it doesn't require a crazy regex to make it happen.

import re

# We create a "lexer" using regex. This will match strings surrounded by single quotes,
# words without any whitespace in them, and the end of the string. We then use finditer()
# to grab all non-overlapping tokens.
lexer = re.compile(r"'[^']*'|[^ ]+|$")

string = "section_category_name = 'computer and equipment expense' and date >= 2015-01-01 and date <= 2015-03-31"

results = []
buff = []

# Iterate through all the tokens our lexer identified and parse accordingly
for match in lexer.finditer(string):
    token = match.group(0) # group 0 is the entire matching string

    if token in ('and', ''):
        # Once we reach 'and' or the end of the string '' (matched by $)
        # We join all previous tokens with a space and add to our results.
        results.append(' '.join(buff))
        buff = [] # Reset for the next set of tokens
    else:
        buff.append(token)

print(results)

Demo

Edit: here is a more concise version that effectively replaces the for loop in the code above with itertools.groupby.

import re
from itertools import groupby

string = "section_category_name = 'computer and equipment expense' and date >= 2015-01-01 and date <= 2015-03-31"

lexer = re.compile(r"'[^']*'|[^\s']+")
grouping = groupby(lexer.findall(string), lambda x: x == 'and')
results = [ ' '.join(g) for k, g in grouping if not k ]

print(results)

Demo
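If the input may also contain double-quoted strings, the lexer can be given one more alternative. This goes beyond what the question asks for, so treat it as a sketch: the function name split_outside_any_quotes is hypothetical, and the double-quote branch is an assumption about the input rather than part of the answer.

```python
import re
from itertools import groupby

# Tokens: a single-quoted string, a double-quoted string, or a run of
# characters containing neither whitespace nor quotes.
LEXER = re.compile(r"""'[^']*'|"[^"]*"|[^\s'"]+""")

def split_outside_any_quotes(text, word="and"):
    """Hypothetical helper: split on `word` wherever it is a bare token."""
    tokens = LEXER.findall(text)
    grouping = groupby(tokens, lambda tok: tok == word)
    return [" ".join(group) for is_sep, group in grouping if not is_sep]

print(split_outside_any_quotes("""name = "a and b" and x = 'c and d' and y = 1"""))
```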

Answer 5

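I would simply use the fact that re.split has this feature:

If capturing parentheses are used in pattern, then the text of all groups in the pattern are also returned as part of the resulting list.

Combined with a pattern that has two alternatives, only one of which is a capturing group, this returns a list of strings separated by None wherever an unquoted and was matched. This keeps the regex simple, although some post-merging is needed.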

>>> tokens = re.split(r"('[^']*')|and", string)
# ['section_category_name = ', "'computer and equipment expense'", ' ', None, ' date >= 2015-01-01 ', None, ' date <= 2015-03-31']    
>>> ''.join([t if t else '\0' for t in tokens]).split('\0')
["section_category_name = 'computer and equipment expense' ", ' date >= 2015-01-01 ', ' date <= 2015-03-31']

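Note that the 0x00 char is used there as a temporary separator, so this won't work properly if you need to handle strings that contain null characters.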
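The merging step can also be done without a sentinel character by starting a new piece whenever a None appears in the token list. The sketch below is one way to do that; the function name split_keep_quoted is made up, and the \b word boundaries and the final strip() are additions that the original one-liner does not have.

```python
import re

def split_keep_quoted(text, word="and"):
    """Hypothetical helper: split on `word` outside single-quoted strings."""
    # Quoted strings are captured (and so kept); a matched separator
    # leaves the capture group unset, which re.split reports as None.
    pattern = r"('[^']*')|\b%s\b" % re.escape(word)
    parts, current = [], []
    for token in re.split(pattern, text):
        if token is None:              # an unquoted separator was matched
            parts.append("".join(current))
            current = []
        else:
            current.append(token)
    parts.append("".join(current))
    return [part.strip() for part in parts if part.strip()]

string = "section_category_name = 'computer and equipment expense' and date >= 2015-01-01 and date <= 2015-03-31"
print(split_keep_quoted(string))
```

Testing for None explicitly, rather than for falsiness, also keeps the empty strings that re.split emits next to a quoted match from being treated as split points.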

Answer 6

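I'm not sure what you want to do with the whitespace around and, or with repeated ands in the string. What would you want if your string were 'hello and and bye' or 'hello andand bye'?

I haven't tested all the corner cases, and I strip the whitespace around "and", which may or may not be what you want: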

string = "section_category_name = 'computer and equipment expense' and date >= 2015-01-01 and date <= 2015-03-31"
res = []
spl = 'and'
for idx, sub in enumerate(string.split("'")):
  if idx % 2 == 0:
    subsub = sub.split(spl)
    for jdx in range(1, len(subsub) - 1):
      subsub[jdx] = subsub[jdx].strip()
    if len(subsub) > 1:
      subsub[0] = subsub[0].rstrip()
      subsub[-1] = subsub[-1].lstrip()
    res += [i for i in subsub if i.strip()]
  else:
    quoted_str = "'" + sub + "'"
    if res:
      res[-1] += quoted_str
    else:
      res.append(quoted_str)

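A simpler solution, if you know that and will be surrounded by a space on either side and will not be repeated, and you don't want to strip extra whitespace: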

string = "section_category_name = 'computer and equipment expense' and date >= 2015-01-01 and date <= 2015-03-31"
spl = 'and'
res = []
spaced_spl = ' ' + spl + ' '
for idx, sub in enumerate(string.split("'")):
  if idx % 2 == 0:
    res += [i for i in sub.split(spaced_spl) if i.strip()]
  else:
    quoted_str = "'" + sub + "'"
    if res:
      res[-1] += quoted_str
    else:
      res.append(quoted_str)

Output:

["section_category_name = 'computer and equipment expense'", 'date >= 2015-01-01', 'date <= 2015-03-31']

How do I split a string on a word, unless that word is enclosed in quotes, in Python?

6 answers: