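Suppose I have the following string, which contains SQL expressions extracted from a SELECT clause (in reality it is one huge SQL statement containing hundreds of such expressions):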
SUM(case when(A.money-B.money>1000
and A.unixtime-B.unixtime<=890769
and B.col10 = "A"
and B.col11 = "12"
and B.col12 = "V") then 10
end) as finalCond0,
MAX(case when(A.money-B.money<0
and A.unixtime-B.unixtime<=6786000
and B.cond1 = "A"
and B.cond2 = "4321"
and B.cond3 in ("E", "F", "G")) then A.col10
end) as finalCond1,
SUM(case when(A.money-B.money>0
and A.unixtime-B.unixtime<=6786000
and B.cond1 = "A"
and B.cond2 = "1234"
and B.cond3 in ("A", "B", "C")) then 2
end) as finalCond2
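How can I split this query on the aggregate functions (i.e. SUM, MAX, MIN, MEAN, etc.) so that I can extract the last expression without removing the delimiter (in this case SUM)?
The desired output would therefore be a string like the following: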
SUM(case when(A.money-B.money>0
and A.unixtime-B.unixtime<=6786000
and B.cond1 = "A"
and B.cond2 = "1234"
and B.cond3 in ("A", "B", "C")) then 2
end) as finalCond2
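PS: I added indentation for presentation purposes only. In reality the expressions are separated by commas, so the original format contains no extra whitespace or newlines.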
Answer 0 (score: 2)
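You can't use a regular expression here, because SQL syntax is not a regular language, so the Python re engine cannot match it reliably. You actually have to parse the string into a token stream or a syntax tree. Your SUM(...) can contain a wide range of syntax, including sub-selects.
The sqlparse library can do this, even if it is a bit underdocumented and not that friendly to external uses.
Re-using the walk_tokens function I defined in another post I linked to: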
from collections import deque
from sqlparse.sql import TokenList

def walk_tokens(token):
    # breadth-first walk over the token tree
    queue = deque([token])
    while queue:
        token = queue.popleft()
        if isinstance(token, TokenList):
            queue.extend(token)
        yield token
Extracting the last element from the SELECT identifier list then looks like this:
import sqlparse
from sqlparse.sql import IdentifierList

tokens = sqlparse.parse(sql)[0]
for tok in walk_tokens(tokens):
    if isinstance(tok, IdentifierList):
        # iterate to leave the last assigned to `identifier`
        for identifier in tok.get_identifiers():
            pass
        break

print(identifier)
Demo:
>>> sql = '''\
... SUM(case when(A.money-B.money>1000
... and A.unixtime-B.unixtime<=890769
... and B.col10 = "A"
... and B.col11 = "12"
... and B.col12 = "V") then 10
... end) as finalCond0,
... MAX(case when(A.money-B.money<0
... and A.unixtime-B.unixtime<=6786000
... and B.cond1 = "A"
... and B.cond2 = "4321"
... and B.cond3 in ("E", "F", "G")) then A.col10
... end) as finalCond1,
... SUM(case when(A.money-B.money>0
... and A.unixtime-B.unixtime<=6786000
... and B.cond1 = "A"
... and B.cond2 = "1234"
... and B.cond3 in ("A", "B", "C")) then 2
... end) as finalCond2
... '''
>>> tokens = sqlparse.parse(sql)[0]
>>> for tok in walk_tokens(tokens):
...     if isinstance(tok, IdentifierList):
...         # iterate to leave the last assigned to `identifier`
...         for identifier in tok.get_identifiers():
...             pass
...         break
...
>>> print(identifier)
SUM(case when(A.money-B.money>0
and A.unixtime-B.unixtime<=6786000
and B.cond1 = "A"
and B.cond2 = "1234"
and B.cond3 in ("A", "B", "C")) then 2
end) as finalCond2
identifier is a sqlparse.sql.Identifier instance, but converting it back to a string (print() does this for you, or you can just use str()) gives you the input SQL text for that part again.
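To illustrate why a plain str.split(',') is not enough here, and what parsing means at its most basic, here is a minimal sketch that is not part of the original answer. It splits on top-level commas only, by tracking parenthesis depth and quote state. The helper name split_top_level is hypothetical, and the sketch assumes the SQL contains no comments and no backslash-escaped quotes; sqlparse remains the more robust choice for real queries.

```python
def split_top_level(sql, sep=","):
    """Split `sql` on `sep`, ignoring separators inside parentheses or quotes.

    Hypothetical helper for illustration; it does not understand SQL comments
    or backslash-escaped quotes.
    """
    parts, current = [], []
    depth = 0      # current parenthesis nesting level
    quote = None   # the quote character we are currently inside, if any
    for ch in sql:
        if quote:
            if ch == quote:
                quote = None          # closing quote
        elif ch in "'\"":
            quote = ch                # opening quote
        elif ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        elif ch == sep and depth == 0:
            # a separator outside any parentheses or quotes ends the part
            parts.append("".join(current).strip())
            current = []
            continue
        current.append(ch)
    parts.append("".join(current).strip())
    return parts


sql = ('SUM(case when(B.cond3 in ("A", "B")) then 2 end) as finalCond0, '
       'MAX(case when(B.cond1 = "x,y") then A.col10 end) as finalCond1')

naive = sql.split(",")          # also cuts inside in (...) and inside "x,y"
parsed = split_top_level(sql)   # cuts only between the two expressions

print(len(naive), len(parsed))
print(parsed[-1])               # the last expression, aggregate keyword included
```

Taking parsed[-1] gives the last expression with its SUM/MAX prefix intact, which is the shape of output the question asks for.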
Answer 1 (score: 0)
I have a solution, but it takes quite a bit of code. It does not use regex; it just splits repeatedly on the keywords.
s = """
SUM(case when(A.money-B.money>1000
and A.unixtime-B.unixtime<=890769
and B.col10 = "A"
and B.col11 = "12"
and B.col12 = "V") then 10
end) as finalCond0,
MAX(case when(A.money-B.money<0
and A.unixtime-B.unixtime<=6786000
and B.cond1 = "A"
and B.cond2 = "4321"
and B.cond3 in ("E", "F", "G")) then A.col10
end) as finalCond1,
SUM(case when(A.money-B.money>0
and A.unixtime-B.unixtime<=6786000
and B.cond1 = "A"
and B.cond2 = "1234"
and B.cond3 in ("A", "B", "C")) then 2
end) as finalCond2
"""
# remove newlines and double spaces
s = s.replace('\n', ' ')
while '  ' in s:
    s = s.replace('  ', ' ')
s = s.strip()

# split on keywords, starting with the original string
current_parts = [s]
for kw in ['SUM', 'MAX', 'MIN']:
    new_parts = []
    for part in current_parts:
        for i, new_part in enumerate(part.split(kw)):
            if i > 0:
                # add keyword to the start of this substring
                new_part = '{}{}'.format(kw, new_part)
            new_part = new_part.strip()
            if len(new_part) > 0:
                new_parts.append(new_part)
    current_parts = new_parts

print()
print('current_parts:')
for part in current_parts:
    print(part)
The output I get is:
current_parts:
SUM(case when(A.money-B.money>1000 and A.unixtime-B.unixtime<=890769 and B.col10 = "A" and B.col11 = "12" and B.col12 = "V") then 10 end) as finalCond0,
MAX(case when(A.money-B.money<0 and A.unixtime-B.unixtime<=6786000 and B.cond1 = "A" and B.cond2 = "4321" and B.cond3 in ("E", "F", "G")) then A.col10 end) as finalCond1,
SUM(case when(A.money-B.money>0 and A.unixtime-B.unixtime<=6786000 and B.cond1 = "A" and B.cond2 = "1234" and B.cond3 in ("A", "B", "C")) then 2 end) as finalCond2
Does that work for you? It seems to work well for the sample string you posted in your question.
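One caveat with splitting on the bare keyword is that str.split('SUM') also cuts inside names such as SUMMARY. As a hedged variation (not the answer author's code), the split can be done with a zero-width lookahead that only matches a keyword starting a word and immediately followed by an opening parenthesis. The names AGGREGATES and split_on_aggregates are hypothetical, the sketch assumes Python 3.7+ (where re.split accepts empty matches), and it still breaks on nested aggregates such as SUM(MAX(x)) or on keywords inside string literals.

```python
import re

# Hypothetical keyword list; extend it with the aggregates your query uses.
AGGREGATES = ("SUM", "MAX", "MIN")

# Zero-width lookahead: split *before* a keyword that starts a word and is
# directly followed by "(", so the keyword stays attached to its part.
pattern = re.compile(r"(?=\b(?:%s)\()" % "|".join(AGGREGATES))


def split_on_aggregates(sql):
    # Drop empty pieces and the trailing separator comma of each part.
    return [p.strip().rstrip(",").rstrip() for p in pattern.split(sql) if p.strip()]


s = ('SUM(A.x) as SUMMARY, MAX(B.y) as finalCond1, '
     'MIN(case when(A.MAXIMUM > 0) then 1 end) as c')

print(len(s.split("SUM")))      # the plain split also cuts inside SUMMARY
for part in split_on_aggregates(s):
    print(part)
```

Unlike the loop above, this variant strips the trailing commas from each part; remove the rstrip(",") call if they should be kept.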
Answer 2 (score: 0)
You can use something like this:
import re

# named `sql` rather than `str` to avoid shadowing the built-in type
sql = 'SUM(case when(A.money-B.money>1000 and A.unixtime-B.unixtime<=890769 and B.col10 = "A" and B.col11 = "12" and B.col12 = "V") then 10 end) as finalCond0, MAX(case when(A.money-B.money<0 and A.unixtime-B.unixtime<=6786000 and B.cond1 = "A" and B.cond2 = "4321" and B.cond3 in ("E", "F", "G")) then A.col10 end) as finalCond1, SUM(case when(A.money-B.money>0 and A.unixtime-B.unixtime<=6786000 and B.cond1 = "A" and B.cond2 = "1234" and B.cond3 in ("A", "B", "C")) then 2 end) as finalCond2'

# find every "as <alias>" occurrence
result = re.finditer(r'as\s+[a-zA-Z0-9]+', sql)
commas = []
parts = []
for reg in result:
    end = reg.end()
    # keep the position only if the alias is directly followed by a comma
    if len(sql) > end and sql[end] == ',':
        commas.append(end)

# slice the string at the recorded comma positions
idx = 0
for comma in commas:
    parts.append(sql[idx:comma])
    idx = comma + 1
parts.append(sql[idx:])
print(parts)
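The commas list holds the positions of the separating commas. Its contents will be:
[151, 322]
The parts list holds the final list of parts (I am not sure whether this implementation is the best approach):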
[
'SUM(case when(A.money-B.money>1000 and A.unixtime-B.unixtime<=890769 and B.col10 = "A" and B.col11 = "12" and B.col12 = "V") then 10 end) as finalCond0',
' MAX(case when(A.money-B.money<0 and A.unixtime-B.unixtime<=6786000 and B.cond1 = "A" and B.cond2 = "4321" and B.cond3 in ("E", "F", "G")) then A.col10 end) as finalCond1',
' SUM(case when(A.money-B.money>0 and A.unixtime-B.unixtime<=6786000 and B.cond1 = "A" and B.cond2 = "1234" and B.cond3 in ("A", "B", "C")) then 2 end) as finalCond2'
]