Confused about the behavior of setResultsName in Pyparsing

Asked: 2013-06-02 01:20:40

Tags: sql parsing pyparsing

I am trying to parse some SQL statements. Here is a sample:

select
    ms.member_sk a,
    dd.date_sk b,
    st.subscription_type,
    (SELECT foo FROM zoo) e
from dim_member_subscription_all p,
     dim_subs_type
where a in (select moo from t10)

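For now I am only interested in extracting the tables, so I would like to see [zoo, dim_member_subscription_all, dim_subs_type] and [t10].

I have put together a small script, based on Paul McGuire's examples: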

#!/usr/bin/env python
import sys
import pprint
from pyparsing import *


pp = pprint.PrettyPrinter(indent=4)
semicolon = Combine(Literal(';') + lineEnd)
comma = Literal(',')
lparen = Literal('(')
rparen = Literal(')')

update_kw, volatile_kw, create_kw, table_kw, as_kw, from_kw, \
where_kw, join_kw, left_kw, right_kw, cross_kw, outer_kw, \
on_kw , insert_kw , into_kw= \
    map(lambda x: Keyword(x, caseless=True), \
        ['UPDATE', 'VOLATILE', 'CREATE', 'TABLE', 'AS', 'FROM',
         'WHERE', 'JOIN' , 'LEFT', 'RIGHT' , \
         'CROSS', 'OUTER', 'ON', 'INSERT', 'INTO'])

select_kw = Keyword('SELECT', caseless=True) | Keyword('SEL' , caseless=True)

# outer_kw is defined above, so it is included here with the other keywords
reserved_words = (update_kw | volatile_kw | create_kw | table_kw | as_kw |
                  select_kw | from_kw | where_kw | join_kw |
                  left_kw | right_kw | cross_kw | outer_kw | on_kw |
                  insert_kw | into_kw)

ident = ~reserved_words + Word(alphas, alphanums + '_')

table = Combine(Optional(ident + Literal('.')) + ident)
column = Combine(Optional(ident + Literal('.')) + (ident | Literal('*')))

column_alias = Optional(Optional(as_kw).suppress() + ident)
table_alias = Optional(Optional(as_kw).suppress() + ident).suppress()

select_stmt = Forward()
nested_table = lparen.suppress() + select_stmt + rparen.suppress() + table_alias
table_list = delimitedList((nested_table | table) + table_alias)
column_list = delimitedList((nested_table | column) + column_alias)

txt = """
select
       ms.member_sk a,
       dd.date_sk b,
       st.subscription_type,
       (SELECT foo FROM zoo) e
from dim_member_subscription_all p,
     dim_subs_type
where a in (select moo from t10)
"""

select_stmt << select_kw.suppress() + column_list + from_kw.suppress() +  \
               table_list.setResultsName('tables', listAllMatches=True)

print(txt)

for token in select_stmt.searchString(txt):
    pp.pprint(token.asDict())

I get the following nested output. Can anyone help me understand what I am doing wrong?

{   'tables': ([(['zoo'], {}), (['dim_member_subscription_all', 'dim_subs_type'], {})], {})}
{   'tables': ([(['t10'], {})], {})}

1 Answer:

Answer 0 (score: 2)

searchString returns a list of all matching ParseResults. You can see the tables value for each match with:

for token in select_stmt.searchString(txt):
    print(token.tables)

which gives:

[['zoo'], ['dim_member_subscription_all', 'dim_subs_type']]
[['t10']]

So searchString found two SELECT statements.

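Recent versions of pyparsing support using the Python builtin sum to merge this list into a single consolidated ParseResults. Accessing the tables value of the merged result looks like this: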

print(sum(select_stmt.searchString(txt)).tables)

[['zoo'], ['dim_member_subscription_all', 'dim_subs_type'], ['t10']]
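In case it is unclear why sum works on these objects at all: sum starts from the integer 0, so the first addition is 0 + first_match, and the object has to accept that. As far as I know ParseResults handles this case itself, but the toy class below is only an illustration of the general mechanism, not pyparsing's implementation. NamedResults and its tables attribute are made-up names for this sketch.

```python
class NamedResults:
    """Toy container (hypothetical, NOT pyparsing's ParseResults)."""

    def __init__(self, tables=None):
        self.tables = list(tables or [])

    def __add__(self, other):
        # Merging two results concatenates their accumulated 'tables' lists.
        return NamedResults(self.tables + other.tables)

    def __radd__(self, other):
        # sum() begins with the integer 0, so "0 + first_item" ends up here.
        if other == 0:
            return NamedResults(self.tables)
        return NotImplemented


# Values copied from the per-match output shown earlier, as plain lists.
matches = [
    NamedResults([['zoo'], ['dim_member_subscription_all', 'dim_subs_type']]),
    NamedResults([['t10']]),
]

merged = sum(matches)
print(merged.tables)
```

The merged tables list keeps one sublist per matched table list, which is the same shape as the consolidated output above.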

I think the parser is doing everything you want; you just need to work out how to process the returned results.
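One possible way to post-process the results, if you want a flat list of table names per SELECT statement as in the question, is to remove one level of nesting. This is only a sketch on plain Python lists copied from the output above; flatten_tables is a made-up helper name, and with real ParseResults you would probably want to convert first, for example with token.tables.asList().

```python
from itertools import chain


def flatten_tables(nested):
    """Flatten one level of nesting: [['a'], ['b', 'c']] -> ['a', 'b', 'c']."""
    return list(chain.from_iterable(nested))


# The 'tables' value of each match, copied from the output above as plain lists.
per_statement = [
    [['zoo'], ['dim_member_subscription_all', 'dim_subs_type']],
    [['t10']],
]

# One flat list of table names per matched SELECT statement.
flat = [flatten_tables(tables) for tables in per_statement]
print(flat)
```

This keeps the two statements separate, which matches the two lists asked for in the question.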

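For further debugging, you should start using the dump method on ParseResults to see what you are getting back. It prints the nested list of returned tokens, followed by a hierarchical tree of all the named results. For your example: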

for token in select_stmt.searchString(txt):
    print(token.dump())
    print("")  # blank line between matches

prints:

['ms.member_sk', 'a', 'dd.date_sk', 'b', 'st.subscription_type', 'foo', 'zoo', 'dim_member_subscription_all', 'dim_subs_type']
- tables: [['zoo'], ['dim_member_subscription_all', 'dim_subs_type']]

['moo', 't10']
- tables: [['t10']]