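I would like to parse queries against a database of chemical elements.
The database is stored in an XML file. Parsing that file produces a nested dictionary, which is stored in a singleton object that inherits from collections.OrderedDict.
Asking for an element gives me an ordered dictionary of its properties (i.e. ELEMENTS['C'] -> {'name':'carbon','neutron':0,'proton':6,...}).
Conversely, asking for a property gives me an ordered dictionary of that property's values for all the elements (i.e. ELEMENTS['proton'] -> {'H':1,'He':2,...}).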
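To make the two kinds of lookup concrete, here is a minimal sketch. The class name `ElementsDatabase`, the use of `__missing__` and the three sample elements are assumptions made for illustration; the real singleton is built from the XML file and may be implemented quite differently.

```python
from collections import OrderedDict


class ElementsDatabase(OrderedDict):
    """Hypothetical stand-in for the ELEMENTS singleton; not the real class."""

    @property
    def properties(self):
        # Property names in first-seen order, without duplicates.
        names = OrderedDict()
        for element_properties in self.values():
            for name in element_properties:
                names[name] = None
        return list(names)

    def __missing__(self, key):
        # Reached only when `key` is not an element symbol,
        # so interpret it as a property name.
        if key not in self.properties:
            raise KeyError(key)
        return OrderedDict(
            (symbol, element_properties[key])
            for symbol, element_properties in self.items()
            if key in element_properties
        )


# Sample content (assumed for the example, not read from the XML file).
ELEMENTS = ElementsDatabase()
ELEMENTS["H"] = OrderedDict([("name", "hydrogen"), ("proton", 1)])
ELEMENTS["He"] = OrderedDict([("name", "helium"), ("proton", 2)])
ELEMENTS["C"] = OrderedDict([("name", "carbon"), ("proton", 6)])

print(ELEMENTS["C"])       # lookup by element symbol
print(ELEMENTS["proton"])  # lookup by property name
```

In this sketch a property lookup is computed on demand, so it always reflects the current content of the element entries.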
A typical query might be:
mass > 10 or (nucleon < 20 and atomic_radius < 5)
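where each "sub-query" (e.g. mass > 10) returns the set of elements that match it.
The query is then converted internally into a string, which is evaluated to produce the set of indexes of the elements that match it. In that context, the and/or operators are not boolean operators but set operators acting on Python sets.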
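The set semantics can be sketched in plain Python, without any parsing. The sample values and the helper `matching` are assumptions made for illustration, and the thresholds are adapted to the sample so that the result is not trivial.

```python
# Illustrative sample values only (hypothetical data, not read from the XML file).
mass = {"H": 1.008, "He": 4.003, "C": 12.011, "O": 15.999}
nucleon = {"H": 1, "He": 4, "C": 12, "O": 16}
atomic_radius = {"H": 0.53, "He": 0.31, "C": 0.67, "O": 0.48}


def matching(values, predicate):
    """Return the set of element symbols whose value satisfies `predicate`."""
    return {symbol for symbol, value in values.items() if predicate(value)}


heavy = matching(mass, lambda m: m > 10)             # {'C', 'O'}
few_nucleons = matching(nucleon, lambda n: n < 5)    # {'H', 'He'}
small = matching(atomic_radius, lambda r: r < 0.5)   # {'He', 'O'}

# mass > 10 or (nucleon < 5 and atomic_radius < 0.5)
result = heavy | (few_nucleons & small)
print(sorted(result))  # ['C', 'He', 'O']

# Python's own `and` does not intersect sets:
# `x and y` simply returns y whenever x is non-empty.
print((few_nucleons and small) == small)  # True
```

This is why "and" has to be mapped to `&` (intersection) and "or" to `|` (union) before the sub-query results are combined.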
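I recently posted a question about building such queries. Thanks to the helpful answers I got, I think I now have something that more or less works (in a reasonable way, I hope!), but I still have a few questions related to pyparsing.
Here is my code: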
import numpy
from pyparsing import *
# This imports a singleton object storing the database dictionary as
# described earlier
from ElementsDatabase import ELEMENTS
and_operator = oneOf(['and','&'], caseless=True)
or_operator = oneOf(['or' ,'|'], caseless=True)
# ELEMENTS.properties is a property getter that returns the list of
# registered properties in the database
props = oneOf(ELEMENTS.properties, caseless=True)
# A property keyword can be quoted or not.
props = Suppress('"') + props + Suppress('"') | props
# When parsed, it must be replaced by the following expression that
# will be eval later.
# list() keeps this working on Python 3, where values() returns a view.
props.setParseAction(lambda t : "numpy.array(list(ELEMENTS['%s'].values()))" % t[0].lower())
quote = QuotedString('"')
integer = Regex(r'[+-]?\d+').setParseAction(lambda t:int(t[0]))
# At least one digit is required, so that an empty string never matches.
float_ = Regex(r'[+-]?\d+(\.\d*)?([eE][+-]?\d+)?').setParseAction(lambda t:float(t[0]))
comparison_operator = oneOf(['==','!=','>','>=','<', '<='])
comparison_expr = props + comparison_operator + (quote | float_ | integer)
# The comparison goes inside numpy.where(); [0] extracts the array of matching indexes.
comparison_expr.setParseAction(lambda t : "set(numpy.where(%s%s%s)[0])" % tuple(t))
grammar = Combine(operatorPrecedence(comparison_expr, [(and_operator, 2, opAssoc.LEFT), (or_operator, 2, opAssoc.LEFT)]))
# A test query
res = grammar.parseString('"mass " > 30 or (nucleon == 1)',parseAll=True)
print(eval(' '.join(res._asStringList())))
My questions are as follows:
1. Using 'transformString' instead of 'parseString' never raises an
exception, even when the string to be parsed does not match the grammar.
However, an exception in that case is exactly the functionality I need. Is there a way to get it?
2. I would like to reintroduce white space between my tokens so that
my eval does not fail. The only way I found to do so is the one
implemented above. Do you see a better way to do it with pyparsing?
Sorry for the long post, but I wanted to describe the background in some detail. By the way, if you find this approach bad, please do not hesitate to tell me!
Thank you very much for your help.
Eric
Answer 0: (score: 1)
Basically, I used the following approach:
1. For each sub-query (i.e. mass > 10), using the setParseAction method,
I attached a function that returns the set of elements that match
the sub-query.
2. Then, I attached the following functions to the logical operators (and,
or and not):
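def not_operator(token):
    # token[0] is ['not', set]
    _, s = token[0]
    # ELEMENTS is the singleton described in my original post
    return set(ELEMENTS.keys()).difference(s)

def and_operator(token):
    # token[0] is [set, 'and', set, ...]; keep only the operands
    operands = token[0][0::2]
    # Set intersection: Python's `and` would just return one of the sets
    return set.intersection(*operands)

def or_operator(token):
    # token[0] is [set, 'or', set, ...]; keep only the operands
    operands = token[0][0::2]
    # Set union: Python's `or` would just return one of the sets
    return set.union(*operands)

# Thanks to Paul for the hint.
# not_token, and_token and or_token are the operator expressions
# (their definitions are not shown here).
grammar = operatorPrecedence(comparison_expr,
    [(not_token, 1, opAssoc.RIGHT, not_operator),
     (and_token, 2, opAssoc.LEFT, and_operator),
     (or_token, 2, opAssoc.LEFT, or_operator)])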
Please note that these operators act upon Python sets rather than
on booleans.
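For readers who want to try the idea end to end, here is a self-contained sketch. It is an assumption-laden illustration rather than my exact code: the sample `ELEMENTS` dictionary, the `COMPARISONS` table and the helper names (`compare`, `not_action`, `and_action`, `or_action`, `query`) are hypothetical, and it uses `infixNotation`, the later name pyparsing gave to `operatorPrecedence`.

```python
import operator

from pyparsing import (CaselessKeyword, ParseException, Regex,
                       infixNotation, oneOf, opAssoc)

# Hypothetical sample data standing in for the ELEMENTS singleton.
ELEMENTS = {
    "H": {"mass": 1.008, "nucleon": 1},
    "He": {"mass": 4.003, "nucleon": 4},
    "C": {"mass": 12.011, "nucleon": 12},
    "O": {"mass": 15.999, "nucleon": 16},
}

COMPARISONS = {"==": operator.eq, "!=": operator.ne, ">": operator.gt,
               ">=": operator.ge, "<": operator.lt, "<=": operator.le}


def compare(tokens):
    # tokens is [property name, comparison operator, number]
    name, symbol, value = tokens
    test = COMPARISONS[symbol]
    matches = {element for element, props in ELEMENTS.items()
               if test(props[name.lower()], value)}
    return [matches]  # a one-item list: the set becomes a single token


def not_action(tokens):
    _, operand = tokens[0]  # ['not', set]
    return [set(ELEMENTS) - operand]


def and_action(tokens):
    operands = tokens[0][0::2]  # [set, 'and', set, ...] -> the sets only
    return [set.intersection(*operands)]


def or_action(tokens):
    operands = tokens[0][0::2]  # [set, 'or', set, ...] -> the sets only
    return [set.union(*operands)]


prop = oneOf(["mass", "nucleon"], caseless=True)
number = Regex(r"[+-]?\d+(?:\.\d*)?(?:[eE][+-]?\d+)?")
number.setParseAction(lambda t: float(t[0]))
comparison = prop + oneOf("== != >= <= > <") + number
comparison.setParseAction(compare)

grammar = infixNotation(comparison, [
    (CaselessKeyword("not"), 1, opAssoc.RIGHT, not_action),
    (CaselessKeyword("and"), 2, opAssoc.LEFT, and_action),
    (CaselessKeyword("or"), 2, opAssoc.LEFT, or_action),
])


def query(text):
    # parseAll=True raises ParseException if the whole string does not match.
    return grammar.parseString(text, parseAll=True)[0]


print(sorted(query("mass > 10 or (nucleon < 5 and mass < 2)")))  # ['C', 'H', 'O']
```

Because every parse action returns a set directly, no intermediate string and no eval are needed, and with `parseAll=True` a query that does not match the grammar raises `ParseException`.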
That does the job.
I hope this approach helps you.
Eric