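I am trying to use pyparsing to parse chemical formulas that may be nested and may have non-integer stoichiometry. What I want is a list of every element present in the formula together with its total stoichiometry.
I started from the example on the pyparsing wiki and looked at fourFn.py for more ideas. I am having trouble understanding how to use all the features of the package.
I came up with the following grammar: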
from pyparsing import Word, Group, ZeroOrMore, Combine,\
Optional, OneOrMore, ParseException, Literal, nums,\
Suppress, Dict, Forward
caps = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
lowers = caps.lower()
digits = "0123456789"
integer = Word( digits )
parl = Literal("(").suppress()
parr = Literal(")").suppress()
element = Word( caps, lowers )
separator = Literal( "," ).setParseAction(lambda s,l,t: t[0].replace(',','.')) | Literal( "." )
nreal = (Combine( integer + Optional( separator +\
Optional( integer ) ))\
| Combine( separator + integer )).setParseAction( lambda s,l,t: [ float(t[0]) ] )
block = Forward()
groupElem = Group( element + Optional( nreal, default=1)) ^ \
Group( parl + block + parr + Optional( nreal,default=1 ) )
block << groupElem + ZeroOrMore( groupElem )
formula = OneOrMore( block )
Non-nested formulas work as expected:
>>> formula.parseString('H2O')
([(['H', 2.0], {}), (['O', 1], {})], {})
Despite those empty fields (I could not find out what they are), I can extract the information I want.
But when I try something like this:
>>> formula.parseString('C6H8(OH)4')
([(['C', 6.0], {}), (['H', 8.0], {}), ([(['O', 1], {}), (['H', 1], {}), 4.0], {})], {})
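I can see that the formula is parsed correctly, but I would like the outer '4' in (OH)4 to multiply the numbers inside the group, and I cannot see how to do that.
How can one token change the value of another token?
Or how can I process these results and write a function that, when a block has an outer number attached, computes the total for each element inside the block?
Thanks in advance.
Edit 1: I believe I need something like this: when "(block)nreal" occurs, suppress the outer nreal and multiply every nreal inside the block by that outer value...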
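Answer 0 (score: 3)
Solving this problem definitely requires recursion. In pyparsing, you define a recursive grammar with the Forward class. See the comments in this code sample: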
from pyparsing import (Suppress, Word, nums, alphas, Regex, Forward, Group,
Optional, OneOrMore, ParseResults)
from collections import defaultdict
"""
BNF for simple chemical formula (no nesting)
integer :: '0'..'9'+
element :: 'A'..'Z' 'a'..'z'*
term :: element [integer]
formula :: term+
BNF for nested chemical formula
integer :: '0'..'9'+
element :: 'A'..'Z' 'a'..'z'*
term :: (element | '(' formula ')') [integer]
formula :: term+
"""
LPAR,RPAR = map(Suppress,"()")
integer = Word(nums)
# add parse action to convert integers to ints, to support doing addition
# and multiplication at parse time
integer.setParseAction(lambda t:int(t[0]))
element = Word(alphas.upper(), alphas.lower())
# or if you want to be more specific, use this Regex
# element = Regex(r"A[cglmrstu]|B[aehikr]?|C[adeflmorsu]?|D[bsy]|E[rsu]|F[emr]?|"
# "G[ade]|H[efgos]?|I[nr]?|Kr?|L[airu]|M[dgnot]|N[abdeiop]?|"
# "Os?|P[abdmortu]?|R[abefghnu]|S[bcegimnr]?|T[abcehilm]|"
# "Uu[bhopqst]|U|V|W|Xe|Yb?|Z[nr]")
# forward declare 'formula' so it can be used in definition of 'term'
formula = Forward()
term = Group((element | Group(LPAR + formula + RPAR)("subgroup")) +
Optional(integer, default=1)("mult"))
# define contents of a formula as one or more terms
formula << OneOrMore(term)
# add parse actions for parse-time processing
# parse action to multiply out subgroups
def multiplyContents(tokens):
    t = tokens[0]
    # if these tokens contain a subgroup, then use multiplier to
    # extend counts of all elements in the subgroup
    if t.subgroup:
        mult = t.mult
        for term in t.subgroup:
            term[1] *= mult
        return t.subgroup
term.setParseAction(multiplyContents)
# add parse action to sum up multiple references to the same element
def sumByElement(tokens):
    elementsList = [t[0] for t in tokens]
    # construct set to see if there are duplicates
    duplicates = len(elementsList) > len(set(elementsList))
    # if there are duplicate element names, sum up by element and
    # return a new nested ParseResults
    if duplicates:
        ctr = defaultdict(int)
        for t in tokens:
            ctr[t[0]] += t[1]
        return ParseResults([ParseResults([k,v]) for k,v in ctr.items()])
formula.setParseAction(sumByElement)
# run some tests
tests = """\
H
NaCl
HO
H2O
HOH
(H2O)2
(H2O)2OH
((H2O)2OH)12
C6H5OH
""".splitlines()
for t in tests:
    if t.strip():
        results = formula.parseString(t)
        print(t, '->', dict(results.asList()))
This prints:
H -> {'H': 1}
NaCl -> {'Na': 1, 'Cl': 1}
HO -> {'H': 1, 'O': 1}
H2O -> {'H': 2, 'O': 1}
HOH -> {'H': 2, 'O': 1}
(H2O)2 -> {'H': 4, 'O': 2}
(H2O)2OH -> {'H': 5, 'O': 3}
((H2O)2OH)12 -> {'H': 60, 'O': 36}
C6H5OH -> {'H': 6, 'C': 6, 'O': 1}
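The grammar above only accepts integer multipliers, while the question also asks for non-integer stoichiometry. As a rough illustration of the same recursion (term :: (element | '(' formula ')') [number]), here is a minimal hand-written recursive descent sketch that uses only the standard library and accepts decimals written with a dot or a comma. It is not pyparsing code and not the answerer's method; the names TOKEN, tokenize and count_elements are made up for this example.

```python
import re
from collections import defaultdict

# One token is an element symbol, a decimal number (dot or comma), or a parenthesis.
TOKEN = re.compile(r"[A-Z][a-z]*|\d+(?:[.,]\d*)?|[.,]\d+|[()]")

def tokenize(text):
    tokens, pos = [], 0
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if m is None:
            raise ValueError("unexpected character %r at position %d" % (text[pos], pos))
        tokens.append(m.group())
        pos = m.end()
    return tokens

def count_elements(text):
    """Return {element: total count} for a possibly nested formula (hypothetical helper)."""
    tokens = tokenize(text)
    pos = 0

    def multiplier():
        # Optional number following an element or a closing parenthesis; defaults to 1.
        nonlocal pos
        if pos < len(tokens) and tokens[pos][0] in "0123456789.,":
            value = float(tokens[pos].replace(",", "."))
            pos += 1
            return value
        return 1.0

    def formula():
        # formula :: term+
        nonlocal pos
        counts = defaultdict(float)
        while pos < len(tokens) and tokens[pos] != ")":
            if tokens[pos] == "(":
                pos += 1
                inner = formula()  # recurse into the subgroup
                if pos >= len(tokens) or tokens[pos] != ")":
                    raise ValueError("missing ')'")
                pos += 1
            elif tokens[pos][0].isupper():
                inner = {tokens[pos]: 1.0}
                pos += 1
            else:
                raise ValueError("unexpected token %r" % tokens[pos])
            mult = multiplier()
            # Scale the term by its multiplier and add it to the running totals.
            for name, n in inner.items():
                counts[name] += n * mult
        if not counts:
            raise ValueError("empty formula or subgroup")
        return dict(counts)

    result = formula()
    if pos != len(tokens):
        raise ValueError("unbalanced ')'")
    return result

print(count_elements("C6H8(OH)4"))
print(count_elements("Fe0,5O"))
```

Each recursive call returns the totals for one parenthesized group, so the multiplier is applied once per nesting level on the way back out, which is the same effect the multiplyContents parse action achieves inside pyparsing.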
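Answer 1 (score: 1)
I think I found a solution myself. I had to write a recursive function that walks the results and outputs the list I want: each element with its stoichiometry, with no nesting. I had to modify my starting code slightly and use named results for my purposes: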
from pyparsing import Word, Group, ZeroOrMore, Combine,\
Optional, OneOrMore, ParseException, Literal, nums,\
Suppress, Dict, Forward
caps = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
lowers = caps.lower()
digits = "0123456789"
integer = Word( digits )
parl = Literal("(").suppress()
parr = Literal(")").suppress()
element = Word( caps, lowers )
separator = Literal( "," ).setParseAction(lambda s,l,t: t[0].replace(',','.')) | Literal( "." )
nreal = (Combine( integer + Optional( separator +\
Optional( integer ) ))\
| Combine( separator + integer )).setParseAction( lambda s,l,t: [ float(t[0]) ] )
block = Forward()
groupElem = (Group( element('elem') + Optional( nreal, default=1)('esteq') ))('dupla') | \
Group( parl + block + parr + Optional( nreal,default=1 )('modi'))
block << groupElem + ZeroOrMore( groupElem )
formula = OneOrMore( block )
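Here is my function. I hope it helps anyone with a similar problem. I think this solution is pretty ugly... if anyone has a better, more elegant solution, I'm all ears!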
def solu(formula):
    final = []
    def diver(entr,mult=1):
        resul = list()
        # If modi is not empty, it is an enclosed group
        # And we must multiply everything inside by modi
        if entr.modi != '':
            for y in entr:
                try:
                    resul.append(diver(y,entr.modi))
                except AttributeError:
                    pass
        # Else, it is just an atom, and we return it
        else:
            resul.append(entr.elem)
            resul.append(entr.esteq*mult)
        return resul
    def doubles(entr):
        resul = []
        # If entr does not contain lists
        # It is an atom
        if sum([1 for y in entr if isinstance(y,list)]) == 0:
            final.append(entr)
            return entr
        else:
            # And if it isn't an atom? We dive further
            # and call doubles until it is an atom
            for y in entr:
                doubles(y)
    for member in formula:
        # If member is already an atom, add it directly to final
        if sum([1 for x in diver(member) if isinstance(x,list)]) == 0:
            final.append(diver(member))
        else:
            # If not, call doubles on the clean member (without modi)
            # and it takes care of adding atoms to final
            doubles(diver(member))
    return final
Finally, the solu function does the trick:
>>> solu(formula.parseString('C6H8(OH)4'))
[['C', 6.0], ['H', 8.0], ['O', 4.0], ['H', 4.0]]
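Note that 'H' appears twice in this result, while the question asks for the total per element. One possible way to finish the job is to merge the flat list afterwards. The sketch below uses only plain Python; merge_duplicates is a hypothetical helper name, not part of pyparsing or of the code above, and it relies on dicts keeping insertion order (Python 3.7+).

```python
def merge_duplicates(pairs):
    """Sum the amounts of repeated elements in a flat [[element, amount], ...] list."""
    totals = {}
    for name, amount in pairs:
        # Start from 0 for an element seen for the first time, then accumulate.
        totals[name] = totals.get(name, 0) + amount
    # Return the same list-of-pairs shape that solu produces.
    return [[name, amount] for name, amount in totals.items()]

# Input copied from the solu output shown above.
merged = merge_duplicates([['C', 6.0], ['H', 8.0], ['O', 4.0], ['H', 4.0]])
print(merged)  # [['C', 6.0], ['H', 12.0], ['O', 4.0]]
```

In practice this could be called as merge_duplicates(solu(formula.parseString('C6H8(OH)4'))), assuming solu returns the flat list shown above.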