Grouping multiple sections (matches) with pyparsing

Posted: 2017-08-17 15:27:47

Tags: python pyparsing

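I can't figure out how to use pyparsing to group zero or more repeated sections in a text. In other words, I want to merge multiple matches into a single named result set. Note that I want to use pyparsing because I have many sections with different rules.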

from pyparsing import *    

input_text = """
Projects
project a created in c#

Education
university of college

Projects
project b created in python
"""

project_marker = LineStart() + Literal('Projects') + LineEnd()
education_marker = LineStart() + Literal('Education') + LineEnd()
markers = project_marker ^ education_marker

project_section = Group(
    project_marker + SkipTo(markers | stringEnd).setResultsName('project')
).setResultsName('projects')
education_section = Group(
    education_marker + SkipTo(markers | stringEnd).setResultsName('education')
).setResultsName('educations')
sections = project_section ^ education_section

text = StringStart() + SkipTo(sections | StringEnd())
doc = Optional(text) + ZeroOrMore(sections)
result = doc.parseString(input_text)

print(result)
# ['', ['Projects', '\n', 'project a created in c#'], ['Education', '\n', 'university of college'], ['Projects', '\n', 'project b created in python']]
print(result.projects)
# ['Projects', '\n', 'project b created in python']
print(result.projects[0].project)
# AttributeError: 'str' object has no attribute 'project'

2 answers:

Answer 0 (score: 2)

Here is my tentative answer, not that I'm proud of it. I lifted a large chunk of it from https://stackoverflow.com/a/5824309/131187.

>>> import pyparsing as pp
>>> pp.ParserElement.setDefaultWhitespaceChars(" \t")
>>> EOL = pp.LineEnd().suppress()
>>> keyword = pp.Or([pp.Keyword('Projects'), pp.Keyword('Education')])
>>> line = pp.LineStart() + pp.NotAny(keyword) + pp.SkipTo(pp.LineEnd(), failOn=pp.LineStart()+pp.LineEnd()) + EOL
>>> lines = pp.OneOrMore(line)
>>> section = pp.Or([pp.Keyword('Projects'), pp.Keyword('Education')])('section') + EOL + lines('lines')
>>> sections = pp.OneOrMore(section)
>>> r = sections.parseString(input_text)

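As the session further down shows, the parser does gather the information correctly, and in a form from which it can be reassembled. However, I could not find a way to access all of the results that are evidently available in the value returned by parseString.

I resorted to applying eval to its repr. Having done that, I was able to pick out all of the pieces and assign them to a dict-like object.

Honestly, this would be easier without pyparsing. Read a line and note whether it is a keyword. If it is, remember it. Then, until you read another keyword, put every line you read into a dictionary under the most recent keyword.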
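
The line-by-line approach just described can be sketched in plain Python. This is only an illustration, not part of the original answer: the function name group_sections, the KEYWORDS set, and the sample text (copied from the question) are assumptions, and lines that appear before the first heading are simply ignored.

```python
from collections import defaultdict

# Assumption: the section headings are known in advance, as in the question.
KEYWORDS = {"Projects", "Education"}


def group_sections(text, keywords=KEYWORDS):
    """Collect the non-blank lines of `text` under the most recent heading."""
    sections = defaultdict(list)
    current = None  # no heading seen yet
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue  # skip blank lines between sections
        if line in keywords:
            current = line  # remember the most recent heading
        elif current is not None:
            sections[current].append(line)
    return dict(sections)


sample = """
Projects
project a created in c#

Education
university of college

Projects
project b created in python
"""

grouped = group_sections(sample)
print(grouped["Projects"])
print(grouped["Education"])
```

Both "Projects" blocks end up in the same list, which is the merged view the question asks for. A heading with no lines under it does not appear in the result at all.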

>>> repr(r)
"(['Projects', 'project 1', 'project 2', 'project 3', '', 'Education', 'institution 1', 'institution 2', 'institution 3', 'institution 4', '', 'Projects', 'assignment 5', 'assignment 8', 'assignment 10', ''], {'lines': [(['project 1', 'project 2', 'project 3', ''], {}), (['institution 1', 'institution 2', 'institution 3', 'institution 4', ''], {}), (['assignment 5', 'assignment 8', 'assignment 10', ''], {})], 'section': ['Projects', 'Education', 'Projects']})"
>>> evil_r = eval(repr(r))
>>> evil_r
(['Projects', 'project 1', 'project 2', 'project 3', '', 'Education', 'institution 1', 'institution 2', 'institution 3', 'institution 4', '', 'Projects', 'assignment 5', 'assignment 8', 'assignment 10', ''], {'lines': [(['project 1', 'project 2', 'project 3', ''], {}), (['institution 1', 'institution 2', 'institution 3', 'institution 4', ''], {}), (['assignment 5', 'assignment 8', 'assignment 10', ''], {})], 'section': ['Projects', 'Education', 'Projects']})
>>> evil_r[1]['lines']
[(['project 1', 'project 2', 'project 3', ''], {}), (['institution 1', 'institution 2', 'institution 3', 'institution 4', ''], {}), (['assignment 5', 'assignment 8', 'assignment 10', ''], {})]
>>> evil_r[1]['section']
['Projects', 'Education', 'Projects']
>>> from collections import defaultdict
>>> section_info = defaultdict(list)
>>> for k, kind in enumerate(evil_r[1]['section']):
...     section_info[kind].extend(evil_r[1]['lines'][k][0][:-1])
... 
>>> for section in section_info:
...     section, section_info[section]
...     
('Education', ['institution 1', 'institution 2', 'institution 3', 'institution 4'])
('Projects', ['project 1', 'project 2', 'project 3', 'assignment 5', 'assignment 8', 'assignment 10'])

EDIT: Or you could do it this way. It needs tidying up, but at least it doesn't use anything unorthodox.

>>> input_text = open('temp.txt').read()
>>> import pyparsing as pp
>>> pp.ParserElement.setDefaultWhitespaceChars(" \t")
>>> from collections import defaultdict
>>> class Accum:
...     def __init__(self):
...         self.current_section = None
...         self.result = defaultdict(list)
...     def __call__(self, s):
...         if s[0] in ['Projects', 'Education']:
...             self.current_section = s[0]
...         else:
...             self.result[self.current_section].extend(s[:-1])
... 
>>> accum = Accum()
>>> EOL = pp.LineEnd().suppress()
>>> keyword = pp.Or([pp.Keyword('Projects'), pp.Keyword('Education')])
>>> line = pp.LineStart() + pp.NotAny(keyword) + pp.SkipTo(pp.LineEnd(), failOn=pp.LineStart()+pp.LineEnd()) + EOL
>>> lines = pp.OneOrMore(line)
>>> section = pp.Or([pp.Keyword('Projects'), pp.Keyword('Education')]).setParseAction(accum) + EOL + lines.setParseAction(accum)
>>> sections = pp.OneOrMore(section)
>>> r = sections.parseString(input_text)
>>> accum.result['Education']
['institution 1', 'institution 2', 'institution 3', 'institution 4']
>>> accum.result['Projects']
['project 1', 'project 2', 'project 3', 'assignment 5', 'assignment 8', 'assignment 10']

Answer 1 (score: 0)

Thanks to @PaulMcG, the solution is to add listAllMatches=True to setResultsName; see https://pythonhosted.org/pyparsing/pyparsing.ParserElement-class.html#setResultsName

project_marker = LineStart() + Literal('Projects') + LineEnd()
education_marker = LineStart() + Literal('Education') + LineEnd()
markers = project_marker ^ education_marker

project_section = Group(
    project_marker + SkipTo(markers | stringEnd).setResultsName('project')
).setResultsName('projects', listAllMatches=True)
education_section = Group(
    education_marker + SkipTo(markers | stringEnd).setResultsName('education')
).setResultsName('educations', listAllMatches=True)
sections = project_section ^ education_section

text = StringStart() + SkipTo(sections | StringEnd())
doc = Optional(text) + ZeroOrMore(sections)
result = doc.parseString(input_text)
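
To show what the flag changes, here is a minimal sketch on a deliberately simplified grammar rather than the one above. The names label, number, build_parser, and "entries" are made up for illustration, and the sketch assumes pyparsing is installed and still accepts the camelCase names used in this thread.

```python
import pyparsing as pp

# Hypothetical toy grammar: repeated "<label> <number>" pairs.
label = pp.Word(pp.alphas)
number = pp.Word(pp.nums)


def build_parser(list_all_matches):
    """Build the same grammar with or without listAllMatches on the group."""
    entry = pp.Group(label("label") + number("number"))
    named = entry.setResultsName("entries", listAllMatches=list_all_matches)
    return pp.OneOrMore(named)


text = "a 1 b 2 c 3"

# Default behaviour: the name keeps only the last matching group.
last_only = build_parser(False).parseString(text)
print(last_only["entries"].asList())

# With listAllMatches=True the name collects every matching group,
# and each group still exposes its own named fields.
all_matches = build_parser(True).parseString(text)
print([entry["label"] for entry in all_matches["entries"]])
```

The same should hold for the grammar in this answer: with the flag set, result.projects is expected to be a list of groups, so result.projects[0].project refers to the text of the first Projects section instead of raising AttributeError.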