Question

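I can't figure out how to use pyparsing to group zero or more repeated sections in a text. In other words, I want to merge multiple matches into a single named result set. Note that I want to use pyparsing because I have many sections with different rules.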

from pyparsing import *    

input_text = """
Projects
project a created in c#

Education
university of college

Projects
project b created in python
"""

project_marker = LineStart() + Literal('Projects') + LineEnd()
education_marker = LineStart() + Literal('Education') + LineEnd()
markers = project_marker ^ education_marker

project_section = Group(
    project_marker + SkipTo(markers | stringEnd).setResultsName('project')
).setResultsName('projects')
education_section = Group(
    education_marker + SkipTo(markers | stringEnd).setResultsName('education')
).setResultsName('educations')
sections = project_section ^ education_section

text = StringStart() + SkipTo(sections | StringEnd())
doc = Optional(text) + ZeroOrMore(sections)
result = doc.parseString(input_text)

print(result)
# ['', ['Projects', '\n', 'project a created in c#'], ['Education', '\n', 'university of college'], ['Projects', '\n', 'project b created in python']]
print(result.projects)
# ['Projects', '\n', 'project b created in python']
print(result.projects[0].project)
# AttributeError: 'str' object has no attribute 'project'

Answer 1

This is my tentative answer, not that I'm proud of it. I cribbed a large chunk of it from https://stackoverflow.com/a/5824309/131187.

>>> import pyparsing as pp
>>> pp.ParserElement.setDefaultWhitespaceChars(" \t")
>>> EOL = pp.LineEnd().suppress()
>>> keyword = pp.Or([pp.Keyword('Projects'), pp.Keyword('Education')])
>>> line = pp.LineStart() + pp.NotAny(keyword) + pp.SkipTo(pp.LineEnd(), failOn=pp.LineStart()+pp.LineEnd()) + EOL
>>> lines = pp.OneOrMore(line)
>>> section = pp.Or([pp.Keyword('Projects'), pp.Keyword('Education')])('section') + EOL + lines('lines')
>>> sections = pp.OneOrMore(section)
>>> r = sections.parseString(input_text)

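As the repr output further down shows, the parser does collect the information correctly, and in a form that can be reassembled. However, I could not find a way to access all of the results that are evidently available from parseString.

So I resorted to applying eval to the repr of the result. Once that was done, I was able to pick out all of the pieces and assign them to a dict-like object.

Honestly, this would be easier without pyparsing. Read a line and check whether it is a keyword. If it is, remember it. Then, until you read another keyword, just put every line you read into a dictionary under the most recent keyword.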
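
The line-by-line approach just described could be sketched roughly as follows. This is only an illustration using the standard library: the function name group_sections and the KEYWORDS tuple are made up for this example, and it assumes each heading sits alone on its own line, as in the question's input.

```python
from collections import defaultdict

# Section headings taken from the question's sample input (assumed complete).
KEYWORDS = ("Projects", "Education")


def group_sections(text, keywords=KEYWORDS):
    """Collect every non-blank line under the most recently seen heading."""
    sections = defaultdict(list)
    current = None  # no heading seen yet
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue  # skip blank lines between sections
        if line in keywords:
            current = line  # remember the latest heading
        elif current is not None:
            sections[current].append(line)
    return dict(sections)


input_text = """
Projects
project a created in c#

Education
university of college

Projects
project b created in python
"""

sections = group_sections(input_text)
print(sections["Projects"])
print(sections["Education"])
```

Lines that appear before the first heading are silently dropped here; whether that is acceptable depends on the real input.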

>>> repr(r)
"(['Projects', 'project 1', 'project 2', 'project 3', '', 'Education', 'institution 1', 'institution 2', 'institution 3', 'institution 4', '', 'Projects', 'assignment 5', 'assignment 8', 'assignment 10', ''], {'lines': [(['project 1', 'project 2', 'project 3', ''], {}), (['institution 1', 'institution 2', 'institution 3', 'institution 4', ''], {}), (['assignment 5', 'assignment 8', 'assignment 10', ''], {})], 'section': ['Projects', 'Education', 'Projects']})"
>>> evil_r = eval(repr(r))
>>> evil_r
(['Projects', 'project 1', 'project 2', 'project 3', '', 'Education', 'institution 1', 'institution 2', 'institution 3', 'institution 4', '', 'Projects', 'assignment 5', 'assignment 8', 'assignment 10', ''], {'lines': [(['project 1', 'project 2', 'project 3', ''], {}), (['institution 1', 'institution 2', 'institution 3', 'institution 4', ''], {}), (['assignment 5', 'assignment 8', 'assignment 10', ''], {})], 'section': ['Projects', 'Education', 'Projects']})
>>> evil_r[1]['lines']
[(['project 1', 'project 2', 'project 3', ''], {}), (['institution 1', 'institution 2', 'institution 3', 'institution 4', ''], {}), (['assignment 5', 'assignment 8', 'assignment 10', ''], {})]
>>> evil_r[1]['section']
['Projects', 'Education', 'Projects']
>>> from collections import defaultdict
>>> section_info = defaultdict(list)
>>> for k, kind in enumerate(evil_r[1]['section']):
...     section_info[kind].extend(evil_r[1]['lines'][k][0][:-1])
>>> for section in section_info:
...     section, section_info[section]
...     
('Education', ['institution 1', 'institution 2', 'institution 3', 'institution 4'])
('Projects', ['project 1', 'project 2', 'project 3', 'assignment 5', 'assignment 8', 'assignment 10'])

Edit: Or you could do it this way. It needs tidying up, but at least it doesn't use anything unorthodox.

>>> input_text = open('temp.txt').read()
>>> import pyparsing as pp
>>> pp.ParserElement.setDefaultWhitespaceChars(" \t")
>>> from collections import defaultdict
>>> class Accum:
...     def __init__(self):
...         self.current_section = None
...         self.result = defaultdict(list)
...     def __call__(self, s):
...         if s[0] in ['Projects', 'Education']:
...             self.current_section = s[0]
...         else:
...             self.result[self.current_section].extend(s[:-1])
... 
>>> accum = Accum()
>>> EOL = pp.LineEnd().suppress()
>>> keyword = pp.Or([pp.Keyword('Projects'), pp.Keyword('Education')])
>>> line = pp.LineStart() + pp.NotAny(keyword) + pp.SkipTo(pp.LineEnd(), failOn=pp.LineStart()+pp.LineEnd()) + EOL
>>> lines = pp.OneOrMore(line)
>>> section = pp.Or([pp.Keyword('Projects'), pp.Keyword('Education')]).setParseAction(accum) + EOL + lines.setParseAction(accum)
>>> sections = pp.OneOrMore(section)
>>> r = sections.parseString(input_text)
>>> accum.result['Education']
['institution 1', 'institution 2', 'institution 3', 'institution 4']
>>> accum.result['Projects']
['project 1', 'project 2', 'project 3', 'assignment 5', 'assignment 8', 'assignment 10']

Answer 2

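Thanks to @PaulMcG, the solution is to add listAllMatches=True to setResultsName; see https://pythonhosted.org/pyparsing/pyparsing.ParserElement-class.html#setResultsName. Without it, a results name that matches more than once keeps only the last match, which is why result.projects in the question held a single section.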

project_marker = LineStart() + Literal('Projects') + LineEnd()
education_marker = LineStart() + Literal('Education') + LineEnd()
markers = project_marker ^ education_marker

project_section = Group(
    project_marker + SkipTo(markers | stringEnd).setResultsName('project')
).setResultsName('projects', listAllMatches=True)
education_section = Group(
    education_marker + SkipTo(markers | stringEnd).setResultsName('education')
).setResultsName('educations', listAllMatches=True)
sections = project_section ^ education_section

text = StringStart() + SkipTo(sections | StringEnd())
doc = Optional(text) + ZeroOrMore(sections)
result = doc.parseString(input_text)
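
To show how the accumulated matches can then be read back, here is a small self-contained sketch. It deliberately uses a simplified grammar ("Heading: word" pairs on one line) rather than the section grammar above, so the input format and the variable names are assumptions made for illustration; only the use of Group, setResultsName and listAllMatches=True mirrors the answer.

```python
import pyparsing as pp

# Simplified stand-in grammar: "Projects: <word>" or "Education: <word>".
project_section = pp.Group(
    pp.Keyword("Projects") + pp.Suppress(":") + pp.Word(pp.alphas)("project")
).setResultsName("projects", listAllMatches=True)

education_section = pp.Group(
    pp.Keyword("Education") + pp.Suppress(":") + pp.Word(pp.alphas)("education")
).setResultsName("educations", listAllMatches=True)

doc = pp.ZeroOrMore(project_section | education_section)
result = doc.parseString("Projects: alpha Education: college Projects: beta")

# Each named result is now a list of groups, one per match.
project_names = [section.project for section in result.projects]
education_names = [section.education for section in result.educations]

print(project_names)
print(education_names)
print(result.projects[0].project)
```

With listAllMatches=True, result.projects[0] is a group rather than a plain string, so the attribute lookup that raised AttributeError in the question should now succeed.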

Grouping multiple sections (matches) with Pyparsing
