我无法弄清楚如何使用pyparsing对文本中的零个或多个重复部分进行分组。换句话说,我想将多个匹配合并到一个命名结果集中。注意,我想使用pyparsing,因为我有很多不同规则的部分。
from pyparsing import *
input_text = """
Projects
project a created in c#
Education
university of college
Projects
project b created in python
"""
project_marker = LineStart() + Literal('Projects') + LineEnd()
education_marker = LineStart() + Literal('Education') + LineEnd()
markers = project_marker ^ education_marker
project_section = Group(
project_marker + SkipTo(markers | stringEnd).setResultsName('project')
).setResultsName('projects')
education_section = Group(
education_marker + SkipTo(markers | stringEnd).setResultsName('education')
).setResultsName('educations')
sections = project_section ^ education_section
text = StringStart() + SkipTo(sections | StringEnd())
doc = Optional(text) + ZeroOrMore(sections)
result = doc.parseString(input_text)
print(result)
# ['', ['Projects', '\n', 'project a created in c#'], ['Education', '\n', 'virginia tech'], ['Projects', '\n', 'project b created in python']]
print(result.projects)
# ['Projects', '\n', 'project b created in python']
print(result.projects[0].project)
# AttributeError: 'str' object has no attribute 'project'
答案 0 :(得分:2)
这是我的试探性答案,而不是我为此感到骄傲。我从https://stackoverflow.com/a/5824309/131187中榨取了一大块。
>>> import pyparsing as pp
>>> pp.ParserElement.setDefaultWhitespaceChars(" \t")
>>> EOL = pp.LineEnd().suppress()
>>> keyword = pp.Or([pp.Keyword('Projects'), pp.Keyword('Education')])
>>> line = pp.LineStart() + pp.NotAny(keyword) + pp.SkipTo(pp.LineEnd(), failOn=pp.LineStart()+pp.LineEnd()) + EOL
>>> lines = pp.OneOrMore(line)
>>> section = pp.Or([pp.Keyword('Projects'), pp.Keyword('Education')])('section') + EOL + lines('lines')
>>> sections = pp.OneOrMore(section)
>>> r = sections.parseString(input_text)
正如您可以在这句话下方看到的那样,解析器成功地正确地收集信息,并以可以组装的方式收集信息,如现在所示。但是,我找不到一种方法来访问parseString
中明显可用的所有结果。
我尝试将eval
应用于其repr
代表。完成后,我能够挑选出所有碎片并将它们分配给类似dict的物体。
老实说,没有pyparsing,这会更容易。阅读一行,注意它是否是关键字。如果是,请记住它。然后,直到您阅读另一个关键字,只需将您在字典中读取的所有行放在最新的关键字下。
>>> repr(r)
"(['Projects', 'project 1', 'project 2', 'project 3', '', 'Education', 'institution 1', 'institution 2', 'institution 3', 'institution 4', '', 'Projects', 'assignment 5', 'assignment 8', 'assignment 10', ''], {'lines': [(['project 1', 'project 2', 'project 3', ''], {}), (['institution 1', 'institution 2', 'institution 3', 'institution 4', ''], {}), (['assignment 5', 'assignment 8', 'assignment 10', ''], {})], 'section': ['Projects', 'Education', 'Projects']})"
>>> evil_r = eval(repr(r))
>>> evil_r
(['Projects', 'project 1', 'project 2', 'project 3', '', 'Education', 'institution 1', 'institution 2', 'institution 3', 'institution 4', '', 'Projects', 'assignment 5', 'assignment 8', 'assignment 10', ''], {'lines': [(['project 1', 'project 2', 'project 3', ''], {}), (['institution 1', 'institution 2', 'institution 3', 'institution 4', ''], {}), (['assignment 5', 'assignment 8', 'assignment 10', ''], {})], 'section': ['Projects', 'Education', 'Projects']})
>>> evil_r[1]['lines']
[(['project 1', 'project 2', 'project 3', ''], {}), (['institution 1', 'institution 2', 'institution 3', 'institution 4', ''], {}), (['assignment 5', 'assignment 8', 'assignment 10', ''], {})]
>>> evil_r[1]['section']
['Projects', 'Education', 'Projects']
>>> from collections import defaultdict
>>> section_info = defaultdict(list)
>>> for k, kind in enumerate(evil_r[1]['section']):
... section_info[kind].extend(evil_r[1]['lines'][k][0][:-1])
>>> for section in section_info:
... section, section_info[section]
...
('Education', ['institution 1', 'institution 2', 'institution 3', 'institution 4'])
('Projects', ['project 1', 'project 2', 'project 3', 'assignment 5', 'assignment 8', 'assignment 10'])
编辑:或者你可以这样做。需要收拾整理。至少它没有使用任何非正统的东西。
>>> input_text = open('temp.txt').read()
>>> import pyparsing as pp
>>> pp.ParserElement.setDefaultWhitespaceChars(" \t")
>>> from collections import defaultdict
>>> class Accum:
... def __init__(self):
... self.current_section = None
... self.result = defaultdict(list)
... def __call__(self, s):
... if s[0] in ['Projects', 'Education']:
... self.current_section = s[0]
... else:
... self.result[self.current_section].extend(s[:-1])
...
>>> accum = Accum()
>>> EOL = pp.LineEnd().suppress()
>>> keyword = pp.Or([pp.Keyword('Projects'), pp.Keyword('Education')])
>>> line = pp.LineStart() + pp.NotAny(keyword) + pp.SkipTo(pp.LineEnd(), failOn=pp.LineStart()+pp.LineEnd()) + EOL
>>> lines = pp.OneOrMore(line)
>>> section = pp.Or([pp.Keyword('Projects'), pp.Keyword('Education')]).setParseAction(accum) + EOL + lines.setParseAction(accum)
>>> sections = pp.OneOrMore(section)
>>> r = sections.parseString(input_text)
>>> accum.result['Education']
['institution 1', 'institution 2', 'institution 3', 'institution 4']
>>> accum.result['Projects']
['project 1', 'project 2', 'project 3', 'assignment 5', 'assignment 8', 'assignment 10']
答案 1 :(得分:0)
感谢@PaulMcG解决方案是将listAllMatches=True
添加到setResultsName
,请参阅https://pythonhosted.org/pyparsing/pyparsing.ParserElement-class.html#setResultsName。
project_marker = LineStart() + Literal('Projects') + LineEnd()
education_marker = LineStart() + Literal('Education') + LineEnd()
markers = project_marker ^ education_marker
project_section = Group(
project_marker + SkipTo(markers | stringEnd).setResultsName('project')
).setResultsName('projects', listAllMatches=True)
education_section = Group(
education_marker + SkipTo(markers | stringEnd).setResultsName('education')
).setResultsName('educations', listAllMatches=True)
sections = project_section ^ education_section
text = StringStart() + SkipTo(sections | StringEnd())
doc = Optional(text) + ZeroOrMore(sections)
result = doc.parseString(input_text)