Question

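Suppose I have a string s = '{aaaa{bc}xx{d{e}}f}', which has the structure of a nested list. I would like a hierarchical representation of it while still being able to access the substrings that correspond to the valid sub-lists. For simplicity, let's forget about the hierarchy: I just want a list of the substrings that correspond to valid sub-lists, something like: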

['{aaaa{bc}xx{d{e}}f}', '{bc}', '{d{e}}', '{e}']
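
For reference, the target list can be reproduced without pyparsing. The following is only a minimal sketch of a stack-based scan, not the approach discussed in the rest of this thread; the function name balanced_substrings is made up for this illustration, and it assumes single-character delimiters with no escaping.

```python
def balanced_substrings(text, open_ch="{", close_ch="}"):
    """Return every balanced group in text, ordered by opening-brace position."""
    stack = []   # start offsets of groups that are still open
    spans = []   # (start, end) offsets of each completed group
    for pos, ch in enumerate(text):
        if ch == open_ch:
            stack.append(pos)
        elif ch == close_ch:
            if not stack:
                raise ValueError(f"unmatched {close_ch!r} at offset {pos}")
            spans.append((stack.pop(), pos + 1))
    if stack:
        raise ValueError(f"unmatched {open_ch!r} at offset {stack[-1]}")
    # Groups complete innermost-first; sorting by start offset restores
    # the left-to-right order of the opening braces.
    return [text[start:end] for start, end in sorted(spans)]


s = '{aaaa{bc}xx{d{e}}f}'
print(balanced_substrings(s))
# ['{aaaa{bc}xx{d{e}}f}', '{bc}', '{d{e}}', '{e}']
```

This gives the flat list only; it does not build the hierarchy or any named results.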

With nestedExpr, one can obtain a nested structure that includes all the valid sub-lists:

import pyparsing as pp

s = '{aaaa{bc}xx{d{e}}f}'
not_braces = pp.CharsNotIn('{}')
expr = pp.nestedExpr('{', '}', content=not_braces)
res = expr('L0 Contents').parseString(s)
print(res.dump())

This prints:

[['aaaa', ['bc'], 'xx', ['d', ['e']], 'f']]
- L0 Contents: [['aaaa', ['bc'], 'xx', ['d', ['e']], 'f']]
  [0]:
    ['aaaa', ['bc'], 'xx', ['d', ['e']], 'f']
    [0]:
      aaaa
    [1]:
      ['bc']
    [2]:
      xx
    [3]:
      ['d', ['e']]
      [0]:
        d
      [1]:
        ['e']
    [4]:
      f

To get the original string representation of a parsed element, I have to wrap it in pyparsing.originalTextFor(). However, this removes all the sub-lists from the result:

s = '{aaaa{bc}xx{d{e}}f}'
not_braces = pp.CharsNotIn('{}')
expr = pp.nestedExpr('{', '}', content=not_braces)
res = pp.originalTextFor(expr)('L0 Contents').parseString(s)
print(res.dump())

This prints:

['{aaaa{bc}xx{d{e}}f}']
- L0 Contents: '{aaaa{bc}xx{d{e}}f}'

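In effect, the originalTextFor() wrapper flattens everything inside it.

The question: can originalTextFor() keep the structure of its child parsed elements? (It would be nice to have a non-discarding analog that could be used to create named tokens for the parsed sub-expressions.)

Note that scanString() will only give me the level-0 sub-lists and will not look inside them. I guess I could use setParseAction(), but the internal mode of operation of ParserElement is not documented, and I have not had a chance to dig into the source code yet. Thanks!

Update 1. Somewhat related: https://stackoverflow.com/a/39885391/11932910 https://stackoverflow.com/a/17411455/11932910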

Answer 1

Instead of using originalTextFor, wrap your nestedExpr expression in locatedExpr:

import pyparsing as pp
parser = pp.locatedExpr(pp.nestedExpr('{','}'))

locatedExpr will return a 3-element ParseResults:

the start location
the parsed value
the end location
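
To see that three-element shape in isolation, here is a small sketch that wraps a single Word rather than the nested expression. The input "abc" and the variable names are chosen only for illustration; the result names locn_start, value and locn_end are the ones locatedExpr assigns.

```python
import pyparsing as pp

# locatedExpr wraps an expression in a group of [start, value, end].
located_word = pp.locatedExpr(pp.Word(pp.alphas))
result = located_word.parseString("abc")

group = result[0]
print(group.asList())        # [0, 'abc', 3]
# The same three items are also reachable by name.
print(group["locn_start"], group["value"], group["locn_end"])
```

The end location is exclusive, so slicing the input with [start:end] yields the matched text.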

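You can then attach a parse action to this parser to modify the parsed tokens in place and to add your own original_string named result, containing the original text sliced from the input string: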

def extract_original_text(st, loc, tokens):
    start, tokens[:], end = tokens[0]
    tokens['original_string'] = st[start:end]
parser.addParseAction(extract_original_text)
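
The first line of that parse action packs three steps into one unpacking assignment. The sketch below replays it on a plain Python list standing in for the ParseResults (the data is shortened and purely illustrative), to show that the slice target replaces the token list's contents in place.

```python
# Plain-list stand-in for the tokens the parse action receives:
# a single group holding [start, parsed value, end].
tokens = [[0, ['aaaa', ['bc'], 'xx'], 19]]

# The right-hand side is evaluated first, then the targets are assigned
# left to right: start gets 0, the slice assignment replaces the whole
# contents of `tokens` with the parsed value, and end gets 19.
start, tokens[:], end = tokens[0]

print(start, end)   # 0 19
print(tokens)       # ['aaaa', ['bc'], 'xx']
```

Because tokens is modified in place and nothing is returned, the parser keeps using the same, now unwrapped, token list.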

Now use this parser to parse and dump the results:

result = parser.parseString(s)
print(result.dump())

Prints:

['aaaa', ['bc'], 'xx', ['d', ['e']], 'f']
- original_string: '{aaaa{bc}xx{d{e}}f}'

And access the original_string result using:

print(result.original_string)

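EDIT: how to attach original_string to each nested substructure

Maintaining the original strings on the substructures takes more work than nested_expr alone can do. You pretty much have to implement your own recursive parser.

To implement your own version of nested_expr, you would start with something like this: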
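
To make the idea of "your own recursive parser" concrete before the pyparsing version, here is a hand-written sketch in plain Python. It is not part of the pyparsing solution; the function name parse_group and the dict layout with 'items' and 'original_string' keys are assumptions made for this illustration. Each recursive call remembers where its group started, so it can slice the original text once the closing brace is reached.

```python
def parse_group(text, pos=0):
    """Parse one '{...}' group starting at text[pos].

    Returns (node, next_pos), where node is a dict holding the parsed
    'items' and the 'original_string' sliced from the input.
    """
    if pos >= len(text) or text[pos] != "{":
        raise ValueError(f"expected '{{' at offset {pos}")
    start = pos
    pos += 1
    items = []
    while pos < len(text) and text[pos] != "}":
        if text[pos] == "{":
            child, pos = parse_group(text, pos)   # recurse into the sub-list
            items.append(child)
        else:
            end = pos
            while end < len(text) and text[end] not in "{}":
                end += 1
            items.append(text[pos:end])           # run of non-brace characters
            pos = end
    if pos >= len(text):
        raise ValueError(f"unclosed '{{' opened at offset {start}")
    pos += 1                                      # consume the closing brace
    return {"items": items, "original_string": text[start:pos]}, pos


tree, end = parse_group('{aaaa{bc}xx{d{e}}f}')
print(tree["original_string"])               # {aaaa{bc}xx{d{e}}f}
print(tree["items"][1]["original_string"])   # {bc}
```

The pyparsing version that follows does the same bookkeeping, with locatedExpr supplying the start and end locations.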

LBRACE, RBRACE = map(pp.Suppress, "{}")
expr = pp.Forward()

term = pp.Word(pp.alphas)
expr_group = pp.Group(LBRACE + expr + RBRACE)
expr_content = term | expr_group

expr <<= expr_content[...]

# same input string as in the question
sample = '{aaaa{bc}xx{d{e}}f}'
print(sample)
print(expr.parseString(sample).dump())

This dumps the parsed results, without the 'original_string' names:

{aaaa{bc}xx{d{e}}f}
[['aaaa', ['bc'], 'xx', ['d', ['e']], 'f']]
[0]:
  ['aaaa', ['bc'], 'xx', ['d', ['e']], 'f']
  [0]:
    aaaa
  [1]:
    ['bc']
  [2]:
    xx
  [3]:
    ['d', ['e']]
    [0]:
      d
    [1]:
      ['e']
  [4]:
    f

To add the 'original_string' names, we first change the Group to the locatedExpr wrapper.

expr_group = pp.locatedExpr(LBRACE + expr + RBRACE)

This adds the start and end locations to each nested subgroup (which are not accessible when using nestedExpr).

{aaaa{bc}xx{d{e}}f}
[[0, 'aaaa', [5, 'bc', 9], 'xx', [11, 'd', [13, 'e', 16], 17], 'f', 19]]
[0]:
  [0, 'aaaa', [5, 'bc', 9], 'xx', [11, 'd', [13, 'e', 16], 17], 'f', 19]
  - locn_end: 19
  - locn_start: 0
  - value: ['aaaa', [5, 'bc', 9], 'xx', [11, 'd', [13, 'e', 16], 17], 'f']
    [0]:
      aaaa
    [1]:
      [5, 'bc', 9]
      - locn_end: 9
      - locn_start: 5
      - value: ['bc']
...

Our parse action is now more complicated, too.

def extract_original_text(st, loc, tokens):
    # pop/delete names and list items inserted by locatedExpr
    # (save start and end locations to local vars)
    tt = tokens[0]
    start = tt.pop("locn_start")
    end = tt.pop("locn_end")
    tt.pop("value")
    del tt[0]
    del tt[-1]

    # add 'original_string' results name
    orig_string = st[start:end]
    tt['original_string'] = orig_string

expr_group.addParseAction(extract_original_text)

With this change, you will now get this structure:

{aaaa{bc}xx{d{e}}f}
[['aaaa', ['bc'], 'xx', ['d', ['e']], 'f']]
[0]:
  ['aaaa', ['bc'], 'xx', ['d', ['e']], 'f']
  - original_string: '{aaaa{bc}xx{d{e}}f}'
  [0]:
    aaaa
  [1]:
    ['bc']
    - original_string: '{bc}'
  [2]:
    xx
  [3]:
    ['d', ['e']]
    - original_string: '{d{e}}'
    [0]:
      d
    [1]:
      ['e']
      - original_string: '{e}'
  [4]:
    f

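Note: there is a limitation in the current version of ParseResults.dump that shows only keys or sub-items, but not both. This output requires a fix that removes that limitation, to be released in the next pyparsing version. But even though dump() does not show these substructures, they are there in your actual structure, as you can see if you print out the repr of the results: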

print(repr(result[0]))

(['aaaa', (['bc'], {'original_string': '{bc}'}), 'xx', (['d', (['e'], {'original_string': '{e}'})], {'original_string': '{d{e}}'}), 'f'], {'original_string': '{aaaa{bc}xx{d{e}}f}'})
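
To get from this structure to the flat list asked for in the question, the nested result can be walked recursively. The sketch below runs on the repr above pasted as a plain Python literal, where each group is an (items, names) pair. It is a stand-in, not a real ParseResults, and the function name collect_original_strings is made up for this illustration. With the actual parse result, the equivalent walk would presumably test for pp.ParseResults instances and read item['original_string'] instead.

```python
# The repr above as a plain literal: every group is (items, names).
tree = (['aaaa', (['bc'], {'original_string': '{bc}'}), 'xx',
         (['d', (['e'], {'original_string': '{e}'})],
          {'original_string': '{d{e}}'}), 'f'],
        {'original_string': '{aaaa{bc}xx{d{e}}f}'})


def collect_original_strings(node):
    """Pre-order walk: a group's own string first, then its sub-groups."""
    items, names = node
    found = [names['original_string']]
    for item in items:
        if isinstance(item, tuple):   # nested group; plain strings are skipped
            found.extend(collect_original_strings(item))
    return found


print(collect_original_strings(tree))
# ['{aaaa{bc}xx{d{e}}f}', '{bc}', '{d{e}}', '{e}']
```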

Parse a nested list and return the original string for each valid list
