Question

I have a file formatted as follows:

#
here
are
some
strings
#
and
some
others
 #
 with
 different
 levels
 #
 of
  #
  indentation
  #
 #
#

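So blocks are delimited by a starting # and a trailing #. However, the trailing # of the (n-1)-th block is also the starting # of the n-th block.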

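I am trying to write a function that, given this format, retrieves the content of each block, which may itself be nested recursively.

I started with regular expressions but gave up quite quickly (I think you can guess why), so I tried pyparsing. However, I can't simply write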

print(nestedExpr('#','#').parseString(my_string).asList())

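because it raises a ValueError exception (ValueError: opening and closing strings cannot be the same).

Given that I cannot change the input format, is there a better option than pyparsing for this format?

I also tried using this answer (https://stackoverflow.com/a/1652856/740316), replacing {/} with #/#, but it failed to parse the string.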

Answer 1

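Unfortunately (for you), your grouping depends not only on the separating '#' characters but also on the indentation level (otherwise, ['with','different','levels'] would be at the same level as the preceding group ['and','some','others']). Parsing indentation-sensitive grammars is not a strong suit of pyparsing: it can be done, but it is not pleasant. To do it, we will use the pyparsing helper macro indentedBlock, which also requires that we define a list variable that indentedBlock can use as its indentation stack.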

See the embedded comments in the code below for one way to approach this with pyparsing and indentedBlock:

from pyparsing import *

test = """\
#
here
are
some
strings
#
and
some
others
 #
 with
 different
 levels
 #
 of
  #
  indentation
  #
 #
#"""

# newlines are significant for line separators, so redefine 
# the default whitespace characters for whitespace skipping
ParserElement.setDefaultWhitespaceChars(' ')

NL = LineEnd().suppress()
HASH = '#'
HASH_SEP = Suppress(HASH + Optional(NL))

# a normal line contains a single word
word_line = Word(alphas) + NL


indent_stack = [1]

# word_block is recursive, since word_blocks can contain word_blocks
word_block = Forward()
word_group = Group(OneOrMore(word_line | ungroup(indentedBlock(word_block, indent_stack))) )

# now define a word_block, as a '#'-delimited list of word_groups, with 
# leading and trailing '#' characters
word_block <<= (HASH_SEP + 
                 delimitedList(word_group, delim=HASH_SEP) + 
                 HASH_SEP)

# the overall expression is one large word_block
parser = word_block

# parse the test string
parser.parseString(test).pprint()
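
Since the question also asks about options other than pyparsing, here is a minimal plain-Python sketch of the same idea, tracking indentation by hand. It is not part of the pyparsing solution above. The function name parse_blocks and its return shape (a block is a list of groups, and each group is a list of words and nested blocks) are hypothetical choices made for illustration. It assumes indentation uses spaces only, that every nested block starts with a '#' at a deeper indent, and that a '#' closes its block when the next line is less indented or the input ends.

```python
def parse_blocks(text):
    """Parse '#'-delimited, indentation-nested text into nested lists.

    Hypothetical helper: a block is a list of groups, and each group is a
    list whose items are words (str) or nested blocks (list).
    """
    # Pair every non-blank line with its indentation width (spaces only).
    lines = [(len(raw) - len(raw.lstrip(" ")), raw.strip())
             for raw in text.splitlines() if raw.strip()]
    pos = 0

    def parse_block(indent):
        nonlocal pos
        pos += 1  # consume the opening '#'
        groups, current = [], []
        while pos < len(lines):
            level, token = lines[pos]
            if level > indent:
                # A deeper line must open a nested block.
                if token != "#":
                    raise ValueError("expected '#' at entry %d" % pos)
                current.append(parse_block(level))
            elif level < indent:
                raise ValueError("block not closed before entry %d" % pos)
            elif token == "#":
                # A '#' at this level ends the current group.
                pos += 1
                groups.append(current)
                current = []
                # It also closes the block if nothing follows at this level.
                if pos == len(lines) or lines[pos][0] < indent:
                    return groups
            else:
                current.append(token)
                pos += 1
        raise ValueError("unterminated block")

    if not lines or lines[0][1] != "#":
        raise ValueError("input must start with '#'")
    return parse_block(lines[0][0])


sample = """\
#
here
are
some
strings
#
and
some
others
 #
 with
 different
 levels
 #
 of
  #
  indentation
  #
 #
#"""

print(parse_blocks(sample))
```

For the sample input, the top-level block comes back as two groups: ['here', 'are', 'some', 'strings'] and a second group that holds 'and', 'some', 'others' followed by the nested block. The sketch does no error recovery and does not handle tabs.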
