Question

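I am trying to use pyparsing to parse Robot Framework, which is a text-based DSL. The syntax is as follows (sorry, but I find it a bit hard to describe in BNF). A single line in Robot Framework may look like this: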

Library\tSSHClient    with name\tnode

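\t is a tab, and Robot Framework transparently converts it to 2 spaces. (In fact, it just calls str.replace('\t', '  ') to replace the tab, but this changes the length of each line: len('\t') is 1, while len('  ') is 2.) In Robot Framework, 2 or more spaces and '\t' are used to split tokens; if there is only 1 space between words, those words are treated as a single grouped token.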

Library\tSSHClient    with name\tnode

If parsed correctly, this line is split into the following tokens:

 ['Library', 'SSHClient', 'with name', 'node']

Since there is only 1 space between "with" and "name", the parser treats them as one grouped token.
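For reference, the splitting rule described above can be sketched with the standard library alone. This is only an illustration of the rule as stated in the question, not Robot Framework's actual implementation; the helper name `split_robot_line` is made up for this example.

```python
import re

def split_robot_line(line):
    """Split a line using the rule described in the question (hypothetical helper)."""
    # Tabs become two spaces first, as the question says Robot Framework does.
    normalized = line.replace("\t", "  ")
    # Runs of two or more spaces separate tokens; a single space stays inside a token.
    return re.split(r" {2,}", normalized)

tokens = split_robot_line("Library\tSSHClient    with name\tnode")
print(tokens)  # ['Library', 'SSHClient', 'with name', 'node']
```

Note that this gives the tokens but not their positions in the original line, which is what the questions below are about.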

Here is my code:

from pyparsing import *

ParserElement.setDefaultWhitespaceChars('\r\n\t ')
source = "Library\tSSHClient    with name\tnode"
EACH_LINE = Optional(Word(" ")).leaveWhitespace().suppress() + \
            CaselessKeyword("library").suppress() + \
            OneOrMore((Word(alphas)) + White(max=1).setResultsName('myValue')) +\
            SkipTo(LineEnd())

res = EACH_LINE.parseString(source)
print(res.myValue)

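Questions:

1) I have already set the whitespace characters. If I want to match exactly 2 or more spaces or one or more tabs, I thought the code would be White(ws=' ', min=2) | White(ws='\t', min=1), but this fails. So is there no way to specify the whitespace value?

2) Is there a way to get the index of a matched result? I tried setParseAction, but it seems I cannot get the index through this callback. I need both the start and end index to highlight the word.

3) What do LineStart and LineEnd mean? I printed these values and they seem to be just normal strings. Do I have to write something at the front of a line, like LineStart() + balabala... + LineEnd()?

Thanks, but I cannot replace '\t' with '  '.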

from pyparsing import *

source = "Library\tsshclient\t\t\twith name    s1"

value = Combine(OneOrMore(Word(printables) | White(' ', max=1) + ~White()))  #here it seems the whitespace has already been set to ' ', why the result still match '\t'?

linedefn = OneOrMore(value)

res = linedefn.parseString(source)

print(res)

I got

['Library sshclient', 'with name', 's1']

but I expected ['Library', 'sshclient', 'with name', 's1']

Answer 1

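I always flinch when whitespace creeps into parsed tokens, but with your constraint that only single spaces are allowed, this should be workable. I used the following expression to define values that may contain embedded single spaces: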

# each value consists of printable words separated by at most a 
# single space (a space that is not followed by another space)
value = Combine(OneOrMore(Word(printables) | White(' ',max=1) + ~White()))

With that done, a line is just one or more of these values:

linedefn = OneOrMore(value)

Following your example, including the call to str.replace to replace tabs with pairs of spaces, the code looks like this:

data = "Library\tSSHClient    with name\tnode"

# replace tabs with 2 spaces
data = data.replace('\t', '  ')

print(linedefn.parseString(data))

This gives:

['Library', 'SSHClient', 'with name', 'node']

To get the start and end locations of any value in the original string, wrap the expression in the new pyparsing helper method locatedExpr:

# use new locatedExpr to get the value, start, and end location 
# for each value
linedefn = OneOrMore(locatedExpr(value))('values')

If we parse and dump the results:

print(linedefn.parseString(data).dump())

we get:

- values: 
  [0]:
    [0, 'Library', 7]
    - locn_end: 7
    - locn_start: 0
    - value: Library
  [1]:
    [9, 'SSHClient', 18]
    - locn_end: 18
    - locn_start: 9
    - value: SSHClient
  [2]:
    [22, 'with name', 31]
    - locn_end: 31
    - locn_start: 22
    - value: with name
  [3]:
    [33, 'node', 37]
    - locn_end: 37
    - locn_start: 33
    - value: node
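As a cross-check on offsets like these, the same kind of (start, value, end) triples can be computed with a plain regular expression. This swaps in the standard `re` module in place of locatedExpr, so it is a sketch under that assumption and not part of pyparsing; the names `TOKEN` and `token_spans` are made up for this example. Because no tab replacement is done, the offsets index into the original tab-containing string.

```python
import re

# A token is a run of non-blank characters, optionally continued by
# single spaces that are each followed by more non-blank characters.
TOKEN = re.compile(r"[^ \t]+(?: [^ \t]+)*")

def token_spans(line):
    """Return (start, value, end) for each token in line (hypothetical helper)."""
    return [(m.start(), m.group(), m.end()) for m in TOKEN.finditer(line)]

spans = token_spans("Library\tSSHClient    with name\tnode")
for span in spans:
    print(span)
# (0, 'Library', 7)
# (8, 'SSHClient', 17)
# (21, 'with name', 30)
# (31, 'node', 35)
```

The end offset is exclusive, so `line[start:end]` recovers the token, which is the form a highlighter usually wants.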

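LineStart and LineEnd are pyparsing expression classes whose instances should match at the start and end of a line. LineStart has always been difficult to work with, but LineEnd is fairly predictable. In your case, if you read and parse one line at a time, you should not need them; just define the contents you expect in the line. If you want to make sure the parser has processed the entire string (and has not stopped short because of a non-matching character), add + LineEnd() or + StringEnd() to the end of your parser, or add the argument parseAll=True to your call to parseString().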
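A small sketch of the difference parseAll=True makes, assuming pyparsing is installed. The toy grammar `words` is invented for this illustration and is unrelated to the Robot Framework grammar.

```python
from pyparsing import OneOrMore, ParseException, Word, alphas

# Toy grammar: one or more alphabetic words.
words = OneOrMore(Word(alphas))

# Without parseAll, parsing stops quietly at the first non-matching text.
partial = words.parseString("abc def 123").asList()
print(partial)  # ['abc', 'def']

# With parseAll=True, the unparsed "123" raises ParseException.
try:
    words.parseString("abc def 123", parseAll=True)
    rejected = False
except ParseException as err:
    rejected = True
    print("rejected:", err)
```

Appending + StringEnd() to the grammar is the other way, mentioned above, to require that the whole input be consumed.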

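EDIT:

It is easy to forget that pyparsing calls str.expandtabs by default; you have to disable this by calling parseWithTabs. Doing that, and explicitly disallowing TABs between the words of a value, resolves your problem and keeps the values at the correct character counts. See the changes below: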

from pyparsing import *
TAB = White('\t')

# each value consists of printable words separated by at most a 
# single space (a space that is not followed by another space)
value = Combine(OneOrMore(~TAB + (Word(printables) | White(' ',max=1) + ~White())))

# each line has one or more of these values
linedefn = OneOrMore(value)
# do not expand tabs before parsing
linedefn.parseWithTabs()


data = "Library\tSSHClient    with name\tnode"

# replace tabs with 2 spaces
#data = data.replace('\t', '  ')

print(linedefn.parseString(data))


linedefn = OneOrMore(locatedExpr(value))('values')
# do not expand tabs before parsing
linedefn.parseWithTabs()
print(linedefn.parseString(data).dump())
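The 'Library sshclient' result reported in the question is consistent with this tab expansion. The sketch below uses only the built-in str.expandtabs (no pyparsing) on the question's second input, assuming the default tab size of 8, to show that the first tab collapses to a single space while the string as a whole grows.

```python
source = "Library\tsshclient\t\t\twith name    s1"

# expandtabs pads each tab up to the next multiple of 8 columns.
expanded = source.expandtabs()

# "Library" is 7 characters wide, so the first tab becomes just 1 space,
# which the value expression then accepts as an embedded single space.
print(repr(expanded[:17]))          # 'Library sshclient'
print(len(source), len(expanded))   # 35 55
```

This also shows why offsets measured after expansion no longer line up with the original text, which is the reason for calling parseWithTabs when positions matter.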
