Question

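I'm trying to write a street address parser with pyparsing. The sticking point has been capturing multiple words in the street name without greedily capturing the suffix (e.g. AVE, BLVD, ST). Here is what I have so far: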

import pyparsing as pp

DIRS = ['NORTH', 'N', 'SOUTH', 'S', 'EAST', 'E', 'WEST', 'W']
SUFFIXES = ['ST', 'AVE', 'BLVD', 'RD']

primary_num = pp.Word(pp.alphanums)
predir = pp.Optional(pp.oneOf(DIRS) + pp.Optional(pp.Suppress('.')))
suffix = pp.Optional(pp.oneOf(SUFFIXES) + pp.Optional(pp.Suppress('.')))
postdir = pp.Optional(pp.oneOf(DIRS) + pp.Optional(pp.Suppress('.')))
street_name = pp.OneOrMore(~suffix + ~postdir + pp.Word(pp.alphanums))
line_1 = primary_num + predir + street_name + suffix + postdir

If I run this against 123 GEORGE WASHINGTON AVE, I get an error:

pyparsing.ParseException: Found unwanted token, [{Re:('ST|AVE|BLVD|RD') [Suppress:(".")]}] (at char 4), (line:1, col:5)

The error sounds like the G in GEORGE is matching Re:('ST|AVE|BLVD|RD'). Does anyone know what is going on here?

Answer 1

First, add some debugging to your parser:

predir.setName("predir").setDebug()
street_name.setName("street_name").setDebug()

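You will see that you are in fact matching the optional predir. An Optional expression always succeeds, even when it matches nothing, so the negative lookaheads ~suffix and ~postdir in street_name can never pass while suffix and postdir are themselves Optional. Instead of building optionality into the definitions of predir, suffix, and postdir, mark them as optional in the definition of line_1: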

line_1 = (primary_num + pp.Optional(predir) + street_name +
          pp.Optional(suffix) + pp.Optional(postdir))
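To illustrate why the original grammar fails at char 4, here is a minimal sketch. The names optional_suffix, negated_optional, and negated_plain are made up for this illustration and are not part of the original grammar. It uses the same camelCase pyparsing calls as the question.

```python
import pyparsing as pp

suffix_words = pp.oneOf(['ST', 'AVE', 'BLVD', 'RD'])
optional_suffix = pp.Optional(suffix_words)

# An Optional succeeds even when there is nothing to match.
print(optional_suffix.parseString("GEORGE").asList())  # []

# Negating something that always succeeds therefore always fails.
negated_optional = ~optional_suffix + pp.Word(pp.alphanums)
try:
    negated_optional.parseString("GEORGE")
except pp.ParseException as err:
    print("negated Optional fails:", err)

# Negating the bare (non-optional) expression behaves as intended.
negated_plain = ~suffix_words + pp.Word(pp.alphanums)
print(negated_plain.parseString("GEORGE").asList())  # ['GEORGE']
```

So the G in GEORGE is not matching the suffix regex. The lookahead is rejecting the position because the Optional matched the empty string there.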

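You will also find that your definitions of the directions and suffixes should treat them as keywords, rather than using the Literal-style matching that oneOf does (otherwise 'W' matches the leading W of WASHINGTON). Use CaselessKeyword instead: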

predir = pp.MatchFirst(map(pp.CaselessKeyword,DIRS)) + pp.Optional(pp.Suppress('.'))
suffix = pp.MatchFirst(map(pp.CaselessKeyword,SUFFIXES)) + pp.Optional(pp.Suppress('.'))
postdir = pp.MatchFirst(map(pp.CaselessKeyword,DIRS)) + pp.Optional(pp.Suppress('.'))
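The difference between the two matching styles can be seen in a small sketch. The names literal_dir and keyword_dir are hypothetical and used only for this comparison.

```python
import pyparsing as pp

DIRS = ['NORTH', 'N', 'SOUTH', 'S', 'EAST', 'E', 'WEST', 'W']

literal_dir = pp.oneOf(DIRS)
keyword_dir = pp.MatchFirst([pp.CaselessKeyword(d) for d in DIRS])

# Literal-style matching accepts the leading W of a longer word.
print(literal_dir.parseString("WASHINGTON").asList())  # ['W']

# Keyword matching requires the word to end where the keyword ends.
try:
    keyword_dir.parseString("WASHINGTON")
except pp.ParseException:
    print("keyword_dir does not match WASHINGTON")

# A standalone direction still matches, regardless of case.
print(keyword_dir.parseString("w main").asList())  # ['W']
```

This is why a ~postdir lookahead built on oneOf would reject WASHINGTON as a street-name word, while the keyword version lets it through.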

Finally, consider grouping the multiple words of the street name so they are kept separate from any leading or trailing elements:

street_name = pp.Group(pp.OneOrMore(~suffix + ~postdir + pp.Word(pp.alphanums)))

With these changes, we get the following result:

['123', ['GEORGE', 'WASHINGTON'], 'AVE']
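For convenience, the pieces above can be assembled into one runnable sketch. The helper keyword_choice and the shared dot element are conveniences introduced here, not part of the original answer, and the second sample address is an extra test input chosen for illustration.

```python
import pyparsing as pp

DIRS = ['NORTH', 'N', 'SOUTH', 'S', 'EAST', 'E', 'WEST', 'W']
SUFFIXES = ['ST', 'AVE', 'BLVD', 'RD']

def keyword_choice(words):
    """Match any one of the given words as a whole, case-insensitive keyword."""
    return pp.MatchFirst([pp.CaselessKeyword(w) for w in words])

# Optional trailing period after an abbreviation, dropped from the results.
dot = pp.Optional(pp.Suppress('.'))

primary_num = pp.Word(pp.alphanums)
predir = keyword_choice(DIRS) + dot
suffix = keyword_choice(SUFFIXES) + dot
postdir = keyword_choice(DIRS) + dot

# Collect words until the next token is a suffix or a direction.
street_name = pp.Group(
    pp.OneOrMore(~suffix + ~postdir + pp.Word(pp.alphanums))
)

# Optionality lives here, not in the element definitions.
line_1 = (primary_num + pp.Optional(predir) + street_name +
          pp.Optional(suffix) + pp.Optional(postdir))

print(line_1.parseString("123 GEORGE WASHINGTON AVE").asList())
# ['123', ['GEORGE', 'WASHINGTON'], 'AVE']
print(line_1.parseString("123 N. MAIN ST").asList())
# ['123', 'N', ['MAIN'], 'ST']
```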
