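I am experimenting with NLTK grammars and parsing algorithms because they look very simple to use. However, I cannot find a way to match an alphanumeric string properly, something like: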
import nltk
grammar = nltk.parse_cfg ("""
# Is this possible?
TEXT -> \w*
""")
parser = nltk.RecursiveDescentParser(grammar)
print parser.parse("foo")
Is there an easy way to achieve this?
Answer 0 (score: 2)
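It would be very difficult to do cleanly. The base parser classes rely on exact matches against the right-hand side (RHS) of a production to pop content, so you would have to subclass and rewrite large parts of the parser class. I attempted it a while ago with the feature grammar class and gave up.
What I did instead is more of a hack: I first extract the regex matches from the text and add them to the grammar as productions. It will be slow with a large grammar, because the grammar and the parser have to be rebuilt on every call.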
import re
import nltk
from nltk.grammar import Nonterminal, Production, ContextFreeGrammar

grammar = nltk.parse_cfg("""
S -> TEXT
TEXT -> WORD | WORD TEXT | NUMBER | NUMBER TEXT
""")
productions = grammar.productions()

def literal_production(key, rhs):
    """ Return a production <key> -> rhs
    :param key: symbol for lhs
    :param rhs: string literal
    """
    lhs = Nonterminal(key)
    return Production(lhs, [rhs])

def parse(text):
    """ Parse some text.
    """
    # Extract new words and numbers
    words = set([match.group(0) for match in re.finditer(r"[a-zA-Z]+", text)])
    numbers = set([match.group(0) for match in re.finditer(r"\d+", text)])
    # Make a local copy of productions
    lproductions = list(productions)
    # Add a production for every word and number
    lproductions.extend([literal_production("WORD", word) for word in words])
    lproductions.extend([literal_production("NUMBER", number) for number in numbers])
    # Make a local copy of the grammar with extra productions
    lgrammar = ContextFreeGrammar(grammar.start(), lproductions)
    # Load grammar into a parser
    parser = nltk.RecursiveDescentParser(lgrammar)
    tokens = text.split()
    return parser.parse(tokens)

print parse("foo hello world 123 foo")
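The extraction step can also be looked at in isolation. The following is only a minimal sketch in Python 3 that uses the standard library alone and does not call NLTK: it takes the same two regular expressions and produces the extra rules as plain grammar text. The names `BASE_RULES`, `TERMINAL_PATTERNS`, `literal_rules` and `grammar_text` are hypothetical and were introduced for this illustration.

```python
import re

# Base rules copied from the grammar used in the answer.
BASE_RULES = """\
S -> TEXT
TEXT -> WORD | WORD TEXT | NUMBER | NUMBER TEXT
"""

# Hypothetical table of (nonterminal, pattern) pairs; the patterns are the
# same two regular expressions the answer uses.
TERMINAL_PATTERNS = [
    ("WORD", re.compile(r"[a-zA-Z]+")),
    ("NUMBER", re.compile(r"\d+")),
]


def literal_rules(text):
    """Return sorted rule strings such as WORD -> 'foo', one per distinct match."""
    rules = set()
    for symbol, pattern in TERMINAL_PATTERNS:
        for match in pattern.finditer(text):
            # Matches contain only letters or digits, so no quoting issues arise.
            rules.add("%s -> '%s'" % (symbol, match.group(0)))
    return sorted(rules)


def grammar_text(text):
    """Append the per-input literal rules to the fixed base rules."""
    return BASE_RULES + "\n".join(literal_rules(text)) + "\n"


if __name__ == "__main__":
    print(grammar_text("foo hello world 123 foo"))
```

Assuming NLTK 3, the resulting string should be usable with `nltk.CFG.fromstring`, although that is not exercised here. The cost noted in the answer still applies: the grammar text depends on the input, so it has to be regenerated for every call.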
There is more background in this discussion on the nltk-users Google group: https://groups.google.com/d/topic/nltk-users/4nC6J7DJcOc/discussion