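I'm trying to use the NLTK package to capture the following chunks in sentences:
verb + smth + noun
or possibly
verb + smth + noun + and + noun
I honestly spent a whole day messing around with regexes and still haven't produced anything that works...
I've been looking at this tutorial, but it didn't help much.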
Answer 0 (score: 2)
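When you know roughly what can occur in between, there is a relatively simple way to do this with NLTK's CFG. It is certainly not the most efficient way. For a comprehensive treatment, see chapter 8 of the NLTK book.
You mentioned two patterns: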
<verb> ... <noun>
<verb> ... <noun> "and" <noun>
We need to put together lists of the VPs and NPs, plus the range of words that may occur between them. As a silly little example:
import nltk

grammar = nltk.CFG.fromstring("""
% start S
S -> VP SOMETHING NP
VP -> V
SOMETHING -> WORDS SOMETHING
SOMETHING ->
NP -> N 'and' N
NP -> N
V -> 'told' | 'scolded' | 'loved' | 'respected' | 'nominated' | 'rescued' | 'included'
N -> 'this' | 'us' | 'them' | 'you' | 'I' | 'me' | 'him' | 'her'
WORDS -> 'among' | 'others' | 'not' | 'all' | 'of' | 'uhm' | '...' | 'let' | 'finish' | 'certainly' | 'maybe' | 'even' | 'me'
""")
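To make it clearer which strings this grammar accepts, here is a rough plain-Python sketch of the same check without NLTK. It assumes the input is already split into tokens; the function name `matches` and the set names are made up for this illustration and are not part of NLTK.

```python
# Word lists copied from the V, N and WORDS rules of the grammar.
V = {'told', 'scolded', 'loved', 'respected', 'nominated', 'rescued', 'included'}
N = {'this', 'us', 'them', 'you', 'I', 'me', 'him', 'her'}
WORDS = {'among', 'others', 'not', 'all', 'of', 'uhm', '...', 'let',
         'finish', 'certainly', 'maybe', 'even', 'me'}


def matches(tokens):
    """Return True if tokens look like: V WORDS* N  or  V WORDS* N 'and' N."""
    if not tokens or tokens[0] not in V:
        return False
    rest = tokens[1:]
    # Try both noun-phrase shapes: a single N (1 token) or "N and N" (3 tokens).
    for np_len in (1, 3):
        if len(rest) < np_len:
            continue
        split = len(rest) - np_len
        middle, np = rest[:split], rest[split:]
        if np_len == 1:
            np_ok = np[0] in N
        else:
            np_ok = np[0] in N and np[1] == 'and' and np[2] in N
        # Everything between the verb and the noun phrase must be a filler word.
        if np_ok and all(word in WORDS for word in middle):
            return True
    return False


print(matches('rescued all of us'.split()))
print(matches('rescued me and somebody else'.split()))
```

The CFG version is still the better choice once the patterns start to nest, because new rules can be added without rewriting the matching logic.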
Now suppose this is the list of sentences we want to run through the filter:
sentences = [
    'scolded me and you',
    'included certainly uhm maybe even her and I',
    'loved me and maybe many others',
    'nominated others not even him',
    'told certainly among others uhm let me finish ... us and them',
    'rescued all of us',
    'rescued me and somebody else',
]
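As you can see, the third and the last phrases should not pass the filter: they contain words ('many', 'somebody', 'else') that the grammar does not cover. We can check that the rest match the pattern: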
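def sentence_filter(sent, grammar):
    """Print whether the token list `sent` can be parsed with `grammar`."""
    rd_parser = nltk.RecursiveDescentParser(grammar)
    try:
        # True as soon as one parse tree is found, False if there are none
        matched = any(True for _ in rd_parser.parse(sent))
    except ValueError:
        # raised when a token is not covered by the grammar
        matched = False
    print("SUCCESS!" if matched else "Doesn't match the filter...")

for s in sentences:
    sentence_filter(s.split(), grammar)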
Running this gives the following result:
>>>
SUCCESS!
SUCCESS!
Doesn't match the filter...
SUCCESS!
SUCCESS!
SUCCESS!
Doesn't match the filter...
>>>
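If you want to actually keep the matching sentences rather than just print a message, something along these lines might work. This is only a sketch: `keep_matching` is a made-up helper name, and the grammar is a trimmed copy of the one above with just enough words for the demo. A sentence whose words are all known but which has no parse is treated the same way as a sentence with unknown words.

```python
import nltk

# Trimmed copy of the grammar above (assumption: just enough words for the demo).
grammar = nltk.CFG.fromstring("""
% start S
S -> VP SOMETHING NP
VP -> V
SOMETHING -> WORDS SOMETHING
SOMETHING ->
NP -> N 'and' N
NP -> N
V -> 'scolded' | 'rescued'
N -> 'us' | 'you' | 'me'
WORDS -> 'all' | 'of' | 'maybe'
""")


def keep_matching(sentences, grammar):
    """Return the sentences that have at least one parse (hypothetical helper)."""
    parser = nltk.RecursiveDescentParser(grammar)
    kept = []
    for sentence in sentences:
        try:
            # next() pulls at most one tree; None means no parse exists.
            has_parse = next(iter(parser.parse(sentence.split())), None) is not None
        except ValueError:
            # Raised when a token is not covered by the grammar.
            has_parse = False
        if has_parse:
            kept.append(sentence)
    return kept


demo = [
    'scolded me and you',
    'rescued all of us',
    'rescued me and somebody else',  # unknown words
    'scolded maybe',                 # known words, but no noun phrase
]
print(keep_matching(demo, grammar))
```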