I need to create a NOT condition in NLTK's regexp parser as part of my grammar. I want to chunk words that have the structure 'Coffee & Tea', but the sequence should not be chunked if it is preceded by a word of type <IN>. For example, 'in London and Paris' should not be chunked by the parser.
My code is as follows:
grammar = r'''NP: {(^<IN>)<NNP>+<CC><NN.*>+}'''
I tried the grammar above to solve this problem, but it did not work. Can someone tell me what I am doing wrong?
Example:
import nltk

def parse_sentence(sentence):
    pos_sentence = nltk.pos_tag(nltk.word_tokenize(sentence))
    grammar = r'''NP: {<NNP>+<CC><NN.*>+}'''
    parser = nltk.RegexpParser(grammar)
    result = parser.parse(pos_sentence)
    print(result)
sentence1 = 'Who is the front man of the band that wrote Coffee & TV?'
parse_sentence(sentence1)
sentence2 = 'Who of those resting in Westminster Abbey wrote a book set in London and Paris?'
parse_sentence(sentence2)
Result for sentence 1 is:
(S
  Who/WP
  is/VBZ
  the/DT
  front/JJ
  man/NN
  of/IN
  the/DT
  band/NN
  that/WDT
  wrote/VBD
  (NP Coffee/NNP &/CC TV/NN)
  ?/.)
Result for sentence 2 is:
(S
  Who/WP
  of/IN
  those/DT
  resting/VBG
  in/IN
  Westminster/NNP
  Abbey/NNP
  wrote/VBD
  a/DT
  book/NN
  set/VBN
  in/IN
  (NP London/NNP and/CC Paris/NNP)
  ?/.)
As can be seen in both sentence 1 and sentence 2, the phrases 'Coffee & TV' and 'London and Paris' are chunked as groups, even though I do not want 'London and Paris' to be grouped. One way to achieve this would be to ignore patterns that are preceded by an <IN> POS tag.
In short, I need to know how to add a NOT (negation) condition on a POS tag in the regexp parser's grammar. The standard syntax of '^' followed by the tag definition does not seem to work.
Answer 0 (score: 3)
What you need is a "negative lookbehind" expression. Unfortunately, it does not work in the chunk parser, so I suspect what you want cannot be specified as a chunking regexp.
Here is an ordinary negative lookbehind: match "Paris", but not if it is preceded by "and".
>>> re.findall(r"(?<!and) Paris", "Search in London and Paris etc.")
[]
Unfortunately, the corresponding lookbehind chunking rule does not work. nltk's regexp engine rewrites the regexp you pass it in order to interpret the POS tags, and it gets confused by the lookbehind. (My guess is that the < character in the lookbehind syntax is misinterpreted as a tag delimiter.)
>>> parser = nltk.RegexpParser(r"NP: {(?<!<IN>)<NNP>+<CC><NN.*>+}")
...
ValueError: Illegal chunk pattern: {(?<!<IN>)<NNP>+<CC><NN.*>+}
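One possible workaround (not part of this answer, just a sketch) is to bypass the chunk parser entirely: encode the POS sequence as a string of <TAG> tokens, similar in shape to what NLTK builds internally, and run the negative lookbehind with plain `re`. The function name and span-mapping logic below are illustrative, not NLTK API.

```python
import re

def find_np_spans(tags):
    """Return (start, end) token spans of NNP+ CC NN.* runs not preceded by IN."""
    encoded = "".join("<%s>" % t for t in tags)
    # Character offset at which each token's "<TAG>" text begins.
    starts = []
    pos = 0
    for t in tags:
        starts.append(pos)
        pos += len(t) + 2  # tag length plus the surrounding angle brackets
    spans = []
    # Fixed-width negative lookbehind works fine in plain `re`.
    for m in re.finditer(r"(?<!<IN>)(?:<NNP>)+<CC><NN[^>]*>", encoded):
        i = starts.index(m.start())
        j = next(k for k in range(len(tags)) if starts[k] + len(tags[k]) + 2 == m.end())
        spans.append((i, j + 1))
    return spans

# 'wrote Coffee & TV' style sequence: chunked
print(find_np_spans(['VBD', 'NNP', 'CC', 'NN']))   # [(1, 4)]
# 'in London and Paris' style sequence: rejected by the lookbehind
print(find_np_spans(['IN', 'NNP', 'CC', 'NNP']))   # []
```

The spans can then be mapped back onto the tagged tokens to build chunks by hand.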
Answer 1 (score: 1)
NLTK's documentation on chunking with tags is a bit confusing and hard to come by, so I went through a lot of effort to accomplish something similar.
Check the following links:
Following @Luda's answer, I found a simple solution:
Example (using the question from @Ram G Athreya):
import nltk

def parse_sentence(sentence):
    pos_sentence = nltk.pos_tag(nltk.word_tokenize(sentence))
    grammar = r'''
    NP: {<IN>*<NNP>+<CC><NN.*>+}
        }<IN><NNP>+<CC><NN.*>+{
    '''
    parser = nltk.RegexpParser(grammar)
    result = parser.parse(pos_sentence)
    print(result)
sentence1 = 'Who is the front man of the band that wrote Coffee & TV?'
parse_sentence(sentence1)
sentence2 = 'Who of those resting in Westminster Abbey wrote a book set in London and Paris?'
parse_sentence(sentence2)
(S
  Who/WP
  is/VBZ
  the/DT
  front/JJ
  man/NN
  of/IN
  the/DT
  band/NN
  that/WDT
  wrote/VBD
  (NP Coffee/NNP &/CC TV/NN)
  ?/.)
(S
  Who/WP
  of/IN
  those/DT
  resting/VBG
  in/IN
  Westminster/NNP
  Abbey/NNP
  wrote/VBD
  a/DT
  book/NN
  set/VBN
  in/IN
  London/NNP
  and/CC
  Paris/NNP
  ?/.)
Now 'Coffee & TV' is chunked, but 'London and Paris' is not.
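The mechanics of this chunk-then-chink trick can be sketched with plain `re` over an encoded tag string similar to what NLTK builds internally (the helper below is illustrative, not NLTK API): first match every run including an optional leading <IN>, then discard any match that actually begins with it.

```python
import re

def chunk_then_chink(tags):
    """Chunk <IN>*<NNP>+<CC><NN.*> runs, then chink those that begin with IN."""
    encoded = "".join("<%s>" % t for t in tags)
    chunk = re.compile(r"(?:<IN>)*(?:<NNP>)+<CC><NN[^>]*>")
    kept = []
    for m in chunk.finditer(encoded):
        # The chink rule: a chunk whose first tag is IN is un-chunked entirely.
        if not m.group().startswith("<IN>"):
            kept.append(m.group())
    return kept

# 'wrote Coffee & TV' style sequence survives as a chunk
print(chunk_then_chink(['VBD', 'NNP', 'CC', 'NN']))   # ['<NNP><CC><NN>']
# 'in London and Paris' style sequence is chinked away
print(chunk_then_chink(['IN', 'NNP', 'CC', 'NNP']))   # []
```

The key point is that the chunk rule deliberately swallows the leading <IN> so the chink rule can recognize and remove the whole sequence.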
Also, this is very useful for building lookbehind assertions, which in regexps are usually written as ?<=, but that would clash with the < and > symbols used in the chunk grammar's tag syntax.
So, to build a lookbehind effect, we can try the following:
Example 2 – chunk the word that follows every <IN>
import nltk

def parse_sentence(sentence):
    pos_sentence = nltk.pos_tag(nltk.word_tokenize(sentence))
    grammar = r'''
    CHUNK: {<IN>+<.*>}
           }<IN>{
    '''
    parser = nltk.RegexpParser(grammar)
    result = parser.parse(pos_sentence)
    print(result)
sentence1 = 'Who is the front man of the band that wrote Coffee & TV?'
parse_sentence(sentence1)
sentence2 = 'Who of those resting in Westminster Abbey wrote a book set in London and Paris?'
parse_sentence(sentence2)
(S
  Who/WP
  is/VBZ
  the/DT
  front/JJ
  man/NN
  of/IN
  (CHUNK the/DT)
  band/NN
  that/WDT
  wrote/VBD
  Coffee/NNP
  &/CC
  TV/NN
  ?/.)
(S
  Who/WP
  of/IN
  (CHUNK those/DT)
  resting/VBG
  in/IN
  (CHUNK Westminster/NNP)
  Abbey/NNP
  wrote/VBD
  a/DT
  book/NN
  set/VBN
  in/IN
  (CHUNK London/NNP)
  and/CC
  Paris/NNP
  ?/.)
As we can see, this chunked 'the' in sentence 1, and 'those', 'Westminster' and 'London' in sentence 2.
Answer 2 (score: 0)
"We can define a chink as a sequence of tokens that is not included in a chunk"
http://www.nltk.org/book/ch07.html
See the inverted curly braces for exclusion:
grammar = r"""
NP:
    {<.*>+}         # Chunk everything
    }<VBD|IN>+{     # Chink sequences of VBD and IN
"""
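What this chink rule does can be sketched with plain `re` over an illustrative <TAG> string instead of the NLTK parser: start with one chunk covering all tokens, then split it apart wherever a run of VBD or IN tags occurs. The helper name and encoding are assumptions for the sketch.

```python
import re

def chink_split(tags):
    """Split one all-covering chunk at every run of VBD/IN tags."""
    encoded = "".join("<%s>" % t for t in tags)
    # Removing the chinked runs leaves the surviving chunks as the pieces.
    return [piece for piece in re.split(r"(?:<(?:VBD|IN)>)+", encoded) if piece]

# 'the book wrote in London' style tag sequence
print(chink_split(['DT', 'NN', 'VBD', 'IN', 'NNP']))  # ['<DT><NN>', '<NNP>']
```

Each surviving piece corresponds to one NP chunk in the parser's output.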