
Question

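I need to create a "not" condition as part of my grammar in NLTK's regexp parser. I want to chunk words that have a structure like 'Coffee & Tea', but the sequence should not be chunked if it is preceded by a word tagged <IN>. For example, 'in London and Paris' should not be chunked by the parser.

My code is as follows: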

grammar = r'''NP: {(^<IN>)<NNP>+<CC><NN.*>+}'''

I tried the grammar above to solve this, but it does not work. Can someone tell me what I am doing wrong?

Example:

import nltk

def parse_sentence(sentence):
    pos_sentence = nltk.pos_tag(nltk.word_tokenize(sentence))
    grammar = r'''NP: {<NNP>+<CC><NN.*>+}'''
    parser = nltk.RegexpParser(grammar)
    result = parser.parse(pos_sentence)
    print(result)

sentence1 = 'Who is the front man of the band that wrote Coffee & TV?'
parse_sentence(sentence1)

sentence2 = 'Who of those resting in Westminster Abbey wrote a book set in London and Paris?'
parse_sentence(sentence2)

Result for sentence 1 is:
(S
  Who/WP
  is/VBZ
  the/DT
  front/JJ
  man/NN
  of/IN
  the/DT
  band/NN
  that/WDT
  wrote/VBD
  (NP Coffee/NNP &/CC TV/NN)
  ?/.)

Result for sentence 2 is:
(S
  Who/WP
  of/IN
  those/DT
  resting/VBG
  in/IN
  Westminster/NNP
  Abbey/NNP
  wrote/VBD
  a/DT
  book/NN
  set/VBN
  in/IN
  (NP London/NNP and/CC Paris/NNP)
  ?/.)

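As can be seen, in both sentence 1 and sentence 2 the phrases Coffee & TV and London and Paris are chunked as a group, although I do not want London and Paris to be grouped. One way to do this would be to ignore patterns that are preceded by an <IN> POS tag.

In a nutshell, I need to know how to add a NOT (negation) condition for POS tags in the regexp parser's grammar. The standard syntax of '^' followed by the tag definition does not seem to work.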

Answer 1

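What you need is a "negative lookbehind" expression. Unfortunately, it does not work in the chunk parser, so I suspect that what you want cannot be specified as a chunking regexp.

Here is an ordinary negative lookbehind: match "Paris", but not if it is preceded by "and".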

>>> re.findall(r"(?<!and) Paris", "Search in London and Paris etc.")
[]

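Unfortunately, the corresponding lookbehind chunking rule does not work. nltk's regexp engine tweaks the regexp you pass it in order to interpret the POS types, and it gets confused by lookbehinds. (I am guessing that the < character in the lookbehind syntax is misinterpreted as a tag delimiter.)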

>>> parser = nltk.RegexpParser(r"NP: {(?<!<IN>)<NNP>+<CC><NN.*>+}")
...
ValueError: Illegal chunk pattern: {(?<!<IN>)<NNP>+<CC><NN.*>+}
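
Since the lookbehind is rejected only by the chunk grammar, one possible workaround is to skip RegexpParser for this rule and run Python's own `re` module over a string of tags. The following is a minimal sketch under that assumption, not an NLTK API: `tag_string`, `find_spans`, and `PATTERN` are hypothetical names, and the tagged tokens are written by hand from the outputs shown in the question.

```python
import re

def tag_string(tagged):
    """Encode the POS tags as '<TAG>' pieces, e.g. '<IN><NNP><CC><NNP>'."""
    return "".join(f"<{tag}>" for _, tag in tagged)

# Hypothetical pattern: NNP+ CC NN.*+ that is not preceded by IN.
# The second lookbehind stops the match from simply starting one NNP later
# (otherwise '<IN><NNP><NNP><CC><NN>' would still match from the 2nd NNP).
PATTERN = re.compile(r"(?<!<IN>)(?<!<NNP>)(?:<NNP>)+<CC>(?:<NN[^<>]*>)+")

def find_spans(tagged):
    """Return the words of every match of PATTERN in a tagged sentence."""
    # Map character offsets in the tag string back to token indices.
    offset_to_index = {}
    pos = 0
    for i, (_, tag) in enumerate(tagged):
        offset_to_index[pos] = i
        pos += len(tag) + 2          # '<' + tag + '>'
    offset_to_index[pos] = len(tagged)

    spans = []
    for m in PATTERN.finditer(tag_string(tagged)):
        start, end = offset_to_index[m.start()], offset_to_index[m.end()]
        spans.append([word for word, _ in tagged[start:end]])
    return spans

coffee = [("wrote", "VBD"), ("Coffee", "NNP"), ("&", "CC"), ("TV", "NN"), ("?", ".")]
london = [("set", "VBN"), ("in", "IN"), ("London", "NNP"), ("and", "CC"),
          ("Paris", "NNP"), ("?", ".")]

print(find_spans(coffee))   # [['Coffee', '&', 'TV']]
print(find_spans(london))   # []
```

This gives up the tree output of RegexpParser, so it only fits cases where a list of matching spans is enough.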

Answer 2

The NLTK documentation for tag chunking is somewhat confusing and not easy to find, so I struggled a lot to accomplish something similar.


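Following @Luda's answer, I found a simple solution:

1. Chunk what you want, with <IN>* in front: <IN>*<other tags>. This creates chunks that start with zero or more words tagged <IN>.
2. Chink <IN><other tags> from the previous chunk expression. This removes every chunk that starts with an <IN>-tagged word. (Note that the asterisk has been dropped.)

Example (using @Ram G Athreya's question):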

import nltk

def parse_sentence(sentence):
    pos_sentence = nltk.pos_tag(nltk.word_tokenize(sentence))

    grammar = r'''
        NP: {<IN>*<NNP>+<CC><NN.*>+}    # chunk, allowing leading IN words
            }<IN><NNP>+<CC><NN.*>+{     # chink chunks that start with IN
        '''
    parser = nltk.RegexpParser(grammar)
    result = parser.parse(pos_sentence)
    print(result)

sentence1 = 'Who is the front man of the band that wrote Coffee & TV?'
parse_sentence(sentence1)

sentence2 = 'Who of those resting in Westminster Abbey wrote a book set in London and Paris?'
parse_sentence(sentence2)


(S
  Who/WP
  is/VBZ
  the/DT
  front/JJ
  man/NN
  of/IN
  the/DT
  band/NN
  that/WDT
  wrote/VBD
  (NP Coffee/NNP &/CC TV/NN)
  ?/.)
(S
  Who/WP
  of/IN
  those/DT
  resting/VBG
  in/IN
  Westminster/NNP
  Abbey/NNP
  wrote/VBD
  a/DT
  book/NN
  set/VBN
  in/IN
  London/NNP
  and/CC
  Paris/NNP
  ?/.)

Now "Coffee & TV" is chunked, but "London and Paris" is not.
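
The chunk-then-chink grammar can also be checked without a tokenizer or tagger by feeding RegexpParser hand-tagged tokens. The sketch below assumes the tags shown in the outputs above; `GRAMMAR` and `np_chunks` are hypothetical names introduced for illustration.

```python
import nltk

GRAMMAR = r'''
    NP: {<IN>*<NNP>+<CC><NN.*>+}    # chunk, optionally swallowing leading IN words
        }<IN><NNP>+<CC><NN.*>+{     # chink every chunk that starts with an IN word
'''

# Hand-tagged tokens copied from the outputs above (no tagger needed).
coffee = [("wrote", "VBD"), ("Coffee", "NNP"), ("&", "CC"), ("TV", "NN"), ("?", ".")]
london = [("set", "VBN"), ("in", "IN"), ("London", "NNP"), ("and", "CC"),
          ("Paris", "NNP"), ("?", ".")]

def np_chunks(tagged):
    """Return the words of each NP chunk found in a tagged sentence."""
    tree = nltk.RegexpParser(GRAMMAR).parse(tagged)
    return [[word for word, _ in sub.leaves()]
            for sub in tree.subtrees(lambda t: t.label() == "NP")]

print(np_chunks(coffee))   # [['Coffee', '&', 'TV']]
print(np_chunks(london))   # []
```

In the second sentence the chunk rule first groups "in London and Paris"; the chink rule then removes that whole span, which leaves no NP behind.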

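Moreover, this is useful for building lookbehind assertions. In regular expressions these are usually written ?<=, but that conflicts with the < and > symbols used in the chunk tag grammar.

So, to get the effect of a lookbehind, we can try the following:

1. Chunk what you want, including the <IN> tags at the beginning, followed by the other tags you want. This creates chunks that start with one or more words tagged <IN>.
2. Chink the <IN> tags from the previous chunk expression. This removes all <IN>-tagged words from the chunks.

Example 2 - Chunk every word that follows an <IN>-tagged word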

import nltk

def parse_sentence(sentence):
    pos_sentence = nltk.pos_tag(nltk.word_tokenize(sentence))

    grammar = r'''
        CHUNK: {<IN>+<.*>}
               }<IN>{
        '''
    parser = nltk.RegexpParser(grammar)
    result = parser.parse(pos_sentence)
    print(result)

sentence1 = 'Who is the front man of the band that wrote Coffee & TV?'
parse_sentence(sentence1)

sentence2 = 'Who of those resting in Westminster Abbey wrote a book set in London and Paris?'
parse_sentence(sentence2)

(S
  Who/WP
  is/VBZ
  the/DT
  front/JJ
  man/NN
  of/IN
  (CHUNK the/DT)
  band/NN
  that/WDT
  wrote/VBD
  Coffee/NNP
  &/CC
  TV/NN
  ?/.)
(S
  Who/WP
  of/IN
  (CHUNK those/DT)
  resting/VBG
  in/IN
  (CHUNK Westminster/NNP)
  Abbey/NNP
  wrote/VBD
  a/DT
  book/NN
  set/VBN
  in/IN
  (CHUNK London/NNP)
  and/CC
  Paris/NNP
  ?/.)

As we can see, it chunked "the" from sentence 1, and "those", "Westminster", and "London" from sentence 2.

Answer 3

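Section 2.5, "Chinking"

"We can define a chink to be a sequence of tokens that is not included in a chunk"

http://www.nltk.org/book/ch07.html

See the reversed curly braces, which mark an exclusion: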

grammar = r"""
  NP:
    {<.*>+}          # Chunk everything
    }<VBD|IN>+{      # Chink sequences of VBD and IN
"""

Not condition in NLTK Regex Parser
