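Chunking a sentence on the word "but" using RegEx

Time: 2018-08-25 06:01:13

Tags: python regex nltk chunking

I am trying to use RegEx to chunk a sentence on the word "but" (or any other coordinating conjunction). It doesn't work...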

sentence = nltk.pos_tag(word_tokenize("There are no large collections present but there is spinal canal stenosis."))
result = nltk.RegexpParser(grammar).parse(sentence)
DigDug = nltk.RegexpParser(r'CHUNK: {.*<CC>.*}')
for subtree in DigDug.parse(sentence).subtrees(): 
    if subtree.label() == 'CHUNK': print(subtree.node())

I need to split the sentence "There are no large collections present but there is spinal canal stenosis." into two parts:

1. "There are no large collections present"
2. "there is spinal canal stenosis."

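I would also like the same code to split sentences on "and" and other coordinating conjunctions (CC). But my code doesn't work. Please help.

2 answers:

Answer 0: (score: 1)

I think you can do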

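import re
# re.split expects the raw string, not the POS-tagged list from the question
sentence = "There are no large collections present but there is spinal canal stenosis."
result = re.split(r"\s+(?:but|and)\s+", sentence)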

where

`\s`        Match a single character that is a "whitespace character" (spaces, tabs, line breaks, etc.)
`+`         Between one and unlimited times, as many times as possible, giving back as needed (greedy)
`(?:`       Match the regular expression below, do not capture
            Match either the regular expression below (attempting the next alternative only if this one fails)
  `but`     Match the characters "but" literally
  `|`       Or match regular expression number 2 below (the entire group fails if this one fails to match)
  `and`     Match the characters "and" literally
`)`         End of the non-capturing group
`\s`        Match a single character that is a "whitespace character" (spaces, tabs, line breaks, etc.)
`+`         Between one and unlimited times, as many times as possible, giving back as needed (greedy)

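You can add more conjunctions there, separated by the pipe character |. Take care, though, that these words do not contain characters that have a special meaning in regular expressions. If in doubt, escape them first with re.escape(word).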
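As a minimal sketch of that advice, the pattern can be built from a list of words, escaping each one with re.escape. The list CONJUNCTIONS and the helper split_on_conjunctions are hypothetical names introduced only for this illustration, and the case-insensitive flag is an added assumption, not part of the pattern above.

```python
import re

# Hypothetical list of conjunctions to split on; extend as needed.
CONJUNCTIONS = ["but", "and", "or"]

# re.escape guards against words that contain regex metacharacters.
pattern = re.compile(
    r"\s+(?:" + "|".join(re.escape(word) for word in CONJUNCTIONS) + r")\s+",
    flags=re.IGNORECASE,  # assumption: "But" should split the same way as "but"
)

def split_on_conjunctions(text):
    """Split text at any listed conjunction that is surrounded by whitespace."""
    return pattern.split(text)

parts = split_on_conjunctions(
    "There are no large collections present but there is spinal canal stenosis."
)
print(parts)
# ['There are no large collections present', 'there is spinal canal stenosis.']
```

Because the pattern requires whitespace on both sides of the conjunction, a word such as "sandy" is not split on the "and" inside it.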

Answer 1: (score: 1)

If you want to avoid hard-coding conjunctions such as "but" and "and", try chinking together with chunking:


import nltk

Digdug = nltk.RegexpParser(r"""
CHUNK_AND_CHINK:
    {<.*>+}    # Chunk everything
    }<CC>+{    # Chink sequences of CC
""")
sentence = nltk.pos_tag(nltk.word_tokenize("There are no large collections present but there is spinal canal stenosis."))

result = Digdug.parse(sentence)

for subtree in result.subtrees(filter=lambda t: t.label() == 'CHUNK_AND_CHINK'):
    print(subtree)

Chinking basically excludes what we don't need from a chunk phrase, in this case 'but'. For more details, see: http://www.nltk.org/book/ch07.html
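If only the two halves are needed as plain strings, the same idea (drop every run of CC tokens and keep what lies between them) can be sketched without the chunk parser. The (word, tag) pairs below are written by hand to stand in for nltk.pos_tag output, so the exact tags are an assumption, and split_on_cc is a hypothetical helper name.

```python
from itertools import groupby

# Hand-written (word, tag) pairs standing in for nltk.pos_tag output (assumed tags).
tagged = [
    ("There", "EX"), ("are", "VBP"), ("no", "DT"), ("large", "JJ"),
    ("collections", "NNS"), ("present", "JJ"), ("but", "CC"),
    ("there", "EX"), ("is", "VBZ"), ("spinal", "JJ"), ("canal", "NN"),
    ("stenosis", "NN"), (".", "."),
]

def split_on_cc(tagged_tokens):
    """Group consecutive tokens by whether they are CC, keeping only the non-CC runs."""
    parts = []
    for is_cc, group in groupby(tagged_tokens, key=lambda pair: pair[1] == "CC"):
        if not is_cc:
            parts.append(" ".join(word for word, _ in group))
    return parts

parts = split_on_cc(tagged)
print(parts)
```

Tokens are joined with single spaces, so punctuation such as the final period ends up separated from the preceding word; a detokenizer would be needed to restore the original spacing.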