How do I extract the topic from a sentence?

Time: 2018-07-14 11:10:27

Tags: python regex parsing nltk

I want to reduce each question in a file of questions to its topic. Here is an example of the format, using two questions from the file:

question1 = 'Write short notes on the anatomy of the Circle of Willis including normal variants.'
question2 = 'Write short notes on the anatomy of the axis (C2 vertebra).'

From the questions above, I want to get the following topics:

topic1 = 'Circle of Willis including normal variants'
topic2 = 'axis (C2 vertebra)'

For this, I wrote the following snippet:

def extract_topic(message):
    message = re.search('Write short notes on the anatomy of the (.+?).', message)
    if message:
        return message.group(1)

Of course, the code above fails! What am I doing wrong? What is the simplest way to do this? Would NLTK make this easier?

3 answers:

Answer 0 (score: 2)

Try this:

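import re

def extract_topic(message):
    # Greedy (.*) consumes the rest of the line, then backtracks one character
    # so the final unescaped . can match the trailing period.
    match = re.search('Write short notes on the anatomy of the (.*).', message)
    if match:
        return match.group(1)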

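Answer 1 (score: 0)

  • Your regex has only one mistake: you forgot to escape the final ., and an unescaped . matches any character except a newline. Also, (.+?) is non-greedy, so it matches a single character and the trailing . then matches one more. That is why the original code returns only the first character of the topic.

The code below should work: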

import re

def extract_topic(message):
    # \. matches only a literal period, so the lazy group runs up to the first period.
    match = re.search(r'Write short notes on the anatomy of the (.+?)\.', message)
    if match:
        return match.group(1)
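
To make the difference concrete, here is a minimal sketch that runs the unescaped and the escaped pattern side by side on the two sample questions. The helper names extract_unescaped and extract_escaped and the PREFIX constant are made up for this illustration and do not appear in the original post.

```python
import re

# Fixed prefix from the question; it contains no regex metacharacters.
PREFIX = 'Write short notes on the anatomy of the '

def extract_unescaped(message):
    # Original pattern: the final . is a wildcard, so the lazy group
    # stops after a single character.
    match = re.search(PREFIX + '(.+?).', message)
    return match.group(1) if match else None

def extract_escaped(message):
    # Corrected pattern: \. matches only a literal period.
    match = re.search(PREFIX + r'(.+?)\.', message)
    return match.group(1) if match else None

question1 = 'Write short notes on the anatomy of the Circle of Willis including normal variants.'
question2 = 'Write short notes on the anatomy of the axis (C2 vertebra).'

print(extract_unescaped(question1))  # C
print(extract_escaped(question1))    # Circle of Willis including normal variants
print(extract_escaped(question2))    # axis (C2 vertebra)
```

Because the lazy group stops at the first period, a topic that itself contains a period would be cut short. If that can occur in the data, anchoring the pattern with \.$ is one possible adjustment.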

Answer 2 (score: 0)

If your data always has the same format as shown, a fairly simple solution is:

question1 = 'Write short notes on the anatomy of the Circle of Willis including normal variants.'
question2 = 'Write short notes on the anatomy of the axis (C2 vertebra).'

list_of_questions = [question1, question2]

topics = [question.split("Write short notes on the anatomy of the ")[1] for question in list_of_questions]

print(topics)
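
Splitting on the prefix alone leaves the trailing period in each topic, and indexing [1] raises an IndexError for any question that does not contain the prefix. A slightly more defensive sketch, assuming the prefix really is fixed, could look like this. The name extract_topic_plain and the PREFIX constant are hypothetical and chosen only for this example.

```python
PREFIX = 'Write short notes on the anatomy of the '

def extract_topic_plain(question):
    # Return None when the question does not start with the expected prefix.
    if not question.startswith(PREFIX):
        return None
    # Drop the prefix, then strip surrounding whitespace and trailing periods.
    return question[len(PREFIX):].strip().rstrip('.')

question1 = 'Write short notes on the anatomy of the Circle of Willis including normal variants.'
question2 = 'Write short notes on the anatomy of the axis (C2 vertebra).'

topics = [extract_topic_plain(q) for q in [question1, question2]]
print(topics)
```

This needs neither re nor NLTK, which is probably enough as long as every question follows the same template.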