Question

I want to reduce the questions in a question file to just their topics. Here is an example of the format, using two questions from the file:

question1 = 'Write short notes on the anatomy of the Circle of Willis including normal variants.'
question2 = 'Write short notes on the anatomy of the axis (C2 vertebra).'

From the questions above, I want to get the following topics:

topic1 = 'Circle of Willis including normal variants'
topic2 = 'axis (C2 vertebra)'

For this, I wrote the following snippet:

def extract_topic(message):
    message = re.search('Write short notes on the anatomy of the (.+?).', message)
    if message:
        return message.group(1)

Of course, the code above fails! What am I doing wrong? What is the simplest way to do this? Would using NLTK make this any easier?

Answer 1

Try this:

import re

def extract_topic(message):
    # Greedy (.*) takes the rest of the line; the final . then matches the last character.
    match = re.search('Write short notes on the anatomy of the (.*).', message)
    if match:
        return match.group(1)

Answer 2

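Your regex has only one mistake: you forgot to escape the final ., because an unescaped . matches any character except a newline. In addition, (.+?) is non-greedy, so it matches just one character, and the . after it then matches one more character.

The code below should work: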

import re

def extract_topic(message):
    # Raw string so that \. reaches the regex engine as an escaped, literal period.
    match = re.search(r'Write short notes on the anatomy of the (.+?)\.', message)
    if match:
        return match.group(1)
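
To see the difference between the patterns discussed in this thread, here is a small sketch that runs each of them on one of the sample questions. The names PREFIX and the three result variables are made up for illustration; the patterns themselves are the ones from the question and the answers.

```python
import re

# Hypothetical helper constant: the fixed lead-in shared by the sample questions.
PREFIX = 'Write short notes on the anatomy of the '
question = PREFIX + 'axis (C2 vertebra).'

# Lazy group + unescaped dot (the question's pattern):
# the group stops after one character and the dot eats the next one.
lazy_unescaped = re.search(PREFIX + r'(.+?).', question).group(1)

# Lazy group + escaped dot: the group runs up to the first literal period.
lazy_escaped = re.search(PREFIX + r'(.+?)\.', question).group(1)

# Greedy group + unescaped dot on a question WITHOUT a trailing period:
# the dot swallows the last character of the topic instead.
greedy_no_period = re.search(PREFIX + r'(.*).', PREFIX + 'axis (C2 vertebra)').group(1)

print(repr(lazy_unescaped))    # 'a'
print(repr(lazy_escaped))      # 'axis (C2 vertebra)'
print(repr(greedy_no_period))  # 'axis (C2 vertebra'
```

Note that the escaped, lazy pattern stops at the first period, so a topic that itself contains a period would be cut short; whether that matters depends on your data.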

Answer 3

If your data always has the same format as shown, a fairly simple solution is:

question1 = 'Write short notes on the anatomy of the Circle of Willis including normal variants.'
question2 = 'Write short notes on the anatomy of the axis (C2 vertebra).'

list_of_questions = [question1, question2]

topics = [question.split("Write short notes on the anatomy of the ")[1].rstrip(".") for question in list_of_questions]

print(topics)
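
One caveat with the split approach: indexing [1] raises an IndexError for any question that does not contain the fixed lead-in. A slightly more defensive sketch, assuming every question of interest starts with the same lead-in, could look like this. PREFIX and the third sample question are made up for illustration.

```python
# Hypothetical constant: the fixed lead-in shared by the sample questions.
PREFIX = 'Write short notes on the anatomy of the '

def extract_topic(question):
    """Return the topic, or None when the question does not start with PREFIX."""
    if not question.startswith(PREFIX):
        return None
    # Drop the lead-in, then strip any trailing period(s).
    return question[len(PREFIX):].rstrip('.')

questions = [
    'Write short notes on the anatomy of the Circle of Willis including normal variants.',
    'Write short notes on the anatomy of the axis (C2 vertebra).',
    'Describe the blood supply of the femoral head.',  # made-up non-matching example
]

topics = [extract_topic(q) for q in questions]
print(topics)
```

Returning None for non-matching questions lets you filter them out or inspect them later, rather than stopping at the first unexpected line.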

How do I extract the topic from a sentence?
