I want to reduce each question in a file of questions to its topic. Here is an example of the format, using two questions from the file:
question1 = 'Write short notes on the anatomy of the Circle of Willis including normal variants.'
question2 = 'Write short notes on the anatomy of the axis (C2 vertebra).'
From the questions above, I want to get the following topics:
topic1 = 'Circle of Willis including normal variants'
topic2 = 'axis (C2 vertebra)'
For this, I wrote the following snippet:
import re

def extract_topic(message):
    message = re.search('Write short notes on the anatomy of the (.+?).', message)
    if message:
        return message.group(1)
Of course, the code above fails! What am I doing wrong? What is the simplest way to do this? Would using NLTK make it any easier?
Answer 0 (score: 2)
Try this:
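import re

def extract_topic(message):
    # The greedy (.*) consumes the rest of the line, then backtracks one character
    # so that the final . (any character) can match; this drops the last character,
    # which here is the trailing period.
    message = re.search('Write short notes on the anatomy of the (.*).', message)
    if message:
        return message.group(1)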
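Answer 1 (score: 0)
You need to escape the trailing dot (write \. instead of .), because an unescaped . matches any character except a newline. Also, (.+?) is non-greedy, so it matches a single character, and the . after it matches one more. The code below should work: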
import re

def extract_topic(message):
    # Raw string so that \. reaches the regex engine as an escaped, literal period
    message = re.search(r'Write short notes on the anatomy of the (.+?)\.', message)
    if message:
        return message.group(1)
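To make the difference concrete, here is a small comparison of the question's pattern against an escaped one. It is only a sketch: the `$` anchor is an added assumption (each question is a single sentence ending in a period), and the names `broken`, `fixed`, `broken_results`, and `fixed_results` are illustrative, not from the original post.

```python
import re

questions = [
    'Write short notes on the anatomy of the Circle of Willis including normal variants.',
    'Write short notes on the anatomy of the axis (C2 vertebra).',
]

# Pattern from the question: lazy group followed by an unescaped dot.
broken = re.compile(r'Write short notes on the anatomy of the (.+?).')

# Escaped dot, anchored to the end of the string (the anchor is an assumption:
# it requires the question to end with a period).
fixed = re.compile(r'Write short notes on the anatomy of the (.+?)\.$')

broken_results = [broken.search(q).group(1) for q in questions]
fixed_results = [fixed.search(q).group(1) for q in questions]

print(broken_results)  # ['C', 'a']
print(fixed_results)
```

The lazy group in the first pattern stops after one character because the unescaped dot is satisfied by whatever character comes next. With the anchor, a period inside the topic itself would not cut the match short.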
Answer 2 (score: 0)
If your data always has the format shown, a fairly simple solution is:
question1 = 'Write short notes on the anatomy of the Circle of Willis including normal variants.'
question2 = 'Write short notes on the anatomy of the axis (C2 vertebra).'
list_of_questions = [question1, question2]
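# Split off the fixed prefix, then strip the trailing period so the result matches the desired topics
topics = [question.split("Write short notes on the anatomy of the ")[1].rstrip(".") for question in list_of_questions]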
print(topics)
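If some lines in the file might not start with the expected wording, indexing the result of split with [1] raises an IndexError. A slightly more defensive variant, without regular expressions, could look like the sketch below. The constant `PREFIX` and the choice to return None for non-matching lines are assumptions made for illustration.

```python
PREFIX = 'Write short notes on the anatomy of the '

def extract_topic(question):
    """Return the topic, or None if the question lacks the expected prefix."""
    question = question.strip()
    if not question.startswith(PREFIX):
        return None
    topic = question[len(PREFIX):]
    # Drop a single trailing period, if present.
    if topic.endswith('.'):
        topic = topic[:-1]
    return topic

questions = [
    'Write short notes on the anatomy of the Circle of Willis including normal variants.',
    'Write short notes on the anatomy of the axis (C2 vertebra).',
]
topics = [extract_topic(q) for q in questions]
print(topics)
```

For a format this rigid, plain string methods are enough, so NLTK is probably not needed here.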