I have a large forum about dogs with tagged posts. An index score built from document frequency × text frequency gives me a good measure of how topical a word is for a given tag. For example:
print(getscores('dog food'))
# keyword scores range between 1 and 2
# {'dog':2,'food':1.8,'bowl':1.7,'consumption':1.5, ..... 'like':1.00001}
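(getscores() itself isn't shown above; just to give an idea, a hypothetical sketch with scikit-learn's TfidfVectorizer could look like the following. The rescaling into the [1, 2] range is an assumption for illustration, not my real code.)
from sklearn.feature_extraction.text import TfidfVectorizer

def getscores_sketch(topic_posts):
    # hypothetical: topic_posts is a list of post texts that share one tag
    vec = TfidfVectorizer()
    tfidf = vec.fit_transform(topic_posts)        # one row per post
    weights = tfidf.mean(axis=0).A1               # mean tf-idf per term
    lo, hi = weights.min(), weights.max()
    scaled = 1 + (weights - lo) / max(hi - lo, 1e-12)  # rescale into [1, 2]
    return dict(zip(vec.get_feature_names_out(), scaled))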
From there, scoring sentences and finding the one that best represents the topic seemed easy, or so I thought. In this example, the second sentence is clearly the best fit.
def method1(sen):
    # product of the keyword weights of every word in the sentence
    score = 1
    for word in sen.split():
        score = score * scores.get(word, 1)
    return score

def method2(sen):
    # same product score, but normalized by the number of words
    score = 1
    for word in sen.split():
        score = score * scores.get(word, 1)
    return score / len(sen.split())
scores = {'dog': 2, 'food': 1.8, 'bowl': 1.7, 'consumption': 1.5, 'intended': 1.4}
sens = ['dog food',
        'dog food is food intended for consumption by dogs',
        'like this one time at band camp there was all this food and and a dog this dog who ate all the food and then my bowl was empty']
for sen in sens:
    print(sen)
    print(method1(sen))
    print(method2(sen))
#dog food
#3.6
#1.8 (winner method 2)
#dog food is food intended for consumption by dogs
#13.607999999999999
#1.5119999999999998
#like this one time at band camp there was all this food and and a dog this dog who ate all the food and then my bowl was empty
#22.032220320000004 (winner method 1)
#0.7868650114285716
Averaging the score favors short sentences, while the raw product favors long ones. Compensating for sentence length (multiplying by roughly 0.92 per word) works for one topic, but the next topic needs a different factor.
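Roughly, the compensation I mean looks like this (the 0.92 and the name method3 are illustrative only; the factor would have to be retuned per topic):
def method3(sen, damping=0.92):
    # same product score as method1, damped by a constant per-word
    # factor to offset sentence length
    score = 1
    for word in sen.split():
        score = score * scores.get(word, 1) * damping
    return score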
So this approach gets me nowhere. Is there a known sentence-scoring method that returns the sentence with the highest keyword weight while also taking keyword density and sentence length into account?
Answer 0 (score: 0)
The results may improve if you use multi-word expressions (MWEs) in the processing pipeline. This preprocessing would normally be done before the TfIdf step. The code below shows how to use them:
from nltk.tokenize import MWETokenizer
# Instantiate the tokenizer with a list of MWEs:
tokenizer = MWETokenizer([('dog', 'food'), ('band', 'camp')])
tl1 = tokenizer.tokenize('dog food is food intended for consumption by dogs'.split())
print(tl1)
tl2 = tokenizer.tokenize('like this one time at band camp there was all this food and and a dog this dog who ate all the food and then my bowl was empty'.split())
print(tl2)
#['dog_food', 'is', 'food', 'intended', 'for', 'consumption', 'by', 'dogs']
#['like', 'this', 'one', 'time', 'at', 'band_camp', 'there', 'was', 'all', 'this', 'food', 'and', 'and', 'a', 'dog', 'this', 'dog', 'who', 'ate', 'all', 'the', 'food', 'and', 'then', 'my', 'bowl', 'was', 'empty']
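Once an MWE is a single token it can carry its own weight in the scores dictionary. A small sketch (the 2.5 weight for 'dog_food' and the mwe_scores/score_tokens names are assumed, not taken from the question):
# Assumed weights for illustration; 'dog_food' now gets one combined score
# instead of being split across 'dog' and 'food'.
mwe_scores = {'dog_food': 2.5, 'food': 1.8, 'bowl': 1.7, 'consumption': 1.5, 'intended': 1.4}

def score_tokens(tokens):
    # product of token weights, normalized by token count (as in method2)
    score = 1
    for token in tokens:
        score = score * mwe_scores.get(token, 1)
    return score / len(tokens)

print(score_tokens(tl1))  # the MWE-tokenized 'dog food is food intended ...' sentence
print(score_tokens(tl2))  # the MWE-tokenized 'band camp' sentence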
spaCy's dependency parser and POS tagger are useful for extracting such MWEs. The example below detects compound nouns that are likely candidates for MWEs:
import spacy
nlp = spacy.load('en_core_web_sm')
sens = ['dog food',
        'dog food is food intended for consumption by dogs',
        'like this one time at band camp there was all this food and and a dog this dog who ate all the food and then my bowl was empty']
def getCompoundNouns(sentence):
    doc = nlp(sentence)
    answer = []
    for t in doc:
        # a noun acting as a compound modifier of the following noun
        if t.dep_ == 'compound' and t.pos_ == 'NOUN':
            neighboringToken = t.nbor()
            if neighboringToken.pos_ == 'NOUN':
                answer.append((t.text, neighboringToken.text))
    if not answer:
        return None
    return answer
for s in sens:
    print(getCompoundNouns(s))
#[('dog', 'food')]
#[('dog', 'food')]
#[('band', 'camp')]
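As a possible next step (not part of the code above), the detected pairs could be fed straight into the MWETokenizer from the first snippet:
# Hypothetical glue code: use the detected compound-noun pairs as the MWE list.
from nltk.tokenize import MWETokenizer

mwes = set()
for s in sens:
    pairs = getCompoundNouns(s)
    if pairs:
        mwes.update(pairs)

auto_tokenizer = MWETokenizer(list(mwes))
print(auto_tokenizer.tokenize(sens[1].split()))
# expected: ['dog_food', 'is', 'food', 'intended', 'for', 'consumption', 'by', 'dogs']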