I have a large forum about dogs with tagged posts. An index score built from document frequency × text frequency gives me a good measure of how topical a word is for a given tag. For example:
print(getscores('dog food'))
# keyword scores range between 1 and 2
# {'dog':2,'food':1.8,'bowl':1.7,'consumption':1.5, ..... 'like':1.00001}
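(getscores() itself isn't shown above; just to give an idea, a hypothetical sketch with scikit-learn's TfidfVectorizer could look like the following. The rescaling into the [1, 2] range is an assumption for illustration, not my real code.)
from sklearn.feature_extraction.text import TfidfVectorizer

def getscores_sketch(topic_posts):
    # hypothetical: topic_posts is a list of post texts that share one tag
    vec = TfidfVectorizer()
    tfidf = vec.fit_transform(topic_posts)        # one row per post
    weights = tfidf.mean(axis=0).A1               # mean tf-idf per term
    lo, hi = weights.min(), weights.max()
    scaled = 1 + (weights - lo) / max(hi - lo, 1e-12)  # rescale into [1, 2]
    return dict(zip(vec.get_feature_names_out(), scaled))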
From there, scoring sentences and finding the one that best represents the topic seemed easy, or so I thought. In this example, the second sentence is clearly the best fit.
def method1(sen):
    # product of the keyword weights of every word in the sentence
    score = 1
    for word in sen.split():
        score = score * scores.get(word, 1)
    return score

def method2(sen):
    # same product score, but normalized by the number of words
    score = 1
    for word in sen.split():
        score = score * scores.get(word, 1)
    return score / len(sen.split())
scores = {'dog': 2, 'food': 1.8, 'bowl': 1.7, 'consumption': 1.5, 'intended': 1.4}
sens = ['dog food',
        'dog food is food intended for consumption by dogs',
        'like this one time at band camp there was all this food and and a dog this dog who ate all the food and then my bowl was empty']
for sen in sens:
    print(sen)
    print(method1(sen))
    print(method2(sen))
#dog food
#3.6
#1.8 (winner method 2)
#dog food is food intended for consumption by dogs
#13.607999999999999
#1.5119999999999998
#like this one time at band camp there was all this food and and a dog this dog who ate all the food and then my bowl was empty
#22.032220320000004 (winner method 1)
#0.7868650114285716
Averaging the score favors short sentences, while the raw product favors long ones. Compensating for sentence length (multiplying by roughly 0.92 per word) works for one topic, but the next topic needs a different factor.
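Roughly, the compensation I mean looks like this (the 0.92 and the name method3 are illustrative only; the factor would have to be retuned per topic):
def method3(sen, damping=0.92):
    # same product score as method1, damped by a constant per-word
    # factor to offset sentence length
    score = 1
    for word in sen.split():
        score = score * scores.get(word, 1) * damping
    return score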
So this approach gets me nowhere. Is there a known sentence-scoring method that returns the sentence with the highest keyword weight while also taking keyword density and sentence length into account?
Answer 0 (score: 0)
The results may improve if you use multi-word expressions (MWEs) in the processing pipeline. This preprocessing would normally be done before the TfIdf step. The code below shows how to use them:
from nltk.tokenize import MWETokenizer
# Instantiate the tokenizer with a list of MWEs:
tokenizer = MWETokenizer([('dog', 'food'), ('band', 'camp')])
tl1 = tokenizer.tokenize('dog food is food intended for consumption by dogs'.split())
print(tl1)
tl2 = tokenizer.tokenize('like this one time at band camp there was all this food and and a dog this dog who ate all the food and then my bowl was empty'.split())
print(tl2)
#['dog_food', 'is', 'food', 'intended', 'for', 'consumption', 'by', 'dogs']
#['like', 'this', 'one', 'time', 'at', 'band_camp', 'there', 'was', 'all', 'this', 'food', 'and', 'and', 'a', 'dog', 'this', 'dog', 'who', 'ate', 'all', 'the', 'food', 'and', 'then', 'my', 'bowl', 'was', 'empty']
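Once an MWE is a single token it can carry its own weight in the scores dictionary. A small sketch (the 2.5 weight for 'dog_food' and the mwe_scores/score_tokens names are assumed, not taken from the question):
# Assumed weights for illustration; 'dog_food' now gets one combined score
# instead of being split across 'dog' and 'food'.
mwe_scores = {'dog_food': 2.5, 'food': 1.8, 'bowl': 1.7, 'consumption': 1.5, 'intended': 1.4}

def score_tokens(tokens):
    # product of token weights, normalized by token count (as in method2)
    score = 1
    for token in tokens:
        score = score * mwe_scores.get(token, 1)
    return score / len(tokens)

print(score_tokens(tl1))  # the MWE-tokenized 'dog food is food intended ...' sentence
print(score_tokens(tl2))  # the MWE-tokenized 'band camp' sentence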
spaCy's dependency parser and POS tagger are useful for extracting such MWEs. The example below detects compound nouns that are likely candidates for MWEs:
import spacy
nlp = spacy.load('en_core_web_sm')
sens = ['dog food',
        'dog food is food intended for consumption by dogs',
        'like this one time at band camp there was all this food and and a dog this dog who ate all the food and then my bowl was empty']
def getCompoundNouns(sentence):
    doc = nlp(sentence)
    answer = []
    for t in doc:
        # a noun acting as a compound modifier of the following noun
        if t.dep_ == 'compound' and t.pos_ == 'NOUN':
            neighboringToken = t.nbor()
            if neighboringToken.pos_ == 'NOUN':
                answer.append((t.text, neighboringToken.text))
    if not answer:
        return None
    return answer
for s in sens:
    print(getCompoundNouns(s))
#[('dog', 'food')]
#[('dog', 'food')]
#[('band', 'camp')]
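As a possible next step (not part of the code above), the detected pairs could be fed straight into the MWETokenizer from the first snippet:
# Hypothetical glue code: use the detected compound-noun pairs as the MWE list.
from nltk.tokenize import MWETokenizer

mwes = set()
for s in sens:
    pairs = getCompoundNouns(s)
    if pairs:
        mwes.update(pairs)

auto_tokenizer = MWETokenizer(list(mwes))
print(auto_tokenizer.tokenize(sens[1].split()))
# expected: ['dog_food', 'is', 'food', 'intended', 'for', 'consumption', 'by', 'dogs']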