Question

I want to score/grade the responses entered by different users. For this I used multinomial naive Bayes. My code is below.

# use natural language toolkit
import nltk
from nltk.stem.lancaster import LancasterStemmer
import os
import json
import datetime
stemmer = LancasterStemmer()   
# 3 classes of training data
training_data = []
# capture unique stemmed words in the training corpus
class_words={}
corpus_words = {}
classes = list(set([a['class'] for a in training_data]))
for c in classes:
    class_words[c] = []

for data in training_data:
    # tokenize each sentence into words
    for word in nltk.word_tokenize(data['sentence']):
        # ignore a few things
        if word not in ["?", "'s"]:
            # stem and lowercase each word
            stemmed_word = stemmer.stem(word.lower())
            if stemmed_word not in corpus_words:
                corpus_words[stemmed_word] = 1
            else:
                corpus_words[stemmed_word] += 1

            class_words[data['class']].extend([stemmed_word])

# we now have each word and the number of occurrences of the word in our training corpus (the word's commonality)
print ("Corpus words and counts: %s" % corpus_words)
# also we have all words in each class
print ("Class words: %s" % class_words)
sentence="The biggest advantages to a JavaScript having a ability to support all modern browser and produce the same result."
def calculate_class_score(sentence, class_name):
    score = 0
    for word in nltk.word_tokenize(sentence):
        if word in class_words[class_name]:
            score += 1
    return score
for c in class_words.keys():
    print ("Class: %s  Score: %s" % (c, calculate_class_score(sentence, c)))
# calculate a score for a given class taking into account word commonality
def calculate_class_score_commonality(sentence, class_name):
    score = 0
    for word in nltk.word_tokenize(sentence):
        if word in class_words[class_name]:
            score += (1 / corpus_words[word])
    return score
# now we can find the class with the highest score
for c in class_words.keys():
    print ("Class: %s  Score: %s" % (c, calculate_class_score_commonality(sentence, c)))
def find_class(sentence):
    high_class = None
    high_score = 0
    for c in class_words.keys():
        score = calculate_class_score_commonality(sentence, c)
        if score > high_score:
            high_class = c
            high_score = score
    return high_class, high_score

Note: I have not added any training data to the code above.

When I give the input

find_class("the biggest advantages to a JavaScript having a ability to
 support all modern browser and produce the same result.JavaScript
 small bit of code you can test")

I get the output

('Advantages', 5.07037037037037)

But when I give the input

find_class("JavaScript can be executed within the user's browser
without having to communicate with the server, saving on bandwidth")

I get the response/output

('Advantages', 2.0454545)

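I am building this for JavaScript interview/viva questions. When a user types the same answer in a different way, as in the two examples above, I get a different score. I want the score to be accurate. How can I do that?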

Answer 1

Multinomial naive Bayes will output different scores for different inputs. In fact, this is true of any classification algorithm.

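The only way for two different sentences to get the same score is for them to contain exactly the same words (in a different order or frequency).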

For more details, see the algorithm's definition.
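
To make the point about word order concrete, here is a minimal from-scratch sketch of multinomial naive Bayes scoring. It is not the asker's code: the training sentences, the class names, and the helpers `tokenize`, `train` and `log_score` are all made up for illustration, and add-one smoothing is an assumed choice.

```python
import math
from collections import Counter

# Hypothetical toy training data; class names and sentences are made up.
TRAINING = [
    ("Advantages", "javascript runs in every modern browser"),
    ("Advantages", "javascript saves bandwidth on the server"),
    ("Disadvantages", "javascript can be disabled in the browser"),
    ("Disadvantages", "javascript code is visible to every user"),
]


def tokenize(text):
    """Lowercase and split on whitespace (no stemming, to keep the sketch short)."""
    return text.lower().split()


def train(examples):
    """Collect per-class word counts, per-class document counts and the vocabulary."""
    word_counts = {}
    doc_counts = Counter()
    vocab = set()
    for label, sentence in examples:
        tokens = tokenize(sentence)
        word_counts.setdefault(label, Counter()).update(tokens)
        doc_counts[label] += 1
        vocab.update(tokens)
    return word_counts, doc_counts, vocab


def log_score(sentence, label, word_counts, doc_counts, vocab):
    """Unnormalized log P(label) + sum of log P(word | label), add-one smoothed."""
    counts = word_counts[label]
    total = sum(counts.values())
    terms = [math.log(doc_counts[label] / sum(doc_counts.values()))]
    for word in tokenize(sentence):
        if word in vocab:  # words never seen in training are skipped
            terms.append(math.log((counts[word] + 1) / (total + len(vocab))))
    # fsum returns the correctly rounded sum, so term order cannot change it
    return math.fsum(terms)


model = train(TRAINING)
a = log_score("javascript saves bandwidth", "Advantages", *model)
b = log_score("bandwidth saves javascript", "Advantages", *model)
c = log_score("javascript saves bandwidth on the server", "Advantages", *model)
print(a, b, c)
```

In this sketch `a` and `b` come out equal because the score is just a sum over the words present, while `c` is lower because every extra word adds another negative log-probability. That is also why raw scores of answers with different lengths are not directly comparable.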

Answer 2

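Multinomial naive Bayes compares word occurrences. It does not take word order into account, because it assumes every feature is independent of the others. So semantic similarity (different sentences, same meaning) is not always an easy problem to solve with naive Bayes.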

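If, in your case, semantic similarity correlates fairly directly with the words that are present (so that order can be ignored to some extent), then you can try the following: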

Play around with the data. See what results techniques such as stop word removal or TF-IDF yield.
See whether Word2Vec (or Doc2Vec) gives you better results.
Use more training data.
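
As a rough illustration of the first suggestion, the sketch below combines stop word removal, TF-IDF weighting and cosine similarity using only the standard library. Everything in it is an assumption made for the example: the tiny `STOP_WORDS` list, the smoothed idf formula, the helper names, and the `reference` answer, which is a hypothetical model answer and not part of the question.

```python
import math
from collections import Counter

# Assumption: a tiny hand-picked stop word list; a real one would be much longer.
STOP_WORDS = {"a", "an", "the", "to", "of", "and", "on", "in", "is", "be", "can", "with"}


def tokenize(text):
    """Lowercase, strip surrounding punctuation, drop stop words."""
    words = (w.strip(".,?!'\"") for w in text.lower().split())
    return [w for w in words if w and w not in STOP_WORDS]


def tfidf_vectors(documents):
    """Return one {word: tf-idf weight} dict per document."""
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)
    doc_freq = Counter(word for tokens in tokenized for word in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            # smoothed idf, so words present in every document keep a small weight
            word: (count / len(tokens))
            * (math.log((1 + n_docs) / (1 + doc_freq[word])) + 1)
            for word, count in counts.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(word, 0.0) for word, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Hypothetical reference answer; the first user answer is taken from the question.
reference = "JavaScript is executed in the user's browser, saving bandwidth on the server."
answers = [
    "JavaScript can be executed within the user's browser without having to "
    "communicate with the server, saving on bandwidth",
    "JavaScript was created in 1995",
]
ref_vec, *answer_vecs = tfidf_vectors([reference] + answers)
scores = [cosine(ref_vec, vec) for vec in answer_vecs]
print(scores)
```

Because cosine similarity is normalized by vector length, a long answer does not score higher merely for containing more words, which is one of the effects visible in the scores from the question. This still only measures word overlap, so a paraphrase that uses different vocabulary would score low.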

These are fairly lazy suggestions, given without really knowing what your data looks like.

Scoring multiple responses from different users
