Question

The code below correctly classifies sentences by temperature, for example "I like drinking tea", "let's go sunbathing", and so on:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

rawReviews = ["Sun is hot", "Moon is cold", "Tea is hot", "Icecream is cold", "Milk is cold"]
rawLabels = [1, 0, 1, 0, 0]  # 1 - hot review, 0 - cold review

# Word-level TF-IDF without IDF smoothing, sublinear TF scaling, or normalization
tf = TfidfVectorizer(smooth_idf=False, sublinear_tf=False, norm=None, analyzer='word')

# Learn the vocabulary from the training sentences, then vectorize them
txt_fitted = tf.fit(rawReviews)
txt_transformed = txt_fitted.transform(rawReviews)

model = MultinomialNB()
model.fit(txt_transformed, rawLabels)

sentenceToClassify = "i like vodka"
new_X = tf.transform([sentenceToClassify])
predicted_Y = model.predict(new_X)  # array with one label per input sentence
probability = model.predict_proba(new_X)
print("0 is cold; 1 is hot\nSentence: %s \n\nPredicted: %d probability of cold: %f, probability of hot: %f" %
      (sentenceToClassify, predicted_Y[0], probability[0][0], probability[0][1]))
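As a quick check of what the vectorizer does with this sentence, the sketch below prints the learned vocabulary and the number of non-zero entries in the transformed row. It assumes the same training sentences and `TfidfVectorizer` settings as above; the names `raw_reviews` and `vocabulary` are introduced here only for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

raw_reviews = ["Sun is hot", "Moon is cold", "Tea is hot", "Icecream is cold", "Milk is cold"]

tf = TfidfVectorizer(smooth_idf=False, sublinear_tf=False, norm=None, analyzer='word')
tf.fit(raw_reviews)

# Words learned from the training sentences (lowercased by default).
vocabulary = sorted(tf.vocabulary_)
print(vocabulary)

# Words that are not in the vocabulary are ignored by transform(),
# so a sentence made only of unseen words has no non-zero features.
new_X = tf.transform(["i like vodka"])
print(new_X.shape, new_X.nnz)
```

Since none of "i", "like", or "vodka" appears in the training sentences, the classifier is effectively given an empty feature vector.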

However, it is not clear to me how the probabilities are decided for completely unknown data. For example:

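The code above classifies "i like vodka" as 50/50. Shouldn't the probability be 0 for both hot and cold?
If a third label is added ("2 - warm"), the classification probabilities look something like array([[0.4, 0.4, 0.2]]). Again, shouldn't the probabilities be split evenly, or simply be 0?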
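One way to probe this, offered as a sketch rather than a definitive answer: in scikit-learn's `MultinomialNB`, each class score is the log class prior plus a sum of feature values weighted by per-class log likelihoods, so an input with no non-zero features should be left with the priors alone. The snippet below compares `predict_proba` with `exp(class_log_prior_)` for the five training sentences above; the names `raw_reviews`, `raw_labels`, and `priors` are introduced here only for illustration, and the default `fit_prior=True` is assumed.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

raw_reviews = ["Sun is hot", "Moon is cold", "Tea is hot", "Icecream is cold", "Milk is cold"]
raw_labels = [1, 0, 1, 0, 0]  # 1 - hot review, 0 - cold review

tf = TfidfVectorizer(smooth_idf=False, sublinear_tf=False, norm=None, analyzer='word')
X = tf.fit_transform(raw_reviews)
model = MultinomialNB().fit(X, raw_labels)

# A sentence with no known words becomes an all-zero feature row.
new_X = tf.transform(["i like vodka"])
probability = model.predict_proba(new_X)

# Class priors estimated from the label frequencies in the training data.
priors = np.exp(model.class_log_prior_)

print("classes:      ", model.classes_)
print("priors:       ", priors)
print("predict_proba:", probability[0])
```

With three cold and two hot training labels the priors are 0.6 and 0.4, so a 50/50 result would be expected only from a balanced training set. The three-label output of [0.4, 0.4, 0.2] would likewise be consistent with class frequencies of 2, 2, and 1. Because the probabilities are normalized to sum to 1, they cannot all be 0.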

How does multinomial naive Bayes text classification handle unknown data?

0 answers: