How does multinomial Naive Bayes text classification handle unknown data?

Time: 2018-08-17 19:08:40

Tags: machine-learning scikit-learn classification naivebayes multinomial

The code below correctly classifies the "temperature" of sentences such as "I like tea" or "let's go out in the sun":

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

rawReviews = ["Sun is hot", "Moon is cold", "Tea is hot", "Icecream is cold", "Milk is cold"]
rawLabels = [1, 0, 1, 0, 0]  # 1 - hot review, 0 - cold review

# Raw TF-IDF weights: no IDF smoothing, no sublinear TF scaling, no normalization
tf = TfidfVectorizer(smooth_idf=False, sublinear_tf=False, norm=None, analyzer='word')

# Learn the vocabulary and IDF weights, then vectorize the training sentences
txt_fitted = tf.fit(rawReviews)
txt_transformed = txt_fitted.transform(rawReviews)

model = MultinomialNB()
model.fit(txt_transformed, rawLabels)

sentenseToClassify = "i like vodka"
new_X = tf.transform([sentenseToClassify])
predicted_Y = model.predict(new_X)  # array with one label per input sentence
probability = model.predict_proba(new_X)  # columns follow model.classes_: [0, 1]
print("0 is cold; 1 is hot\nSentence: %s \n\nPredicted: %d probability of cold: %f, probability of hot %f" %
      (sentenseToClassify, predicted_Y[0], probability[0][0], probability[0][1]))
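
To see what "completely unknown" means for this pipeline, it can help to look at the feature vector the vectorizer produces. The sketch below reuses the training sentences and vectorizer settings from the code above; `count_known_features` is a hypothetical helper added for illustration only. It relies on the default `TfidfVectorizer` behavior of lowercasing text and dropping single-character tokens such as "i".

```python
from sklearn.feature_extraction.text import TfidfVectorizer

rawReviews = ["Sun is hot", "Moon is cold", "Tea is hot", "Icecream is cold", "Milk is cold"]

tf = TfidfVectorizer(smooth_idf=False, sublinear_tf=False, norm=None, analyzer='word')
tf.fit(rawReviews)

# Words the vectorizer learned from the training sentences
vocabulary = sorted(tf.vocabulary_)
print(vocabulary)


def count_known_features(sentence):
    """Hypothetical helper: count vocabulary words with a non-zero weight in `sentence`."""
    row = tf.transform([sentence]).toarray()[0]
    return int((row != 0).sum())


# None of the words in "i like vodka" appear in the vocabulary,
# so the sentence is mapped to an all-zero feature vector.
unknown_count = count_known_features("i like vodka")
known_count = count_known_features("Tea is hot")
print(unknown_count, known_count)
```

If the count is zero, the classifier receives no word evidence at all for that sentence, which is the situation the questions below are about.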

However, it is not clear to me how the probabilities are decided for completely unknown data. For example:

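  1. The code above classifies "i like vodka" as 50/50. Shouldn't it be classified as 0 (cold) instead?
  2. If a third label is added ("2 - warm"), the classification probabilities look something like array([[0.4, 0.4, 0.2]]). Again, shouldn't the probabilities be split evenly, or just be 0?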
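
One way to probe both questions is to compare the predicted probabilities with the class priors the model learned. When every feature is zero, the per-word likelihood terms in `MultinomialNB` contribute nothing, so the posterior reduces to the prior, which by default (`fit_prior=True`) is the fraction of training samples in each class. The sketch below is a minimal check under that assumption, reusing the data from the question; the variable names `proba_default`, `prior_default`, and `proba_uniform` are introduced here for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

rawReviews = ["Sun is hot", "Moon is cold", "Tea is hot", "Icecream is cold", "Milk is cold"]
rawLabels = [1, 0, 1, 0, 0]  # 1 - hot review, 0 - cold review

tf = TfidfVectorizer(smooth_idf=False, sublinear_tf=False, norm=None, analyzer='word')
X = tf.fit_transform(rawReviews)
new_X = tf.transform(["i like vodka"])  # no known words: all-zero row

# Default model: class priors are estimated from label frequencies (3 cold, 2 hot)
model = MultinomialNB().fit(X, rawLabels)
proba_default = model.predict_proba(new_X)
prior_default = np.exp(model.class_log_prior_)

# Alternative: force uniform priors so unseen input is split evenly
uniform_model = MultinomialNB(fit_prior=False).fit(X, rawLabels)
proba_uniform = uniform_model.predict_proba(new_X)

print(proba_default)  # matches prior_default
print(prior_default)
print(proba_uniform)
```

With the five training sentences shown here, the default model should report roughly 0.6 for cold and 0.4 for hot, not 50/50; an even split would be consistent with balanced training labels or with `fit_prior=False`. By the same reasoning, a three-class result such as `[0.4, 0.4, 0.2]` would be consistent with training labels in a 2:2:1 ratio. The probabilities cannot all be 0, since `predict_proba` returns a distribution that sums to 1 across classes.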

0 answers:

No answers yet.