The code below correctly classifies the "temperature" of sentences such as "I like drinking tea", "let's go sunbathe", and so on:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

rawReviews = ["Sun is hot", "Moon is cold", "Tea is hot", "Icecream is cold", "Milk is cold"]
rawLabels = [1, 0, 1, 0, 0]  # 1 - hot review, 0 - cold review
reviews, Y = rawReviews, rawLabels

# Build TF-IDF features from the training sentences
tf = TfidfVectorizer(smooth_idf=False, sublinear_tf=False, norm=None, analyzer='word')
txt_fitted = tf.fit(reviews)
txt_transformed = txt_fitted.transform(reviews)

# Train a multinomial naive Bayes classifier
model = MultinomialNB()
model.fit(txt_transformed, Y)

# Classify a sentence whose words never appear in the training data
sentenceToClassify = "i like vodka"
new_X = tf.transform([sentenceToClassify])
predicted_Y = model.predict(new_X)
probability = model.predict_proba(new_X)
print("0 is cold; 1 is hot\nSentence: %s \n\nPredicted: %d probability of cold: %f, probability of hot %f" %
      (sentenceToClassify, predicted_Y[0], probability[0][0], probability[0][1]))
However, it is not clear to me how the probabilistic decision is made for completely unknown data. For example:
array([[0.4, 0.4, 0.2]])
Again, shouldn't the probability be split evenly, or simply be 0?
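One way to look into this is to check what the vectorizer actually produces for the unseen sentence and compare the predicted probabilities with the class priors the model learned. The following is only a minimal sketch, assuming the same five training sentences and default `MultinomialNB` settings (`fit_prior=True`); the snake_case variable names are introduced here for illustration. Since none of the words in "i like vodka" are in the fitted vocabulary, the feature vector should be all zeros, so the likelihood term contributes nothing and the prediction falls back to the label frequencies in the training set (3 cold, 2 hot) rather than an even split.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

raw_reviews = ["Sun is hot", "Moon is cold", "Tea is hot", "Icecream is cold", "Milk is cold"]
raw_labels = [1, 0, 1, 0, 0]  # 1 - hot, 0 - cold

tf = TfidfVectorizer(smooth_idf=False, sublinear_tf=False, norm=None, analyzer="word")
X = tf.fit_transform(raw_reviews)
model = MultinomialNB().fit(X, raw_labels)

# Vectorize a sentence made only of words the vectorizer has never seen
new_X = tf.transform(["i like vodka"])
print("non-zero features:", new_X.nnz)  # 0: every word is out of vocabulary

# Class priors learned from label frequencies, ordered as model.classes_ ([0, 1])
priors = np.exp(model.class_log_prior_)
proba = model.predict_proba(new_X)

print("class priors:   ", priors)    # approximately [0.6 0.4]
print("predicted proba:", proba[0])  # matches the priors for an all-zero vector
```

With a balanced training set the two probabilities would come out equal; passing `fit_prior=False` to `MultinomialNB` should also give a uniform prior regardless of the label counts.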