Question

I have a problem. I have two datasets, AdultTest and AdultData. Both contain many rows like this:

39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, Not-in-family, White, Female , 2174, 0, 40, United-States, >50K

I want to calculate the probability that a "Female" earns > 50K. To do that, I wrote this:

import numpy as np
from sklearn.naive_bayes import BernoulliNB

# Read AdultData.csv and adultTest.csv as integers so I can run Naive Bayes
data1 = np.genfromtxt('AdultData.csv', delimiter=',', dtype='int', skip_footer=1)
datatest = np.genfromtxt('adultTest.csv', delimiter=',', dtype='int', skip_footer=1)

# Delete the last column (index 14), because it is the target
data_new = np.delete(data1, 14, 1)
dataTest_new = np.delete(datatest, 14, 1)

# Training labels: the target column of the training data
class_ = [row[14] for row in data1]

clf = BernoulliNB()
clf.fit(data_new, class_)
print(clf.predict_proba(dataTest_new))

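The result is the predicted probabilities, and I always get:

[1. 0.]

But I don't know why. Even when I pass in AdultTest (which contains different data), I get the same result.

Why don't I get a different result? Also, why do I have 2 columns?

P.S. I am doing this because I want to implement the massaging algorithm for classification without discrimination.

Can anyone help?

Thanks!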

Answer 1

I think there is a logic error in your code, because you never use dataTest_new:

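# Drop the target column (index 14) from the training and test features
data_new = np.delete(data1, 14, 1)
dataTest_new = np.delete(datatest, 14, 1)

# Training labels come from the target column of the training data
class_ = [row[14] for row in data1]

clf = BernoulliNB()
clf.fit(data_new, class_)
# you should run prediction on test data
print(clf.predict_proba(dataTest_new))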
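
As for the two columns: predict_proba returns one column per class, in the order given by clf.classes_, so a binary target gives two columns. One possible reason for identical rows, which I cannot confirm without your data, is that BernoulliNB binarizes every feature at binarize=0.0 by default, so all positive integers become 1 and every row looks the same to the model. Here is a minimal sketch on made-up toy data; the arrays, the 0/1 label encoding, and the per-column median thresholds are assumptions for illustration only, not your real columns:

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

# Hypothetical toy data: three positive integer features per row,
# standing in for integer-encoded Adult columns.
X_train = np.array([
    [39, 13, 40],
    [50,  9, 13],
    [28, 14, 45],
    [61, 10, 20],
])
y_train = np.array([0, 1, 0, 1])  # assumed encoding: 0 is <=50K, 1 is >50K

X_test = np.array([
    [25, 16, 35],
    [70,  7, 10],
])

# Default BernoulliNB: binarize=0.0 maps every value > 0 to 1,
# so all of these rows collapse to [1, 1, 1].
clf_default = BernoulliNB()
clf_default.fit(X_train, y_train)
proba_default = clf_default.predict_proba(X_test)

print(clf_default.classes_)  # column order of predict_proba
print(proba_default.shape)   # (number of test rows, number of classes)
print(proba_default)         # both rows are identical

# Illustrative alternative: binarize each column at its training median
# ourselves, then tell BernoulliNB the input is already binary.
thresholds = np.median(X_train, axis=0)
X_train_bin = (X_train > thresholds).astype(int)
X_test_bin = (X_test > thresholds).astype(int)

clf_median = BernoulliNB(binarize=None)
clf_median.fit(X_train_bin, y_train)
proba_median = clf_median.predict_proba(X_test_bin)

print(proba_median)          # the two rows now differ
```

If your real features are mostly positive integers, checking clf.classes_ and trying a different binarization (or a Naive Bayes variant suited to non-binary features) would be a reasonable next step.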
