I have a problem. I have two datasets, AdultTest and AdultData, and each contains many rows like this:
39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, Not-in-family, White, Female , 2174, 0, 40, United-States, >50K
I want to calculate the probability that a "Female" earns >50K. To do that, I wrote this:
import numpy as np
from sklearn.naive_bayes import BernoulliNB
# Read AdultData.csv and adultTest.csv as integers so that Naive Bayes can be computed
data1 = np.genfromtxt('AdultData.csv', delimiter=',', dtype='int', skip_footer=1)
datatest = np.genfromtxt('adultTest.csv', delimiter=',', dtype='int', skip_footer=1)
# Delete the last column, because the last column is the target
data_new = np.delete(data1, 14, 1)
dataTest_new = np.delete(datatest, 14, 1)
# The target is column 14 of the training data
class_ = [row[14] for row in data1]
clf = BernoulliNB()
clf.fit(data_new, class_)
print(clf.predict_proba(dataTest_new))
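The result is the predicted probabilities, and I always get:
[1. 0.]
I don't know why. Even when I pass in AdultTest, which contains different data, I get the same result.
Why don't I get a different result? Also, why are there 2 columns?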
P.S. I'm doing this because I want to implement the massaging algorithm for discrimination-free classification.
Can anyone help?
Thanks!
Answer 0 (score: 0)
I think there is a logic error in your code, because you never use dataTest_new:
data_new = np.delete(data1, 14, 1)
dataTest_new = np.delete(datatest, 14, 1)
class_ = [row[14] for row in data1]
clf = BernoulliNB()
clf.fit(data_new, class_)
# you should run prediction on test data
print(clf.predict_proba(dataTest_new))
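On the two-column question: predict_proba returns one column per class, in the order given by clf.classes_, so with two income classes each row has two probabilities that sum to 1. A likely reason every row comes out the same is the encoding: reading text fields such as "Female" or "Bachelors" with dtype='int' does not produce meaningful category codes, and BernoulliNB then binarizes every feature at 0 by default. The following is only a minimal sketch of one way to handle this, not the asker's method: the rows are made-up toy data (not taken from the Adult files), and one-hot encoding the string columns is an assumption about what would suit BernoulliNB here.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy rows: [sex, education]. Not real Adult data.
X_raw = np.array([
    ["Female", "Bachelors"],
    ["Male", "Bachelors"],
    ["Female", "HS-grad"],
    ["Male", "HS-grad"],
    ["Male", "Masters"],
    ["Female", "Masters"],
])
y = np.array(["<=50K", ">50K", "<=50K", "<=50K", ">50K", ">50K"])

# Turn each category into its own 0/1 feature, which is the
# kind of input BernoulliNB is designed for.
encoder = OneHotEncoder(handle_unknown="ignore")
X = encoder.fit_transform(X_raw)

clf = BernoulliNB()
clf.fit(X, y)

# One probability column per class, ordered as in clf.classes_.
classes = list(clf.classes_)
proba = clf.predict_proba(encoder.transform([["Female", "Bachelors"]]))
p_high = proba[0, classes.index(">50K")]
print(classes)
print(proba)

# If only P(>50K | Female) is needed, it can also be estimated
# by counting, without fitting any classifier.
is_female = X_raw[:, 0] == "Female"
p_empirical = np.mean(y[is_female] == ">50K")
print(p_empirical)
```

Looking up the column through clf.classes_ rather than hard-coding index 1 avoids reading the wrong class if the label order changes. For the real files, the string columns would have to be loaded as text first (for example with dtype=str) before encoding.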