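I wrote a small piece of code to find the best classifier for a CDC dataset. I first tried various scikit-learn classifiers, then decided to add the TF.Learn ones (DNNClassifier and LinearClassifier), since the API is almost the same.
When I compared the results, all the scikit-learn models easily reached 60-70% accuracy, whereas with the TF.Learn DNNClassifier and LinearClassifier I could not get above 38%, and training took a long time (it even hangs if I don't set the number of steps when fitting the model).
I have probably made a mistake, but I don't see it...
Here is the code extract: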
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2)
for classifier in classifiers:
    if classifier == "TF Deep Neural Network":
        feature_columns = learn.infer_real_valued_columns_from_input(X_train)
        clf = DNNClassifier(feature_columns=feature_columns,
                            hidden_units=[10, 10, 10],
                            n_classes=2, enable_centered_bias=None)
        clf.fit(X_train, Y_train, steps=200)
    elif classifier == "TF Linear Classifier":
        feature_columns = learn.infer_real_valued_columns_from_input(X_train)
        clf = LinearClassifier(n_classes=2, feature_columns=feature_columns)
        clf.fit(X_train, Y_train, steps=200)
    else:
        clf = getInstance(classifiers[classifier][0], classifiers[classifier][1], classifiers[classifier][2])
        clf.fit(X_train, Y_train)
    # predict on test data
    prediction = clf.predict(X_test)
    # compute accuracy and sum it to the previous ones
    accuracy = accuracy_score(Y_test, prediction)
Results extract:
classifier Gaussian Naive Bayes accuracy 0.85
classifier K-Nearest Neighbors accuracy 0.87
classifier TF Deep Neural Network accuracy 0.4
classifier Random Forest accuracy 0.85
classifier TF Linear Classifier accuracy 0.4
classifier Decision Tree accuracy 0.87
classifier Neural Network accuracy 0.4
classifier AdaBoost accuracy 0.86
classifier Linear Support Vector Machine accuracy 0.88
classifier Radial Basic Function Support Vector Machine accuracy 0.74
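Full code here: https://github.com/shazz/gender_classification_challenge/blob/master/demo_with_BRFSS_and_TF.py
So any insight into why the TF.Learn accuracy is so low (and why training takes so long) would be much appreciated!
Update based on Kumara's answer
I changed the labels to 0 or 1 (instead of 1 and 2 as in the original CDC dataset) and ran the classifier tests again. The new results are: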
classifier AdaBoost accuracy 0.87
classifier Linear Support Vector Machine accuracy 0.86
classifier K-Nearest Neighbors accuracy 0.86
classifier Gaussian Naive Bayes accuracy 0.85
classifier Random Forest accuracy 0.85
classifier Radial Basic Function Support Vector Machine accuracy 0.83
classifier Decision Tree accuracy 0.83
classifier Neural Network accuracy 0.64
classifier TF Deep Neural Network accuracy 0.63
classifier TF Linear Classifier accuracy 0.62
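So the TF.Learn classifiers are still well behind the scikit-learn ones. What may make sense is that the DNNClassifier is as "bad" as the scikit-learn multilayer perceptron classifier.
Do you think the accuracy of the TF.Learn DNNClassifier and LinearClassifier is normal, given the data and the type of classifier?
Answer 0 (score: 3)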
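The problem is that the TF.learn classifiers expect class labels to be indices (i.e. for a 2-class problem, y must be 0 or 1), whereas the scikit-learn classifiers treat y as arbitrary values (e.g. 77 and 99 are valid values of y in a 2-class problem).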
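A quick way to check this precondition before fitting is sketched below. The helper name `labels_are_class_indices` is hypothetical (it is not part of TF.learn or scikit-learn) and it assumes numeric labels; it only uses NumPy.

```python
import numpy as np


def labels_are_class_indices(y, n_classes):
    """Return True if every label in y is a whole number in [0, n_classes)."""
    y = np.asarray(y)
    if y.size == 0:
        # Nothing to violate the constraint.
        return True
    # Whole-number check so that float arrays such as [0.0, 1.0] also pass.
    is_whole = np.all(y == np.floor(y))
    return bool(is_whole and y.min() >= 0 and y.max() < n_classes)


# Labels 1 and 2 fail the check for a 2-class problem; 0 and 1 pass.
print(labels_are_class_indices([1, 2, 2, 1], n_classes=2))
print(labels_are_class_indices([0, 1, 1, 0], n_classes=2))
```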
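In this case, looking at the data, the class labels are 1 and 2. So TF.learn training occasionally sees an out-of-bounds value of 2, which it ignores. As a result it always predicts '1' (this becomes obvious if you print 'prediction' and 'Y_test' after calling predict()). The label value '1' is probably about 40% of the data, hence the 40% accuracy.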
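The effect can be reproduced without TensorFlow. The sketch below uses made-up labels (40% ones and 60% twos, matching the guess above rather than the real CDC data) and a constant prediction of 1 as a stand-in for the stuck model; the accuracy then equals the share of label 1.

```python
import numpy as np

# Hypothetical test labels: 40% are 1, 60% are 2 (not the real CDC data).
y_test = np.array([1] * 40 + [2] * 60)

# Stand-in for a model that always predicts 1.
prediction = np.ones_like(y_test)

# A single distinct predicted value is the giveaway that the model is stuck.
predicted_values = np.unique(prediction)
true_values, true_counts = np.unique(y_test, return_counts=True)

# The accuracy of a constant predictor is the frequency of that class.
accuracy = float(np.mean(prediction == y_test))

print("predicted values:", predicted_values)
print("true values:", true_values, "counts:", true_counts)
print("accuracy:", accuracy)
```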
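The fix is to map the labels to class indices (e.g. label '1' to index 0 and label '2' to index 1). For example, I did this with "Y = Y - 1" after loading the data (though a more general solution would be better for arbitrary values):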
# load data and convert Y into 1d vector
X, Y = data_importer.load_data(500)
print("X", X.shape, "Y", Y.shape)
# FIX: Y/Labels are "1" and "2" for male/female. We should convert
# these to indices into a 2 class space
# (i.e. "1" is index 0, and "2" is index 1).
Y = Y - 1
# train and check the model against the test data for each classifier
iterations = 1
results = {}
for itr in range(iterations):
    # Reshuffle training/testing datasets by sampling randomly 80/20% of the input data
    X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2)
    ...
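For labels that are not simply offset by one, a more general mapping can be sketched with `np.unique(..., return_inverse=True)`, which returns the sorted distinct labels together with, for each sample, the index of its label in that sorted list. The labels 77 and 99 are the illustrative values mentioned earlier, and the variable names are assumptions rather than names from the original script.

```python
import numpy as np

# Hypothetical raw labels with arbitrary values.
Y_raw = np.array([77, 99, 99, 77, 99])

# classes holds the sorted distinct labels; Y holds each sample's class index.
classes, Y = np.unique(Y_raw, return_inverse=True)

# Predictions come back as class indices; index into classes to recover
# the original label values.
predicted_indices = np.array([1, 0])
predicted_labels = classes[predicted_indices]

print(classes, Y, predicted_labels)
```

scikit-learn's `LabelEncoder` (`fit_transform` and `inverse_transform`) covers the same need if a reusable object is preferred.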
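Ideally the two APIs would be compatible, or at least the TF.learn API should document this distinction more clearly. That said, using class indices is arguably more efficient and cleaner for arbitrary classes (e.g. image classes).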