Improving the accuracy of an sklearn model

Date: 2017-12-07 04:15:40

Tags: machine-learning scikit-learn classification decision-tree sklearn-pandas

Decision tree classification gives an accuracy of 0.52, but I want to improve it. How can I increase the accuracy using any of the classification models available in sklearn?

I have tried kNN, a decision tree, and cross-validation, but all of them give low accuracy.

Thanks

import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

#read from the csv file and return a Pandas DataFrame.
nba = pd.read_csv('wine.csv')

# print the column names
original_headers = list(nba.columns.values)
print(original_headers)

#print the first three rows.
print(nba[0:3])

# "quality" is the class attribute we are predicting.
class_column = 'quality'

#The physico-chemical measurements below are used as features;
#the class column itself is excluded.
feature_columns = ['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar', 'chlorides', 'free sulfur dioxide', 'total sulfur dioxide', 'density', 'pH', 'sulphates', 'alcohol']

#Pandas DataFrame allows you to select columns. 
#We use column selection to split the data into features and class. 
nba_feature = nba[feature_columns]
nba_class = nba[class_column]

print(nba_feature[0:3])
print(list(nba_class[0:3]))

train_feature, test_feature, train_class, test_class = \
train_test_split(nba_feature, nba_class, stratify=nba_class, \
train_size=0.75, test_size=0.25)

training_accuracy = []
test_accuracy = []

knn = KNeighborsClassifier(n_neighbors=5, metric='minkowski', p=1)
knn.fit(train_feature, train_class)
prediction = knn.predict(test_feature)
print("Test set predictions:\n{}".format(prediction))
print("Test set accuracy: {:.2f}".format(knn.score(test_feature, test_class)))

train_class_df = pd.DataFrame(train_class,columns=[class_column])     
train_data_df = pd.merge(train_class_df, train_feature, left_index=True, right_index=True)
train_data_df.to_csv('train_data.csv', index=False)

temp_df = pd.DataFrame(test_class,columns=[class_column])
temp_df['Predicted Pos']=pd.Series(prediction, index=temp_df.index)
test_data_df = pd.merge(temp_df, test_feature, left_index=True, right_index=True)
test_data_df.to_csv('test_data.csv', index=False)

tree = DecisionTreeClassifier(max_depth=4, random_state=0)
tree.fit(train_feature, train_class)
print("Training set score: {:.3f}".format(tree.score(train_feature, train_class)))
print("Test set score Decision: {:.3f}".format(tree.score(test_feature, test_class)))

prediction = tree.predict(test_feature)
print("Confusion matrix:")
print(pd.crosstab(test_class, prediction, rownames=['True'], colnames=['Predicted'], margins=True))
tree = DecisionTreeClassifier(max_depth=4, random_state=0)

scores = cross_val_score(tree, train_feature,train_class, cv=10)
print("Cross-validation scores: {}".format(scores))
print("Average cross-validation score: {:.2f}".format(scores.mean()))

2 Answers:

Answer 0 (score: 0)

Usually the next step after a DT is RF (and its relatives) or XGBoost (though that one is not in sklearn). Give them a try. DTs overfit very easily.
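A minimal sketch of the random-forest suggestion above. Since the asker's 'wine.csv' is not available here, a synthetic dataset stands in for it; the parameter choices (n_estimators=200, the data shape) are illustrative assumptions, not tuned values.

```python
# Swap the single decision tree for a random forest (an ensemble of trees),
# which typically overfits far less than one deep tree.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the wine data: 11 features, 3 classes.
X, y = make_classification(n_samples=500, n_features=11, n_informative=6,
                           n_classes=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0)

forest = RandomForestClassifier(n_estimators=200, random_state=0)
forest.fit(X_train, y_train)
print("Forest test accuracy: {:.3f}".format(forest.score(X_test, y_test)))
```

With real data you would fit the forest on train_feature/train_class exactly as the tree was fitted in the question.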

Remove outliers. Check the classes in your dataset: if they are imbalanced, that is likely where most of the errors come from. In that case you need to use class weights when fitting, or evaluate with a metric suited to imbalance (e.g. f1).
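A sketch of both ideas: inspecting the class balance and then fitting with class_weight='balanced' while scoring with f1. The imbalanced synthetic data here is an assumption standing in for the real wine classes.

```python
# Inspect class balance, then fit a weighted tree and score with f1.
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Deliberately imbalanced synthetic data (80% / 20%).
X, y = make_classification(n_samples=500, n_features=11,
                           weights=[0.8, 0.2], random_state=0)
print(pd.Series(y).value_counts())  # reveals the imbalance

X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0)

# class_weight='balanced' reweights samples inversely to class frequency.
tree = DecisionTreeClassifier(max_depth=4, class_weight='balanced',
                              random_state=0)
tree.fit(X_train, y_train)
f1 = f1_score(y_test, tree.predict(X_test))
print("f1 on the minority-aware tree: {:.3f}".format(f1))
```

On the real data, `nba_class.value_counts()` would show whether the wine quality levels are imbalanced.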

You could attach your confusion matrix here - it would be good to see it.

A NN (even the one from sklearn) may also show better results.
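A sketch of the sklearn neural network mentioned above (MLPClassifier), again on a synthetic stand-in dataset; the hidden-layer size and iteration count are illustrative assumptions. Features are scaled first, since MLPs are sensitive to feature magnitudes.

```python
# sklearn's built-in neural network, wrapped in a pipeline with scaling.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=500, n_features=11, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0)

mlp = make_pipeline(
    StandardScaler(),                       # MLPs need scaled inputs
    MLPClassifier(hidden_layer_sizes=(50,), max_iter=1000, random_state=0))
mlp.fit(X_train, y_train)
print("MLP test accuracy: {:.3f}".format(mlp.score(X_test, y_test)))
```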

Answer 1 (score: 0)

Improve the preprocessing.

Methods such as DT and kNN can be sensitive to how the columns are preprocessed. For example, a DT can benefit a lot from carefully chosen thresholds on continuous variables, and kNN distances are distorted when features live on very different scales.
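A sketch of the preprocessing point for kNN: standardizing the features inside a Pipeline so the scaling is re-fit on each cross-validation fold. The synthetic data is a stand-in for the wine features, which span very different ranges (e.g. sulfur dioxide vs. density).

```python
# Scale features before kNN; unscaled, large-magnitude columns dominate
# the distance computation and degrade accuracy.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=500, n_features=11, random_state=0)

pipe = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
scores = cross_val_score(pipe, X, y, cv=5)
print("Scaled kNN CV accuracy: {:.2f}".format(scores.mean()))
```

Using a Pipeline (rather than scaling the whole dataset up front) keeps the test folds untouched by statistics computed on the training folds.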