I have been following Andrew Ng's machine learning course and am trying to reproduce some of the examples in Python with scikit-learn.
I am trying to understand the influence of the regularization parameter C. The problem I keep running into is easiest to visualize with the following:
for c in range(1, 10):
    c = c / 10.
    print(c)
    classifier = LogisticRegression(C=c, max_iter=10000)
    classifier.fit(X_train, y_train)
    y_pred = classifier.predict(X_test)
    print(metrics.accuracy_score(y_test, y_pred))
I expected to see higher accuracy at lower values of C. However, the results I get are a bit underwhelming:
For C = 0.1 .. 0.9 the printed accuracies are:
[0.77653631284916202, 0.77653631284916202, 0.77653631284916202, 0.77653631284916202, 0.77653631284916202, 0.77653631284916202, 0.77653631284916202, 0.77653631284916202, 0.77653631284916202]
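To check that each iteration really trains an independent classifier, one diagnostic (a minimal sketch, assuming the same X_train / y_train as above) is to print the learned coefficient norm per C; with the default L2 penalty the norm should shrink as C decreases, if C is actually being applied:

import numpy as np
from sklearn.linear_model import LogisticRegression

for c in [0.0001, 0.01, 1.0]:
    clf = LogisticRegression(C=c, max_iter=10000)
    clf.fit(X_train, y_train)
    # smaller C means stronger regularization, so the coefficient
    # norm should be smaller for smaller C
    print(c, np.linalg.norm(clf.coef_))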
Running it outside the loop:
classifier = LogisticRegression(C=0.0001, max_iter=10000)
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)
metrics.accuracy_score(y_test, y_pred)
also gives 0.77653631284916202.
Restarting the kernel seems to be the only way to "reset" the classifier. After a restart and reloading the data, running the code above gives the expected higher value: 0.8044692737430168.
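One thing that may matter here: in the full code below, train_test_split is called without a random_state, so every kernel restart draws a new train/test split. A sketch (assuming the same X / y as in the full code below) that pins the split, so accuracies are comparable across restarts:

# with a fixed random_state the split, and hence the accuracy,
# is reproducible across kernel restarts
from sklearn.model_selection import train_test_split  # sklearn.cross_validation on older versions

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)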
Is this expected behavior? Or am I misusing Python / scikit-learn?
I am on OSX 10.13.2. I have tried this in IPython directly from the terminal as well as in (Anaconda) Spyder and an (Anaconda) Jupyter Notebook (Anaconda2-5.0.1 and Anaconda3-5.0.1). All packages are up to date.
Edit: as requested, the complete code is below. train.csv can be downloaded from the Kaggle Titanic dataset.
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
dataset = pd.read_csv('train.csv')
from sklearn import preprocessing
# encode the categorical 'Sex' and 'Embarked' columns as integers
le = preprocessing.LabelEncoder()
dataset['Sex'] = le.fit_transform(dataset['Sex'])
le2 = preprocessing.LabelEncoder()
dataset['Embarked'] = le2.fit_transform(dataset['Embarked'].astype(str))
# remove NaN in the Age column by imputing the mean
imputer = preprocessing.Imputer(missing_values='NaN', strategy='mean', axis=0)
imputer = imputer.fit(dataset['Age'].values.reshape(-1, 1))
dataset['Age'] = imputer.transform(dataset['Age'].values.reshape(-1, 1))
y = dataset.iloc[:, 1].values                  # 'Survived' column
X = dataset.iloc[:, [2,4,5,6,7,9,11]]          # Pclass, Sex, Age, SibSp, Parch, Fare, Embarked
X[0:5]
# Splitting the dataset into the Training set and Test set
from sklearn.cross_validation import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2)
# feature scaling: fit the scaler on the training set only, then apply it to the test set
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
from sklearn.linear_model import LogisticRegression
from sklearn import linear_model
from sklearn import metrics
for c in range(1, 10):
    c = c / 10.
    print(c)
    classifier = LogisticRegression(C=c, random_state=42)
    classifier.fit(X_train, y_train)
    y_pred = classifier.predict(X_test)
    print(metrics.accuracy_score(y_test, y_pred))
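For reference, a variant of the C sweep that averages accuracy over cross-validation folds instead of a single split; this is only a sketch, using the newer sklearn.model_selection API (not the sklearn.cross_validation import above) and a pipeline so the scaler is re-fit inside each fold:

import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

for c in [0.0001, 0.001, 0.01, 0.1, 1.0]:
    model = make_pipeline(StandardScaler(), LogisticRegression(C=c, max_iter=10000))
    # 5-fold cross-validated accuracy on the full X / y built above
    scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')
    print(c, scores.mean())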