Strange behavior of python sklearn.linear_model LogisticRegression in a loop

Time: 2017-12-23 09:56:24

Tags: python machine-learning scikit-learn ipython logistic-regression

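I have been following Andrew Ng's machine learning course and am trying to reproduce some of its examples in Python with scikit-learn.

I am trying to understand the effect of the regularization parameter C. The problem I keep running into is easiest to show with the following loop: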

for c in range(1,10):
    c = c/10.
    print( c )
    classifier = LogisticRegression(C=c , max_iter=10000)
    classifier.fit(X_train, y_train)
    y_pred = classifier.predict(X_test)
    print(metrics.accuracy_score(y_test, y_pred))

I expected to see higher accuracy at lower values of C. However, the results I get look odd, with the same accuracy for every value:

0.1 .. 0.9
[0.77653631284916202, 0.77653631284916202, 0.77653631284916202, 0.77653631284916202, 0.77653631284916202, 0.77653631284916202, 0.77653631284916202, 0.77653631284916202, 0.77653631284916202]
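Accuracy is a coarse measure, so identical scores do not by themselves show that C is being ignored. One way to check is to look at the fitted coefficients instead. The following is a minimal sketch on synthetic data: `make_classification`, the names `X_demo`, `y_demo` and `coef_norms`, and the chosen C values are assumptions for illustration, not the Titanic data from the question.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption: NOT the Titanic features from the question)
X_demo, y_demo = make_classification(
    n_samples=500, n_features=7, n_informative=4, random_state=0
)
X_demo = StandardScaler().fit_transform(X_demo)

coef_norms = {}
for c in [0.0001, 0.01, 1.0, 100.0]:
    clf = LogisticRegression(C=c, max_iter=10000)
    clf.fit(X_demo, y_demo)
    # Smaller C means stronger L2 regularization, which shrinks the coefficients
    coef_norms[c] = float(np.linalg.norm(clf.coef_))
    # Training accuracy is printed only to compare against the coefficient norm
    print(c, coef_norms[c], clf.score(X_demo, y_demo))
```

If the coefficient norm changes with C while the accuracy stays flat, the parameter is being applied and the predictions simply do not cross the decision boundary differently for those values.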

Running it outside the loop:

classifier = LogisticRegression(C=0.0001,max_iter=10000)
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)
metrics.accuracy_score(y_test, y_pred)

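The result is also 0.77653631284916202.

Restarting the kernel seems to be the only way to "reset" the classifier. After restarting and reloading the data, running the code above gives the expected higher value: 0.8044692737430168.

Is this expected behavior? Or am I misusing python / scikit?

I am on OS X 10.13.2. I tried this in IPython directly from the terminal, as well as in (Anaconda) Spyder and (Anaconda) Jupyter Notebook (Anaconda2-5.0.1 and Anaconda3-5.0.1). All packages are up to date.

Edit: as requested, the full code is below. train.csv can be downloaded from the Kaggle Titanic dataset.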

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
dataset = pd.read_csv('train.csv')

from sklearn import preprocessing
le = preprocessing.LabelEncoder()
dataset['Sex'] = le.fit_transform(dataset['Sex'])
le2 = preprocessing.LabelEncoder()
dataset['Embarked'] = le2.fit_transform(dataset['Embarked'].astype(str) )

# Replace NaN in the Age column with the column mean
dataset['Age'] = dataset['Age'].fillna(dataset['Age'].mean())

y = dataset.iloc[:, 1].values
X = dataset.iloc[:, [2,4,5,6,7,9,11]]
X[0:5]

# Splitting the dataset into the Training set and Test set
# (no random_state is passed, so the split is different on every fresh run)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2)
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

from sklearn.linear_model import LogisticRegression
from sklearn import linear_model
from sklearn import metrics
for c in range(1,10):
    c = c/10.
    print( c )
    classifier = LogisticRegression(C=c , random_state = 42)
    classifier.fit(X_train, y_train)
    y_pred = classifier.predict(X_test)
    print(metrics.accuracy_score(y_test, y_pred))
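A possible explanation, not confirmed anywhere in this thread, is that `train_test_split` is called without `random_state`, so each kernel restart draws a new train/test split while every fit within one session reuses the same split. The sketch below compares the two sources of variation on synthetic data. `make_classification`, the helper `accuracy_for`, the sample size and the seeds are all assumptions for illustration, not part of the original code.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with some label noise (assumption: NOT train.csv)
X_demo, y_demo = make_classification(
    n_samples=891, n_features=7, n_informative=4, flip_y=0.2, random_state=0
)

def accuracy_for(seed, c):
    """Hypothetical helper: split with a fixed seed, fit with C=c, return test accuracy."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        X_demo, y_demo, test_size=0.2, random_state=seed
    )
    clf = LogisticRegression(C=c, max_iter=10000).fit(X_tr, y_tr)
    return clf.score(X_te, y_te)

# Same C, different splits: mimics restarting the kernel
by_split = [accuracy_for(seed, 0.5) for seed in range(5)]
# Same split, different C: mimics the loop in the question
by_c = [accuracy_for(0, c / 10.0) for c in range(1, 10)]

print("varying split, C=0.5:  ", by_split)
print("fixed split, varying C:", by_c)
```

If the first list varies more than the second, the differences between sessions come from the split rather than from any state kept by the classifier. Passing a fixed `random_state` to `train_test_split` in the full code would make runs comparable across restarts.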

0 answers:

No answers