Scikit-Learn: one-hot encode before or after train/test split

Date: 2015-07-19 23:24:47

Tags: python-2.7 scikit-learn

I am working through two scenarios for building a model with scikit-learn, and I cannot figure out why one of them returns a result that is so fundamentally different from the other. The only thing that differs between the two cases (that I know of) is that in one case I one-hot encode all the categorical variables in one go (on the whole data) and then split between training and test. In the second case I split between training and test first, and then one-hot encode both sets based on the training data.

The latter case is technically better for judging the generalization error of the process, but the normalized gini it returns is dramatically different (and worse; basically no model) compared to the first case. I know the first case's gini (~0.33) is consistent with a model built on this data.

Why is the second case returning such a different gini? FYI, the dataset contains a mix of numeric and categorical variables.

Method 1 (one-hot encode on the entire data, then split) returns:

from sklearn.cross_validation import StratifiedKFold, KFold, ShuffleSplit, train_test_split, PredefinedSplit
from sklearn.ensemble import RandomForestRegressor, ExtraTreesRegressor, GradientBoostingRegressor
from sklearn.linear_model import LogisticRegression
import numpy as np
import pandas as pd
from sklearn.feature_extraction import DictVectorizer as DV
from sklearn import metrics
from sklearn.preprocessing import StandardScaler
from sklearn.grid_search import GridSearchCV, RandomizedSearchCV
from scipy.stats import randint, uniform
from sklearn.metrics import mean_squared_error
from sklearn.datasets import load_boston


def gini(solution, submission):
    df = zip(solution, submission, range(len(solution)))
    df = sorted(df, key=lambda x: (x[1], -x[2]), reverse=True)
    rand = [float(i + 1) / float(len(df)) for i in range(len(df))]
    totalPos = float(sum([x[0] for x in df]))
    cumPosFound = [df[0][0]]
    for i in range(1, len(df)):
        cumPosFound.append(cumPosFound[len(cumPosFound) - 1] + df[i][0])
    Lorentz = [float(x) / totalPos for x in cumPosFound]
    Gini = [Lorentz[i] - rand[i] for i in range(len(df))]
    return sum(Gini)


def normalized_gini(solution, submission):
    normalized_gini = gini(solution, submission) / gini(solution, solution)
    return normalized_gini


# Normalized Gini scorer
gini_scorer = metrics.make_scorer(normalized_gini, greater_is_better=True)

if __name__ == '__main__':
    dat = pd.read_table('/home/jma/Desktop/Data/Kaggle/liberty/train.csv', sep=",")
    y = dat[['Hazard']].values.ravel()
    dat = dat.drop(['Hazard', 'Id'], axis=1)

    folds = train_test_split(range(len(y)), test_size=0.30, random_state=15)  # 30% test

    # First one-hot encode and make a pandas df
    dat_dict = dat.T.to_dict().values()
    vectorizer = DV(sparse=False)
    vectorizer.fit(dat_dict)
    dat = vectorizer.transform(dat_dict)
    dat = pd.DataFrame(dat)

    train_X = dat.iloc[folds[0], :]
    train_y = y[folds[0]]
    test_X = dat.iloc[folds[1], :]
    test_y = y[folds[1]]

    rf = RandomForestRegressor(n_estimators=1000, n_jobs=1, random_state=15)
    rf.fit(train_X, train_y)
    y_submission = rf.predict(test_X)
    print("Validation Sample Score: {:.10f} (normalized gini).".format(normalized_gini(test_y, y_submission)))

Validation Sample Score: 0.3454355044 (normalized gini).

Method 2 (split first, then one-hot encode) returns:

from sklearn.cross_validation import StratifiedKFold, KFold, ShuffleSplit, train_test_split, PredefinedSplit
from sklearn.ensemble import RandomForestRegressor, ExtraTreesRegressor, GradientBoostingRegressor
from sklearn.linear_model import LogisticRegression
import numpy as np
import pandas as pd
from sklearn.feature_extraction import DictVectorizer as DV
from sklearn import metrics
from sklearn.preprocessing import StandardScaler
from sklearn.grid_search import GridSearchCV, RandomizedSearchCV
from scipy.stats import randint, uniform
from sklearn.metrics import mean_squared_error
from sklearn.datasets import load_boston


def gini(solution, submission):
    df = zip(solution, submission, range(len(solution)))
    df = sorted(df, key=lambda x: (x[1], -x[2]), reverse=True)
    rand = [float(i + 1) / float(len(df)) for i in range(len(df))]
    totalPos = float(sum([x[0] for x in df]))
    cumPosFound = [df[0][0]]
    for i in range(1, len(df)):
        cumPosFound.append(cumPosFound[len(cumPosFound) - 1] + df[i][0])
    Lorentz = [float(x) / totalPos for x in cumPosFound]
    Gini = [Lorentz[i] - rand[i] for i in range(len(df))]
    return sum(Gini)


def normalized_gini(solution, submission):
    normalized_gini = gini(solution, submission) / gini(solution, solution)
    return normalized_gini


# Normalized Gini scorer
gini_scorer = metrics.make_scorer(normalized_gini, greater_is_better=True)

if __name__ == '__main__':
    dat = pd.read_table('/home/jma/Desktop/Data/Kaggle/liberty/train.csv', sep=",")
    y = dat[['Hazard']].values.ravel()
    dat = dat.drop(['Hazard', 'Id'], axis=1)

    folds = train_test_split(range(len(y)), test_size=0.3, random_state=15)  # 30% test

    # First split
    train_X = dat.iloc[folds[0], :]
    train_y = y[folds[0]]
    test_X = dat.iloc[folds[1], :]
    test_y = y[folds[1]]

    # One-hot encode the training X and transform the test X
    dat_dict = train_X.T.to_dict().values()
    vectorizer = DV(sparse=False)
    vectorizer.fit(dat_dict)
    train_X = vectorizer.transform(dat_dict)
    train_X = pd.DataFrame(train_X)

    dat_dict = test_X.T.to_dict().values()
    test_X = vectorizer.transform(dat_dict)
    test_X = pd.DataFrame(test_X)

    rf = RandomForestRegressor(n_estimators=1000, n_jobs=1, random_state=15)
    rf.fit(train_X, train_y)
    y_submission = rf.predict(test_X)
    print("Validation Sample Score: {:.10f} (normalized gini).".format(normalized_gini(test_y, y_submission)))

Validation Sample Score: 0.0055124452 (normalized gini).


2 Answers:

Answer 0 (score: 11):

While the earlier comments correctly suggest that it is best to map over your entire feature space first, in your case both the train and the test contain all of the feature values in all of the columns.

If you compare the vectorizer.vocabulary_ between the two versions, they are exactly the same, so there is no difference in the mapping. Hence, it cannot be causing the problem.
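One way to check this claim yourself (a sketch reusing dat from Method 1 and the raw train_X split from Method 2; DictVectorizer.fit returns the fitted vectorizer, so the calls can be chained):

# Fit one vectorizer on the full data and another on the training split only,
# then compare the learned feature-name -> column-index mappings
v_full = DV(sparse=False).fit(dat.T.to_dict().values())
v_train = DV(sparse=False).fit(train_X.T.to_dict().values())
print(v_full.vocabulary_ == v_train.vocabulary_)  # True for this dataset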

The reason Method 2 fails is that dat_dict gets re-sorted by the original index when you execute this command:

dat_dict=train_X.T.to_dict().values()

In other words, train_X has a shuffled index going into this line of code. When you turn it into a dict, the dict order re-sorts into the numeric order of the original index. This causes your train and test data to become completely de-correlated with y.
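Here is a minimal reproduction of that reordering, using a hypothetical three-row frame (the dict key order shown is specific to CPython 2.x, matching the question's python-2.7 tag):

import pandas as pd

# A frame with a shuffled index, mimicking train_X after train_test_split
df = pd.DataFrame({'a': [10, 20, 30]}, index=[2, 0, 1])

print(df['a'].tolist())         # [10, 20, 30] -- the frame's row order
print(df.T.to_dict().values())  # [{'a': 20}, {'a': 30}, {'a': 10}]

# After .T the original index labels become the dict keys; small integer keys
# iterate in numeric order under CPython 2.x, so the rows come back as index
# 0, 1, 2 rather than 2, 0, 1 -- no longer aligned with train_y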

Method 1 does not suffer from this problem, because you shuffle the data after the mapping.

You can fix the issue by adding a .reset_index() wherever you assign dat_dict in Method 2, e.g.:

dat_dict=train_X.reset_index(drop=True).T.to_dict().values()

This ensures the data order is preserved when converting to a dict.
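Applied to both splits of Method 2, the fix looks like this (a sketch reusing the question's variable names):

# One-hot encode the training X and transform the test X, resetting the
# shuffled indices first so the dict rows stay aligned with train_y/test_y
dat_dict = train_X.reset_index(drop=True).T.to_dict().values()
vectorizer = DV(sparse=False)
vectorizer.fit(dat_dict)
train_X = pd.DataFrame(vectorizer.transform(dat_dict))

dat_dict = test_X.reset_index(drop=True).T.to_dict().values()
test_X = pd.DataFrame(vectorizer.transform(dat_dict))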

With that piece of code added, I get the following results:
- Method 1: Validation Sample Score: 0.3454355044 (normalized gini)
- Method 2: Validation Sample Score: 0.3438430991 (normalized gini)

Answer 1 (score: 3):

I am not able to run your code, but my guess is that in the test dataset,

  • you are not seeing all the levels of some of your categorical variables, and hence, if you calculate your dummy variables just on this data, you will actually end up with different columns (see the sketch after this list);
  • otherwise, maybe you have the same columns but in a different order?
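Here is a minimal sketch of the first bullet, using pd.get_dummies on a hypothetical toy column rather than the question's DictVectorizer:

import pandas as pd

train = pd.DataFrame({'color': ['red', 'blue', 'red']})
test = pd.DataFrame({'color': ['red', 'green']})  # 'green' never appears in train

# Encoding each split independently yields mismatched columns
print(pd.get_dummies(train).columns.tolist())  # ['color_blue', 'color_red']
print(pd.get_dummies(test).columns.tolist())   # ['color_green', 'color_red']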