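Background:
Binary classification of train.csv (https://www.kaggle.com/c/titanic/data) using LogisticRegression.
"train.csv" is the Titanic passenger list CSV file.
The label is "Survived".
Split: the first 100 rows are the test set; the remaining rows are the training set.
Problem:
1st: when I use a parameter grid: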
val paramGrid = new ParamGridBuilder()
  .addGrid(lr.regParam, Array(0.1, 0.01))
  .addGrid(lr.threshold, Array(0.2, 0.3, 0.35, 0.4))
  .build()
Result log:
training set:
marked "Survived" of [Prediction/Label]'s Count : 273 / 259
marked "Death" of [Prediction/Label]'s Count : 363 / 377
Accuracy is : 97.79874213836479% (622 / 636)
test set:
marked "Survived" of [Prediction/Label]'s Count : 33 / 31
marked "Death" of [Prediction/Label]'s Count : 45 / 47
Accuracy is : 97.43589743589743% (76 / 78)
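The counts in this log pin down what kind of error is being made. Below is a minimal sketch of the arithmetic, using only the numbers printed above (the names `Run` and `LogCheck` are made up for illustration). False positives minus false negatives equals predicted minus labeled survivors, and false positives plus false negatives equals the number of misclassified rows, so both can be solved for.

```scala
// Counts copied from the log of the first run (with the parameter grid).
final case class Run(predictedSurvived: Int, labeledSurvived: Int, correct: Int, total: Int) {
  // Rows where prediction and label disagree.
  val errors: Int = total - correct
  // FP - FN = predictedSurvived - labeledSurvived and FP + FN = errors.
  val falsePositives: Int = (errors + (predictedSurvived - labeledSurvived)) / 2
  val falseNegatives: Int = errors - falsePositives
  val accuracyPct: Double = 100.0 * correct / total
}

object LogCheck {
  val train: Run = Run(predictedSurvived = 273, labeledSurvived = 259, correct = 622, total = 636)
  val test: Run = Run(predictedSurvived = 33, labeledSurvived = 31, correct = 76, total = 78)

  def main(args: Array[String]): Unit = {
    for ((name, run) <- Seq("training" -> train, "test" -> test)) {
      println(s"$name: accuracy=${run.accuracyPct}% FP=${run.falsePositives} FN=${run.falseNegatives}")
    }
  }
}
```

For the training log this gives 14 false positives and 0 false negatives, and for the test log 2 and 0, so every error in this run is a passenger predicted as "Survived" who did not survive.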
2nd: without a parameter grid:
// Single-value grid so the rest of the code stays unchanged; effectively no parameter grid.
val paramGrid = new ParamGridBuilder()
  .addGrid(lr.regParam, Array(0.1))
  .addGrid(lr.threshold, Array(0.4))
  .build()
Result log:
training set:
marked "Survived" of [Prediction/Label]'s Count : 259 / 259
marked "Death" of [Prediction/Label]'s Count : 377 / 377
Accuracy is : 100.0% (636 / 636)
test set:
marked "Survived" of [Prediction/Label]'s Count : 31 / 31
marked "Death" of [Prediction/Label]'s Count : 47 / 47
Accuracy is : 100.0% (78 / 78)
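The parameter grid in the first run contains the single parameter values used in the second run, so Spark's model selection should have picked the 100%-accuracy model. Instead, the 97%-accuracy model seems to have been selected.
If a 99% model existed and a 98% model were selected, that would be fine, because the evaluation metric used for model selection may not be accuracy. But when the model that was not selected has 100% accuracy, I think the story is different. For classification, I think 100% accuracy means that the F1 score, accuracy, confusion matrix, and other evaluations are all "perfect", and here it is 100% on the test set as well. So I can't understand why the 100% model is not selected.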
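One thing worth checking here: `BinaryClassificationEvaluator` defaults to the areaUnderROC metric, which is computed from the raw prediction column and so does not depend on `lr.threshold`. The following is a minimal plain-Scala sketch of that effect. The scores and labels are hypothetical, not taken from the Titanic data, and `AucVsThreshold` is an illustrative name. It only shows that a model can rank every example perfectly (AUC of 1.0) while its accuracy changes with the threshold.

```scala
object AucVsThreshold {
  // Hypothetical (predicted probability, label) pairs, for illustration only.
  val scored: Seq[(Double, Int)] = Seq(
    (0.90, 1), (0.60, 1), (0.45, 1),
    (0.36, 0), (0.25, 0), (0.10, 0)
  )

  // Area under the ROC curve as the fraction of (positive, negative) pairs
  // in which the positive example gets the higher score; ties count as half.
  def auc(data: Seq[(Double, Int)]): Double = {
    val pos = data.collect { case (score, 1) => score }
    val neg = data.collect { case (score, 0) => score }
    val wins = for (p <- pos; n <- neg) yield {
      if (p > n) 1.0 else if (p == n) 0.5 else 0.0
    }
    wins.sum / (pos.size * neg.size)
  }

  // Accuracy when predicting 1 for every probability above the threshold.
  def accuracy(data: Seq[(Double, Int)], threshold: Double): Double = {
    val correct = data.count { case (score, label) =>
      (if (score > threshold) 1 else 0) == label
    }
    correct.toDouble / data.size
  }

  def main(args: Array[String]): Unit = {
    println(s"AUC = ${auc(scored)}")
    // Same threshold values as in the parameter grid of the first run.
    for (t <- Seq(0.2, 0.3, 0.35, 0.4)) {
      println(s"threshold=$t accuracy=${accuracy(scored, t)}")
    }
  }
}
```

In this toy data the AUC is the same for all four thresholds, while only the threshold of 0.4 classifies every row correctly. If the same thing happens in the cross-validation, candidates that differ only in threshold would get the same score from the evaluator.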
Code:
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

// df_for_columns is the preprocessed DataFrame (some columns dropped; rows with age 0 dropped)
val features = df_for_columns.columns
val lr = new LogisticRegression()
  .setMaxIter(100)
  .setFeaturesCol("features")
val assembler = new VectorAssembler().setInputCols(features).setOutputCol("features")
val pipeline = new Pipeline().setStages(Array(assembler, lr))
// cross-validation
val cv = new CrossValidator()
  .setEstimator(pipeline)
  .setEvaluator(new BinaryClassificationEvaluator) // BinaryClassificationEvaluator with default settings
  .setEstimatorParamMaps(paramGrid) // paramGrid is built by the code above
  .setNumFolds(10)
// df_training is the training set
val lrModel = cv.fit(df_training)
val bmodel = lrModel.bestModel
val result = bmodel.transform(df_training)
val result_test = bmodel.transform(df_test)
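To see which candidate the cross-validation actually scored highest, the fitted `CrossValidatorModel` exposes the fold-averaged metric for each parameter map. The following is a sketch, assuming spark-mllib 2.2 is on the classpath; `CvInspection` and `metricsByParams` are illustrative names, not part of Spark.

```scala
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.tuning.CrossValidatorModel

object CvInspection {
  // Name of the metric a BinaryClassificationEvaluator uses when none is set.
  def defaultMetric: String = new BinaryClassificationEvaluator().getMetricName

  // Pairs each candidate parameter map (rendered as text) with its metric
  // averaged over the folds, in the order the candidates were evaluated.
  def metricsByParams(cvModel: CrossValidatorModel): Seq[(String, Double)] =
    cvModel.getEstimatorParamMaps
      .zip(cvModel.avgMetrics)
      .map { case (params, metric) => (params.toString, metric) }
      .toSeq
}
```

With the model fitted above, `CvInspection.metricsByParams(lrModel).foreach(println)` would list one averaged metric per grid entry. These values are what the selection is based on, not the accuracy on the full training set.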
My environment:
IntelliJ (U)
Win7 x64
Spark 2.2.0
Scala 2.11.5
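ML settings:
The features used are:
1) pclass (ticket class: 1 = 1st, 2 = 2nd, 3 = 3rd)
2) age
3) sibsp (number of siblings/spouses aboard the Titanic)
4) parch (number of parents/children aboard the Titanic)
5) fare (passenger fare)
and the label is "Survived".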