I am trying to run a simple logistic regression program in Spark. I got the error below; I tried including various libraries to solve the problem, but that did not fix it.
java.lang.IllegalArgumentException: requirement failed: Column pmi must be of type org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7 but was actually DoubleType.
This is my dataset CSV:
abc,pmi,sv,h,rh,label
0,4.267034,5,1.618187,5.213683,T
0,4.533071,24,3.540976,5.010458,F
0,6.357766,7,0.440152,5.592032,T
0,4.694365,1,0,6.953864,T
0,3.099447,2,0.994779,7.219463,F
0,1.482493,20,3.221419,7.219463,T
0,4.886681,4,0.919705,5.213683,F
0,1.515939,20,3.92588,6.329699,T
0,2.756057,9,2.841345,6.727063,T
0,3.341671,13,3.022361,5.601656,F
0,4.509981,7,1.538982,6.716471,T
0,4.039118,17,3.206316,6.392757,F
0,3.862023,16,3.268327,4.080564,F
0,5.026574,1,0,6.254859,T
0,3.186627,19,1.880978,8.466048,T
1,6.036507,8,1.376031,4.080564,F
1,5.026574,1,0,6.254859,T
1,-0.936022,23,2.78176,5.601656,F
1,6.435599,3,1.298795,3.408575,T
1,4.769222,3,1.251629,7.201824,F
1,3.190702,20,3.294354,6.716471,F
This is the edited code:
import java.io.IOException;

import org.apache.spark.ml.classification.LogisticRegression;
import org.apache.spark.ml.classification.LogisticRegressionModel;
import org.apache.spark.ml.feature.VectorAssembler;
import org.apache.spark.ml.linalg.VectorUDT;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SQLContext;
import org.apache.spark.sql.SparkSession;

public class Sp_LogistcRegression {
    public void trainLogisticregression(String path, String model_path) throws IOException {
        // SparkConf conf = new SparkConf().setAppName("Linear Regression Example");
        // JavaSparkContext sc = new JavaSparkContext(conf);
        SparkSession spark = SparkSession.builder()
                .appName("Sp_LogistcRegression")
                .master("local[6]")
                .config("spark.driver.memory", "3G")
                .getOrCreate();

        Dataset<Row> training = spark
                .read()
                .option("header", "true")
                .option("inferSchema", "true")
                .csv(path);

        String[] myStrings = {"abc", "pmi", "sv", "h", "rh", "label"};
        VectorAssembler VA = new VectorAssembler().setInputCols(myStrings).setOutputCol("label");
        Dataset<Row> transform = VA.transform(training);

        LogisticRegression lr = new LogisticRegression().setMaxIter(1000).setRegParam(0.3);
        LogisticRegressionModel lrModel = lr.fit(transform);
        lrModel.save(model_path);
        spark.close();
    }
}
This is the test:
import java.io.File;
import java.io.IOException;

import org.junit.Test;

public class Sp_LogistcRegressionTest {
    Sp_LogistcRegression spl = new Sp_LogistcRegression();

    @Test
    public void test() throws IOException {
        String filename = "datas/seg-large.csv";
        ClassLoader classLoader = getClass().getClassLoader();
        File file1 = new File(classLoader.getResource(filename).getFile());
        spl.trainLogisticregression(file1.getAbsolutePath(), "/tmp");
    }
}
UPDATE: Following your suggestion, I removed the string-valued attribute, i.e. label, from the dataset. Now I get the following error:
java.lang.IllegalArgumentException: Field "features" does not exist.
at org.apache.spark.sql.types.StructType$$anonfun$apply$1.apply(StructType.scala:264)
at org.apache.spark.sql.types.StructType$$anonfun$apply$1.apply(StructType.scala:264)
at scala.collection.MapLike$class.getOrElse(MapLike.scala:128)
at scala.collection.AbstractMap.getOrElse(Map.scala:58)
at org.apache.spark.sql.types.StructType.apply(StructType.scala:263)
at org.apache.spark.ml.util.SchemaUtils$.checkColumnType(SchemaUtils.scala:40)
at org.apache.spark.ml.PredictorParams$class.validateAndTransformSchema(Predictor.scala:51)
Answer (score: 1):
TL;DR Use the VectorAssembler transformer.
Spark MLlib's LogisticRegression requires the features column to be of type VectorUDT (as the error message says).
In your Spark application, you read the dataset from a CSV file, and the fields you use for features are of different types.
Note that I am describing what Spark MLlib expects here, not necessarily what machine learning as a field of study would recommend in this case.
My recommendation is to use a transformer that maps the columns to match the requirements of LogisticRegression.
A quick review of the known transformers in Spark MLlib 2.1.1 gives me VectorAssembler.

A feature transformer that merges multiple columns into a vector column.

That is exactly what you need.
(I use Scala; rewriting the code to Java is left to you as a home exercise.)
val training: DataFrame = ...
// the following are to show that we're on the same page
val lr = new LogisticRegression().setFeaturesCol("pmi")
scala> lr.fit(training)
java.lang.IllegalArgumentException: requirement failed: Column pmi must be of type org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7 but was actually IntegerType.
at scala.Predef$.require(Predef.scala:224)
at org.apache.spark.ml.util.SchemaUtils$.checkColumnType(SchemaUtils.scala:42)
at org.apache.spark.ml.PredictorParams$class.validateAndTransformSchema(Predictor.scala:51)
at org.apache.spark.ml.classification.Classifier.org$apache$spark$ml$classification$ClassifierParams$$super$validateAndTransformSchema(Classifier.scala:58)
at org.apache.spark.ml.classification.ClassifierParams$class.validateAndTransformSchema(Classifier.scala:42)
at org.apache.spark.ml.classification.ProbabilisticClassifier.org$apache$spark$ml$classification$ProbabilisticClassifierParams$$super$validateAndTransformSchema(ProbabilisticClassifier.scala:53)
at org.apache.spark.ml.classification.ProbabilisticClassifierParams$class.validateAndTransformSchema(ProbabilisticClassifier.scala:37)
at org.apache.spark.ml.classification.LogisticRegression.org$apache$spark$ml$classification$LogisticRegressionParams$$super$validateAndTransformSchema(LogisticRegression.scala:278)
at org.apache.spark.ml.classification.LogisticRegressionParams$class.validateAndTransformSchema(LogisticRegression.scala:265)
at org.apache.spark.ml.classification.LogisticRegression.validateAndTransformSchema(LogisticRegression.scala:278)
at org.apache.spark.ml.Predictor.transformSchema(Predictor.scala:144)
at org.apache.spark.ml.PipelineStage.transformSchema(Pipeline.scala:74)
at org.apache.spark.ml.Predictor.fit(Predictor.scala:100)
... 48 elided
"Houston, we've had a problem." Let's first fix it using VectorAssembler.
import org.apache.spark.ml.feature.VectorAssembler

val vecAssembler = new VectorAssembler().
  setInputCols(Array("pmi")).
  setOutputCol("features")
val features = vecAssembler.transform(training)
scala> features.show
+---+--------+
|pmi|features|
+---+--------+
| 5| [5.0]|
| 24| [24.0]|
+---+--------+
scala> features.printSchema
root
 |-- pmi: integer (nullable = true)
 |-- features: vector (nullable = true)
Whoohoo! We have a features column of type vector! Are we done?
Yes, but in my case, as I was experimenting with spark-shell, it did not work right away, because lr was still using the wrong pmi column (i.e. one of the incorrect type).
scala> lr.fit(features)
java.lang.IllegalArgumentException: requirement failed: Column pmi must be of type org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7 but was actually IntegerType.
at scala.Predef$.require(Predef.scala:224)
at org.apache.spark.ml.util.SchemaUtils$.checkColumnType(SchemaUtils.scala:42)
at org.apache.spark.ml.PredictorParams$class.validateAndTransformSchema(Predictor.scala:51)
at org.apache.spark.ml.classification.Classifier.org$apache$spark$ml$classification$ClassifierParams$$super$validateAndTransformSchema(Classifier.scala:58)
at org.apache.spark.ml.classification.ClassifierParams$class.validateAndTransformSchema(Classifier.scala:42)
at org.apache.spark.ml.classification.ProbabilisticClassifier.org$apache$spark$ml$classification$ProbabilisticClassifierParams$$super$validateAndTransformSchema(ProbabilisticClassifier.scala:53)
at org.apache.spark.ml.classification.ProbabilisticClassifierParams$class.validateAndTransformSchema(ProbabilisticClassifier.scala:37)
at org.apache.spark.ml.classification.LogisticRegression.org$apache$spark$ml$classification$LogisticRegressionParams$$super$validateAndTransformSchema(LogisticRegression.scala:278)
at org.apache.spark.ml.classification.LogisticRegressionParams$class.validateAndTransformSchema(LogisticRegression.scala:265)
at org.apache.spark.ml.classification.LogisticRegression.validateAndTransformSchema(LogisticRegression.scala:278)
at org.apache.spark.ml.Predictor.transformSchema(Predictor.scala:144)
at org.apache.spark.ml.PipelineStage.transformSchema(Pipeline.scala:74)
at org.apache.spark.ml.Predictor.fit(Predictor.scala:100)
... 48 elided
Let's fix lr to use the features column. Note that features is the default column name, so I simply create a new instance of LogisticRegression (I could also have used setFeaturesCol).
val lr = new LogisticRegression()
// it works but I've got no label column (with 0s and 1s and hence the issue)
// the main issue was fixed though, wasn't it?
scala> lr.fit(features)
java.lang.IllegalArgumentException: Field "label" does not exist.
at org.apache.spark.sql.types.StructType$$anonfun$apply$1.apply(StructType.scala:266)
at org.apache.spark.sql.types.StructType$$anonfun$apply$1.apply(StructType.scala:266)
at scala.collection.MapLike$class.getOrElse(MapLike.scala:128)
at scala.collection.AbstractMap.getOrElse(Map.scala:59)
at org.apache.spark.sql.types.StructType.apply(StructType.scala:265)
at org.apache.spark.ml.util.SchemaUtils$.checkNumericType(SchemaUtils.scala:71)
at org.apache.spark.ml.PredictorParams$class.validateAndTransformSchema(Predictor.scala:53)
at org.apache.spark.ml.classification.Classifier.org$apache$spark$ml$classification$ClassifierParams$$super$validateAndTransformSchema(Classifier.scala:58)
at org.apache.spark.ml.classification.ClassifierParams$class.validateAndTransformSchema(Classifier.scala:42)
at org.apache.spark.ml.classification.ProbabilisticClassifier.org$apache$spark$ml$classification$ProbabilisticClassifierParams$$super$validateAndTransformSchema(ProbabilisticClassifier.scala:53)
at org.apache.spark.ml.classification.ProbabilisticClassifierParams$class.validateAndTransformSchema(ProbabilisticClassifier.scala:37)
at org.apache.spark.ml.classification.LogisticRegression.org$apache$spark$ml$classification$LogisticRegressionParams$$super$validateAndTransformSchema(LogisticRegression.scala:278)
at org.apache.spark.ml.classification.LogisticRegressionParams$class.validateAndTransformSchema(LogisticRegression.scala:265)
at org.apache.spark.ml.classification.LogisticRegression.validateAndTransformSchema(LogisticRegression.scala:278)
at org.apache.spark.ml.Predictor.transformSchema(Predictor.scala:144)
at org.apache.spark.ml.PipelineStage.transformSchema(Pipeline.scala:74)
at org.apache.spark.ml.Predictor.fit(Predictor.scala:100)
... 48 elided
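As a side note (a sketch of my own, not part of the original answer): one way to get the missing numeric label column from this dataset's T/F strings is a simple conditional mapping; Spark MLlib's StringIndexer would be an alternative.

import org.apache.spark.sql.functions.{col, when}

// Sketch, assuming the training DataFrame still carries the original
// string "label" column with T/F values: overwrite it with 1.0/0.0 so
// LogisticRegression finds a numeric label.
val labeled = training.withColumn(
  "label", when(col("label") === "T", 1.0).otherwise(0.0))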
After the first version of the question was updated, yet another issue showed up.
scala> va.transform(training)
java.lang.IllegalArgumentException: Data type StringType is not supported.
at org.apache.spark.ml.feature.VectorAssembler$$anonfun$transformSchema$1.apply(VectorAssembler.scala:121)
at org.apache.spark.ml.feature.VectorAssembler$$anonfun$transformSchema$1.apply(VectorAssembler.scala:117)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
at org.apache.spark.ml.feature.VectorAssembler.transformSchema(VectorAssembler.scala:117)
at org.apache.spark.ml.PipelineStage.transformSchema(Pipeline.scala:74)
at org.apache.spark.ml.feature.VectorAssembler.transform(VectorAssembler.scala:54)
... 48 elided
The reason is that VectorAssembler accepts the following input column types: all numeric types, boolean type, and vector type. That means one of the columns used for VectorAssembler is of StringType.

In your case that column is label, since it is of StringType. Have a look at the schema.
scala> training.printSchema
root
 |-- abc: integer (nullable = true)
 |-- pmi: double (nullable = true)
 |-- sv: integer (nullable = true)
 |-- h: double (nullable = true)
 |-- rh: double (nullable = true)
 |-- label: string (nullable = true)
Remove it from the columns to use with VectorAssembler and the error goes away.

However, if this or any other column should be included but has an incorrect type, you have to cast it appropriately first (provided the values the column holds permit it). Use the cast method.
cast(to: String): Column
Casts the column to a different data type, using the canonical string representation of the type. The supported types are: string, boolean, byte, short, int, long, float, double, decimal, date, timestamp.
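For illustration (again a sketch of mine, not from the original answer): casting a numeric column with cast could look like the following; note that a plain cast of the T/F label strings would only produce nulls, which is why the label needs the mapping shown earlier.

import org.apache.spark.sql.functions.col

// Sketch: cast the integer column "sv" to double using the canonical
// string name of the type; values that cannot be parsed become null.
val casted = training.withColumn("sv", col("sv").cast("double"))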
The error message should include the name of the column(s), but currently it does not, so I filed SPARK-21285 (VectorAssembler should report the column name when an unsupported data type is used: https://issues.apache.org/jira/browse/SPARK-21285) to have it fixed. Vote for it if you think it is worth having in an upcoming Spark version.
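Putting it all together, a minimal end-to-end sketch in Scala (my assumption of the final shape; the column names come from the question's CSV, while spark, path and the label mapping are hypothetical):

import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.functions.{col, when}

// Read the CSV, turn the T/F strings into a numeric 0/1 label,
// assemble only the numeric columns into a "features" vector,
// and fit the model on the default "features"/"label" columns.
val training = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv(path)
  .withColumn("label", when(col("label") === "T", 1.0).otherwise(0.0))

val assembler = new VectorAssembler()
  .setInputCols(Array("abc", "pmi", "sv", "h", "rh"))
  .setOutputCol("features")

val lrModel = new LogisticRegression()
  .setMaxIter(1000)
  .setRegParam(0.3)
  .fit(assembler.transform(training))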