I have the example code below. I'm trying to set up an ML pipeline in Scala; my goal is a simple linear regression. When I try to run the assembler with my list of features, I get the message shown below. The features I'm using are all floats, with no missing values. Sample data is included below as well. I'm still fairly new to Scala and I'd like to know what the problem is. Does the assembler have trouble with floats? I'm using Spark 2.3.0.
Code:
import org.apache.spark.ml.evaluation.RegressionEvaluator
import org.apache.spark.ml.regression.LinearRegression
import org.apache.spark.ml.tuning.{ParamGridBuilder, TrainValidationSplit}
// To see fewer warnings
import org.apache.log4j._
Logger.getLogger("org").setLevel(Level.ERROR)
// Start a simple Spark Session
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder().getOrCreate()
// Prepare training and test data.
// val data = spark.read.option("header","true").option("inferSchema","true").format("csv").load("USA_Housing.csv")
val data = spark.read.option("header","true").option("inferSchema","true").format("csv").load("/Users/sshields/Desktop/stuff/udemy/spark/spark-for-big-data/Scala-and-Spark-Bootcamp-master/Machine_Learning_Sections/Regression/USA_Housing.csv")
// Check out the Data
data.printSchema()
// See an example of what the data looks like
// by printing out a Row
val colnames = data.columns
val firstrow = data.head(1)(0)
println("\n")
println("Example Data Row")
for(ind <- Range(1, colnames.length)){
  println(colnames(ind))
  println(firstrow(ind))
  println("\n")
}
////////////////////////////////////////////////////
//// Setting Up DataFrame for Machine Learning ////
//////////////////////////////////////////////////
// A few things we need to do before Spark can accept the data!
// It needs to be in the form of two columns
// ("label","features")
// This will allow us to join multiple feature columns
// into a single column of an array of feature values
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.linalg.Vectors
// Rename Price to label column for naming convention.
// Grab only numerical columns from the data
val df = data.select(data("Price").as("label"),$"Avg Area Income",$"Avg Area House Age",$"Avg Area Number of Rooms",$"Area Population")
// An assembler converts the input values to a vector
// A vector is what the ML algorithm reads to train a model
// Set the input columns from which we are supposed to read the values
// Set the name of the column where the vector will be stored
val assembler = new VectorAssembler().setInputCols(Array("Avg Area Income","Avg Area House Age","Avg Area Number of Rooms","Area Population")).setOutputCol("features")
// Use the assembler to transform our DataFrame to the two columns
val output = assembler.transform(df).select($"label",$"features")
Data:
Avg Area Income Avg Area House Age Avg Area Number of Rooms \
0 79545.458574 5.682861 7.009188
1 79248.642455 6.002900 6.730821
2 61287.067179 5.865890 8.512727
3 63345.240046 7.188236 5.586729
4 59982.197226 5.040555 7.839388
Avg Area Number of Bedrooms Area Population Price \
0 4.09 23086.800503 1.059034e+06
1 3.09 40173.072174 1.505891e+06
2 5.13 36882.159400 1.058988e+06
3 3.26 34310.242831 1.260617e+06
4 4.23 26354.109472 6.309435e+05
Address
0 208 Michael Ferry Apt. 674\nLaurabury, NE 3701...
1 188 Johnson Views Suite 079\nLake Kathleen, CA...
2 9127 Elizabeth Stravenue\nDanieltown, WI 06482...
3 USS Barnett\nFPO AP 44820
4 USNS Raymond\nFPO AE 09386
Error:
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.linalg.Vectors
df: org.apache.spark.sql.DataFrame = [label: double, Avg Area Income: string ... 3 more fields]
assembler: org.apache.spark.ml.feature.VectorAssembler = vecAssembler_3e70ff1660b1
java.lang.IllegalArgumentException: Data type StringType of column Avg Area Income is not supported.
Data type StringType of column Avg Area House Age is not supported.
at org.apache.spark.ml.feature.VectorAssembler.transformSchema(VectorAssembler.scala:124)
at org.apache.spark.ml.PipelineStage.transformSchema(Pipeline.scala:74)
at org.apache.spark.ml.feature.VectorAssembler.transform(VectorAssembler.scala:54)
... 121 elided
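For context, the exception (and the `df: ... Avg Area Income: string` echo above it) says that Spark's schema inference read the feature columns as strings rather than doubles, and VectorAssembler only accepts numeric, boolean, or vector input columns. A minimal sketch of one possible workaround, explicitly casting the feature columns to DoubleType before assembling (the column names are the ones from the code above; whether casting is the right fix for this CSV is an assumption):

    import org.apache.spark.sql.types.DoubleType
    import spark.implicits._

    // Cast the string-typed feature columns to doubles before assembling.
    // Values that fail to parse become null, so drop those rows afterwards.
    val numericDf = df
      .withColumn("Avg Area Income", $"Avg Area Income".cast(DoubleType))
      .withColumn("Avg Area House Age", $"Avg Area House Age".cast(DoubleType))
      .withColumn("Avg Area Number of Rooms", $"Avg Area Number of Rooms".cast(DoubleType))
      .withColumn("Area Population", $"Area Population".cast(DoubleType))
      .na.drop()

    val output = assembler.transform(numericDf).select($"label", $"features")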