Assembler treats floats as strings

Time: 2018-11-26 19:58:26

Tags: scala apache-spark

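I have the sample code below. I am trying to build an ML pipeline in Scala, and my goal is a simple linear regression. When I run the VectorAssembler on my list of features, I get the error shown further down. All the features I am using are floats with no missing values. Sample data is included below. I am still quite new to Scala and would like to know what the problem is. Does the assembler have trouble with floats? I am using Spark 2.3.0.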

Code:

import org.apache.spark.ml.evaluation.RegressionEvaluator
import org.apache.spark.ml.regression.LinearRegression
import org.apache.spark.ml.tuning.{ParamGridBuilder, TrainValidationSplit}

// To see fewer warnings
import org.apache.log4j._
Logger.getLogger("org").setLevel(Level.ERROR)


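// Start a simple Spark Session
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder().getOrCreate()
// Needed for the $"column" syntax outside spark-shell (spark-shell imports it automatically)
import spark.implicits._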

// Prepare training and test data.
// val data = spark.read.option("header","true").option("inferSchema","true").format("csv").load("USA_Housing.csv")

val data = spark.read.option("header","true").option("inferSchema","true").format("csv").load("/Users/sshields/Desktop/stuff/udemy/spark/spark-for-big-data/Scala-and-Spark-Bootcamp-master/Machine_Learning_Sections/Regression/USA_Housing.csv")

// Check out the Data
data.printSchema()

// See an example of what the data looks like
// by printing out a Row
val colnames = data.columns
val firstrow = data.head(1)(0)
println("\n")
println("Example Data Row")
for (ind <- colnames.indices) { // start at index 0 so the first column is printed too
  println(colnames(ind))
  println(firstrow(ind))
  println("\n")
}

////////////////////////////////////////////////////
//// Setting Up DataFrame for Machine Learning ////
//////////////////////////////////////////////////

// A few things we need to do before Spark can accept the data!
// It needs to be in the form of two columns
// ("label","features")

// This will allow us to join multiple feature columns
// into a single column holding a vector of feature values
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.linalg.Vectors

// Rename Price to label column for naming convention.
// Grab only numerical columns from the data
val df = data.select(data("Price").as("label"),$"Avg Area Income",$"Avg Area House Age",$"Avg Area Number of Rooms",$"Area Population")

// An assembler converts the input values to a vector
// A vector is what the ML algorithm reads to train a model

// Set the input columns from which we are supposed to read the values
// Set the name of the column where the vector will be stored
val assembler = new VectorAssembler().setInputCols(Array("Avg Area Income","Avg Area House Age","Avg Area Number of Rooms","Area Population")).setOutputCol("features")

// Use the assembler to transform our DataFrame to the two columns
val output = assembler.transform(df).select($"label",$"features")

Data:

Avg Area Income  Avg Area House Age  Avg Area Number of Rooms  \
0     79545.458574            5.682861                  7.009188   
1     79248.642455            6.002900                  6.730821   
2     61287.067179            5.865890                  8.512727   
3     63345.240046            7.188236                  5.586729   
4     59982.197226            5.040555                  7.839388   

   Avg Area Number of Bedrooms  Area Population         Price  \
0                         4.09     23086.800503  1.059034e+06   
1                         3.09     40173.072174  1.505891e+06   
2                         5.13     36882.159400  1.058988e+06   
3                         3.26     34310.242831  1.260617e+06   
4                         4.23     26354.109472  6.309435e+05   

                                             Address  
0  208 Michael Ferry Apt. 674\nLaurabury, NE 3701...  
1  188 Johnson Views Suite 079\nLake Kathleen, CA...  
2  9127 Elizabeth Stravenue\nDanieltown, WI 06482...  
3                          USS Barnett\nFPO AP 44820  
4                         USNS Raymond\nFPO AE 09386  
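
Note that the Address values in this sample contain embedded line breaks (displayed as \n). If those are literal newlines inside quoted fields of USA_Housing.csv (an assumption, since the raw file is not shown here), Spark's CSV reader would by default treat each physical line as a separate record, so address fragments could end up in the leading columns and push inferSchema towards string. A minimal sketch of reading such a file with the multiLine option follows. The object name, the helper names and the two sample rows (modelled on rows 3 and 4 above, with prices as rounded in the display) are illustrative only.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.Files

import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical helper object, not part of the original question.
object MultiLineCsvSketch {

  // Writes a tiny illustrative CSV whose quoted Address field spans two physical lines
  // and returns its location as a file: URI string.
  def writeSample(): String = {
    val csv =
      "Avg Area Income,Avg Area House Age,Avg Area Number of Rooms," +
        "Avg Area Number of Bedrooms,Area Population,Price,Address\n" +
        "63345.240046,7.188236,5.586729,3.26,34310.242831,1260617.0,\"USS Barnett\nFPO AP 44820\"\n" +
        "59982.197226,5.040555,7.839388,4.23,26354.109472,630943.5,\"USNS Raymond\nFPO AE 09386\"\n"
    val tmp = Files.createTempFile("housing-sample", ".csv")
    Files.write(tmp, csv.getBytes(StandardCharsets.UTF_8))
    tmp.toUri.toString
  }

  // Reads the CSV with header and schema inference; multiLine = true lets a quoted
  // field that contains line breaks stay inside a single record.
  def read(spark: SparkSession, path: String, multiLine: Boolean): DataFrame =
    spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .option("multiLine", multiLine)
      .csv(path)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("multiline-csv-sketch")
      .getOrCreate()
    val path = writeSample()
    // Default line-by-line parsing, printed only for comparison
    read(spark, path, multiLine = false).printSchema()
    // Multi-line parsing
    read(spark, path, multiLine = true).printSchema()
    spark.stop()
  }
}
```

Comparing the two printed schemas on the real file should show whether embedded newlines are what turns the numeric columns into strings. If the schemas are identical, the cause lies elsewhere.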

Error:

import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.linalg.Vectors
df: org.apache.spark.sql.DataFrame = [label: double, Avg Area Income: string ... 3 more fields]
assembler: org.apache.spark.ml.feature.VectorAssembler = vecAssembler_3e70ff1660b1
java.lang.IllegalArgumentException: Data type StringType of column Avg Area Income is not supported.
Data type StringType of column Avg Area House Age is not supported.
  at org.apache.spark.ml.feature.VectorAssembler.transformSchema(VectorAssembler.scala:124)
  at org.apache.spark.ml.PipelineStage.transformSchema(Pipeline.scala:74)
  at org.apache.spark.ml.feature.VectorAssembler.transform(VectorAssembler.scala:54)
  ... 121 elided
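
The exception says that VectorAssembler rejects columns whose schema type is StringType, and the df line in the output above shows that Avg Area Income was inferred as string rather than double. So the assembler is not mishandling floats; the columns never arrived as floats. One possible stopgap, sketched below on the assumption that the string columns hold mostly parseable numbers, is to check the schema, cast the feature columns to DoubleType, and drop the rows where the cast produced null. The object CastFeaturesSketch, its helpers and the sample rows are hypothetical. The third row stands in for a malformed record.

```scala
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{DoubleType, StringType}

// Hypothetical helper object, not part of the original question.
object CastFeaturesSketch {

  val featureCols: Seq[String] = Seq(
    "Avg Area Income", "Avg Area House Age", "Avg Area Number of Rooms", "Area Population")

  // Illustrative rows: two feature columns arrive as strings, as in the error above.
  def sample(spark: SparkSession): DataFrame = {
    import spark.implicits._
    Seq(
      (1260617.0, "63345.240046", "7.188236", 5.586729, 34310.242831),
      (630943.5, "59982.197226", "5.040555", 7.839388, 26354.109472),
      (0.0, "not a number", "also not a number", 0.0, 0.0) // stands in for a broken record
    ).toDF("label" +: featureCols: _*)
  }

  // Names of the given columns whose schema type is StringType.
  def stringTyped(df: DataFrame, cols: Seq[String]): Seq[String] =
    cols.filter(c => df.schema(c).dataType == StringType)

  // Casts the given columns to double; values that cannot be parsed become null,
  // so rows with a null in any of those columns are dropped afterwards.
  def castToDouble(df: DataFrame, cols: Seq[String]): DataFrame =
    cols
      .foldLeft(df)((acc, c) => acc.withColumn(c, col(c).cast(DoubleType)))
      .na.drop(cols)

  // Assembles the feature columns into a single "features" vector column.
  def assemble(df: DataFrame): DataFrame =
    new VectorAssembler()
      .setInputCols(featureCols.toArray)
      .setOutputCol("features")
      .transform(df)
      .select("label", "features")

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("cast-features-sketch")
      .getOrCreate()
    val raw = sample(spark)
    println(s"String-typed feature columns: ${stringTyped(raw, featureCols).mkString(", ")}")
    assemble(castToDouble(raw, featureCols)).show(false)
    spark.stop()
  }
}
```

Casting only hides the symptom. If many rows are dropped, it is worth finding out why the reader produced strings in the first place, for example by filtering for rows where the cast yields null and looking at their raw values.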

0 answers:

No answers yet.