How to convert a Spark DataFrame [Double, String] to LabeledPoint?

Asked: 2019-04-14 08:05:34

Tags: apache-spark-sql apache-spark-mllib apache-spark-ml

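Below is the code I am trying. I read SalesData from a CSV file into a DataFrame and then try to convert it to LabeledPoints. However, the last step fails with the following compilation error: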

package macros contains object and package with same name: blackbox

Can you tell me what I am doing wrong here? Thanks.

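-Edit-
The compilation problem was solved by adding the Scala 2.11 build of mllib to build.gradle. However, mlData.show now fails at runtime:

Error: java.lang.ClassCastException: java.lang.String cannot be cast to org.apache.spark.ml.linalg.Vector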

    val path = "SalesData.csv"
    val conf = new SparkConf().setMaster("local[2]").set("deploy-mode", "client").set("spark.driver.bindAddress", "127.0.0.1")
      .set("spark.broadcast.compress", "false")
      .setAppName("local-spark-kafka-consumer-client")
    val sparkSession = SparkSession
      .builder()
      .config(conf)
      .getOrCreate()
    val data = sparkSession.read.format("csv").option("header", "true").option("inferSchema", "true").load(path)
    data.cache()
    import org.apache.spark.sql.DataFrameNaFunctions
    data.na.drop()
    data.show

    //get monthly sales totals 
    val summary = data.select("OrderMonthYear","SaleAmount").groupBy("OrderMonthYear").sum().orderBy("OrderMonthYear").toDF("OrderMonthYear","SaleAmount")
    summary.show

    // convert ordermonthyear to integer type
    //val results = summary.map(df => (df.getAs[String]("OrderMonthYear").replace("-", "") , df.getAs[String]("SaleAmount"))).toDF(["OrderMonthYear","SaleAmount"])
    import org.apache.spark.sql.functions._
    val test = summary.withColumn("OrderMonthYear", (regexp_replace(col("OrderMonthYear").cast("String"),"-",""))).toDF("OrderMonthYear","SaleAmount")
    test.printSchema()
    test.show
    import sparkSession.implicits._
    val mlData = test.select("OrderMonthYear", "SaleAmount").
                  map(row => org.apache.spark.ml.feature.LabeledPoint(
                              row.getAs[Double](1),
                              row.getAs[org.apache.spark.ml.linalg.Vector](0))).toDF
    mlData.show
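
A possible cause of the ClassCastException, sketched under assumptions: `row.getAs[org.apache.spark.ml.linalg.Vector](0)` converts nothing, it only casts the stored value, and column 0 (`OrderMonthYear`) holds a String after `regexp_replace`. One way around it is to parse the string to a Double and wrap it in a one-element dense vector. The names `LabeledPointSketch` and `toLabeledPoints` and the sample rows are hypothetical, and the sketch assumes `SaleAmount` is a Double, as the title suggests; if the sum comes back as a Long, `getDouble` would need to be replaced accordingly.

```scala
import org.apache.spark.ml.feature.LabeledPoint
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}

object LabeledPointSketch {

  // Assumes df has columns OrderMonthYear: String (digits only) and SaleAmount: Double.
  def toLabeledPoints(spark: SparkSession, df: DataFrame): Dataset[LabeledPoint] = {
    import spark.implicits._ // provides the encoder for the LabeledPoint case class
    df.select("OrderMonthYear", "SaleAmount").map { row =>
      // Column 0 is a String: parse it, then build a one-element feature vector.
      val features = Vectors.dense(row.getString(0).toDouble)
      // Column 1 is the label.
      LabeledPoint(row.getDouble(1), features)
    }
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("labeled-point-sketch")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical stand-in for the `test` DataFrame from the question.
    val test = Seq(("200901", 1200.5), ("200902", 980.0))
      .toDF("OrderMonthYear", "SaleAmount")

    val mlData = toLabeledPoints(spark, test).toDF()
    mlData.printSchema()
    mlData.show()

    spark.stop()
  }
}
```

If the month key should stay out of the row-level `map`, casting the column to a numeric type and using `VectorAssembler` from `org.apache.spark.ml.feature` is another common route to a `features` column.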

0 answers:

No answers yet