Spark Streaming - Kafka - createStream - RDD to DataFrame

Date: 2017-04-19 08:27:24

Tags: apache-kafka spark-streaming

We have data streaming in through a Kafka topic, and I have been looking at consuming it with Spark Streaming:

  val ssc = new StreamingContext(l_sparkcontext, Seconds(30))
  val kafkaStream = KafkaUtils.createStream(ssc, "xxxx.xx.xx.com:2181", "new-spark-streaming-group", Map("event_log" -> 10))
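
For reference, these are the imports the snippet above relies on (a minimal sketch, assuming the receiver-based Kafka integration that ships as the spark-streaming-kafka artifact); createStream returns a DStream of (key, message) pairs:

    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils   // receiver-based createStream API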

This works fine. What I want is to write the stream out to a Parquet file, assigning column names to the data. So I do the following:

kafkaStream.foreachRDD(rdd => {
  if (rdd.count() == 0 ) {
    println("No new SKU's received in this time interval " + Calendar.getInstance().getTime())
  }
  else {
    println("No of SKUs received " + rdd.count())
    rdd.map(record => {
      record._2
    }).toDF("customer_id", "sku", "type", "event", "control_group", "event_date")
      .write.mode(SaveMode.Append).format("parquet").save(outputPath)
  }
})

However, this produces the error:

java.lang.IllegalArgumentException: requirement failed: The number of columns doesn't match.
Old column names (1): _1
New column names (6): customer_id, sku, type, event, control_group, event_date
    at scala.Predef$.require(Predef.scala:233)
    at org.apache.spark.sql.DataFrame.toDF(DataFrame.scala:224)
    at org.apache.spark.sql.DataFrameHolder.toDF(DataFrameHolder.scala:36)
    at kafka_receive_messages$$anonfun$main$1.apply(kafka_receive_messages.scala:77)
    at kafka_receive_messages$$anonfun$main$1.apply(kafka_receive_messages.scala:69)

What am I doing wrong? Should we do the split inside the map? If we do that, though, how do we then convert it with toDF("..columns..")?

Thanks for your help.

Regards,

Bala

1 answer:

Answer 0 (score: 1)

Thanks for looking. I sorted it out; it was a coding issue. Each DStream record is a (key, message) pair, so record._2 is a single String and the resulting RDD has only one column, which is why toDF with six names failed. Splitting the message into its fields first fixes it. For anyone who wants to do this in the future, change the else part as below:

kafkaStream.foreachRDD(rdd => {
  if (rdd.count() == 0) {
    println("No new SKU's received in this time interval " + Calendar.getInstance().getTime())
  }
  else {
    println("No of SKUs received " + rdd.count())
    // Split each message on commas, strip the quote character (Quote is defined
    // elsewhere) from the numeric fields, and build a tuple per record so toDF
    // can assign the six column names.
    rdd.map(record => record._2.split(","))
      .map(r => (r(0).replace(Quote, "").toInt, r(1).replace(Quote, "").toInt, r(2), r(3), r(4), r(5)))
      .toDF("customer_id", "sku", "type", "event", "control_group", "event_date")
      .write.mode(SaveMode.Append).format("parquet").save(outputPath)
  }
})
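
A variation on the same idea (just a sketch, assuming the messages are comma-separated, that Quote holds the quote character to strip, and that a SQLContext named sqlContext is in scope for the toDF implicits) is to parse each message into a case class, so the column names and types live in one place:

    // Hypothetical record type matching the six columns written above.
    // Define it outside the method body so Spark can resolve its TypeTag.
    case class SkuEvent(customer_id: Int, sku: Int, `type`: String,
                        event: String, control_group: String, event_date: String)

    kafkaStream.foreachRDD(rdd => {
      if (rdd.count() > 0) {
        import sqlContext.implicits._   // needed for toDF on an RDD of case classes
        rdd.map(record => record._2.split(","))
          .map(f => SkuEvent(f(0).replace(Quote, "").toInt, f(1).replace(Quote, "").toInt,
                             f(2), f(3), f(4), f(5)))
          .toDF()
          .write.mode(SaveMode.Append).parquet(outputPath)
      }
    })

With a case class the schema travels with the type, so renaming or reordering a field only has to be done in one place.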

Thanks again,

Bala