How to dynamically define the schema of a streaming Dataset to write it to CSV?

Time: 2017-07-28 18:51:50

Tags: scala apache-spark apache-kafka spark-structured-streaming spark-csv

I have a streaming Dataset that I read from Kafka and am trying to write out as CSV:

case class Event(map: Map[String,String])
def decodeEvent(arrByte: Array[Byte]): Event = ...//some implementation
import spark.implicits._ // encoders for Array[Byte] and the Event case class

val eventDataset: Dataset[Event] = spark
  .readStream
  .format("kafka")
  .load()
  .select("value")
  .as[Array[Byte]]
  .map(decodeEvent)
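
The decoder itself is left out above. Purely for illustration, a minimal hypothetical decodeEvent, assuming each Kafka value is a UTF-8 string of comma-separated key=value pairs (the real wire format is not stated in the question), could look like this:

// Hypothetical decoder for illustration only; the actual format is not
// given in the question.
def decodeEvent(arrByte: Array[Byte]): Event = {
  val pairs = new String(arrByte, "UTF-8")
    .split(",")
    .map(_.split("=", 2))
    .collect { case Array(k, v) => k.trim -> v.trim }
  Event(pairs.toMap)
}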

Event holds a Map[String,String], and to write it out as CSV I need some schema.

Let's say all the fields are of type String, so I tried the example from the Spark repo:

val columns = List("year","month","date","topic","field1","field2")

// Build the schema programmatically; StructType.add returns a new
// StructType, so fold instead of calling add inside a foreach.
val schema = columns.foldLeft(new StructType()) {
  (acc, field) => acc.add(field, "string")
}

val rowRdd = eventDataset.rdd.map { event =>
  Row.fromSeq(columns.map(c => event.map.getOrElse(c, "")))
}
val df = spark.sqlContext.createDataFrame(rowRdd, schema)

This fails at runtime on the "eventDataset.rdd" line with:

Caused by: org.apache.spark.sql.AnalysisException: Queries with streaming sources must be executed with writeStream.start();;

The following doesn't work either, because '.map' yields a List[String] rather than a Tuple:

eventDataset.map(event => columns.map(c => event.map.getOrElse(c, "")))
  .toDF(columns: _*)

Is there a way to achieve this with a programmatic schema and a structured streaming Dataset?

1 Answer:

Answer 0 (score: 2):

I would use a much simpler approach:

import org.apache.spark.sql.functions._

eventDataset.select(columns.map(
  c => coalesce($"map".getItem(c), lit("")).alias(c)
): _*).writeStream.format("csv").start(path)
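
To actually run this end to end, the streaming file sink also needs a checkpoint location (unless a session-wide default is configured) and a concrete output path. A minimal sketch, with hypothetical directories:

import org.apache.spark.sql.functions._
import spark.implicits._ // for the $"map" column syntax

// Hypothetical directories; any durable storage path works.
val outputPath = "/tmp/events-csv"
val checkpointPath = "/tmp/events-csv-checkpoint"

val query = eventDataset
  .select(columns.map(c => coalesce($"map".getItem(c), lit("")).alias(c)): _*)
  .writeStream
  .format("csv")
  .option("checkpointLocation", checkpointPath) // file sinks need one unless a default is set
  .start(outputPath)

query.awaitTermination()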

but if you want something closer to your current solution, skip the RDD conversion:

import org.apache.spark.sql.catalyst.encoders.RowEncoder

eventDataset.map(event =>
  Row.fromSeq(columns.map(c => event.map.getOrElse(c, "")))
)(RowEncoder(schema)).writeStream.format("csv").start(path)
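
The explicit RowEncoder(schema) is what makes this second variant compile: Spark does not provide an implicit Encoder[Row], so it has to be passed to map by hand. Staying on the Dataset instead of calling .rdd also keeps the query a streaming one, which is why the original AnalysisException no longer applies.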