Spark .csv with a variable number of columns

Date: 2017-02-13 10:15:23

Tags: scala csv apache-spark

I have a case class like this:

case class ResultDays (name: String, number: Double, values: Double*)
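
For context, a minimal setup that produces such an RDD might look like the following; the SparkSession configuration and the sample records are illustrative assumptions, not from the original post (on Spark 1.x you would use a SQLContext and import sqlContext.implicits._ instead):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("variable-columns-csv")  // hypothetical app name
  .master("local[*]")
  .getOrCreate()
import spark.implicits._  // needed below for toDF() and the $"..." column syntax

// hypothetical sample records; the varargs become a Seq[Double], i.e. array<double>:
val resultRDD = spark.sparkContext.parallelize(Seq(
  ResultDays("a", 0.1, 0.01, 0.001),
  ResultDays("b", 0.2, 0.02, 0.002),
  ResultDays("c", 0.3, 0.03, 0.003)
))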

I want to save it to a .csv file:

resultRDD.toDF()
  .coalesce(1)
  .write.format("com.databricks.spark.csv")
  .option("header", "true")
  .save("res/output/result.csv")

Unfortunately, I get this error:

java.lang.UnsupportedOperationException: CSV data source does not support array<double> data type.

So, how can I write a variable number of values and save them to a .csv file?

1 answer:

Answer 0 (score: 1)

If you can assume that all records in resultRDD have the same number of entries in values, you can take the first() record, use it to determine how many values the array holds, and convert those arrays into separate columns:

// determine number of "extra" columns:
val extraCols = resultRDD.first().values.size

// create a sequence of desired columns:
val columns = Seq($"name", $"number") ++ (1 to extraCols).map(i => $"values"(i - 1) as s"col$i")

// select the above columns before saving:
resultRDD.toDF()
  .select(columns: _*)
  .coalesce(1)
  .write.format("com.databricks.spark.csv")
  .option("header", "true")
  .save("res/output/result.csv")

A sample CSV result would look like this:

name,number,col1,col2
a,0.1,0.01,0.001
b,0.2,0.02,0.002
c,0.3,0.03,0.003
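
If the records can in fact have different numbers of values, a variant of the same idea is to size the columns to the longest array instead of the first record. The sketch below is an extension of the answer under stated assumptions, not part of it: it relies on Spark returning null for out-of-range array indexes, which the CSV writer then emits using its nullValue (an empty string by default), and the output path is hypothetical.

// number of columns needed to fit the longest values array:
val maxCols = resultRDD.map(_.values.size).max()

val allColumns = Seq($"name", $"number") ++
  (1 to maxCols).map(i => $"values"(i - 1) as s"col$i")

// out-of-range indexes yield null, written as empty CSV fields:
resultRDD.toDF()
  .select(allColumns: _*)
  .coalesce(1)
  .write.format("com.databricks.spark.csv")
  .option("header", "true")
  .save("res/output/result_padded.csv")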