Condense several columns read with Spark CSV

Date: 2016-07-11 19:18:12

Tags: scala csv apache-spark

I have data like the following in a CSV file:

ColumnA,1,2,3,2,1
"YYY",242,34234,232,322,432
"ZZZ",16,435,363,3453,3434

I want to read it with https://github.com/databricks/spark-csv

I would like to read this into a DataFrame and condense all the columns except the first one into a Seq.

So I would like to obtain something like this from it:

MyCaseClass("YYY", Seq(242,34234,232,322,432))
MyCaseClass("ZZZ", Seq(16,435,363,3453,3434))

I'm not sure how to obtain that.

I tried reading like this, where url is the location of the file:

val rawData = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .load(url)

Then, I am mapping it into the values that I want.

The problem is that I get the error:

The header contains a duplicate entry: '1'

So how can I condense all the fields except the first into a Seq using spark-csv?

EDIT

I cannot change the format of the input.

1 answer:

Answer 0: (score: -1)

You can do this by mapping over each row. Also, as Pawel's comment points out, duplicate column names are not allowed. So you can do something like this:

import org.apache.spark.sql.Row

val dataFrame = yourCSV_DataFrame

dataFrame.map { row =>
  // Keep the first column as-is and collect every remaining column into a Seq
  Row(row(0), row.toSeq.tail)
}
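The condensing step itself is plain Scala and can be sketched independently of Spark. Below is a minimal sketch, assuming the hypothetical case class MyCaseClass from the question; in Spark you would read with .option("header", "false") to avoid the duplicate-header error and apply the same per-row transformation:

```scala
// Hypothetical case class matching the output shape from the question.
case class MyCaseClass(name: String, values: Seq[Int])

// Condense one row of fields: keep the first field as the name
// (surrounding quotes stripped), parse the rest as integers.
def condense(fields: Seq[String]): MyCaseClass =
  MyCaseClass(
    fields.head.stripPrefix("\"").stripSuffix("\""),
    fields.tail.map(_.trim.toInt)
  )

val lines = Seq(
  "\"YYY\",242,34234,232,322,432",
  "\"ZZZ\",16,435,363,3453,3434"
)
val condensed = lines.map(line => condense(line.split(",").toSeq))
// condensed.head == MyCaseClass("YYY", Seq(242, 34234, 232, 322, 432))
```

Note this naive split(",") is only a sketch for quote-free numeric fields; a real CSV parser (or spark-csv itself) should handle quoting and escaping.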