Question

I have a data source that is stored as a large number of gzipped CSV files. The header information for this source is in a separate file.

I'd like to load this data into Spark for manipulation. Is there an easy way to get Spark to figure out the schema and load the headers? There are literally hundreds of columns, and they might change between runs, so I would strongly prefer not to do this by hand.

Answer 1

This can easily be done in Spark. If your header file is header.csv and contains only the header row, first load it with the header option set to true:

val headerCSV  = spark.read.format("CSV").option("header","true").load("/home/shivansh/Desktop/header.csv")

Then get the column names as an Array:

val columns = headerCSV.columns

Then read the data file, which has no header row, and pass the column names to toDF to rename its columns:

spark.read.format("CSV").load("/home/shivansh/Desktop/fileWithoutHeader.csv").toDF(columns:_*)

The result is a DataFrame that holds the data from the headerless file under the column names taken from the header file.
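Since the question mentions many gzipped files and asks about the schema, the same approach can be extended as in the rough sketch below. It assumes the code runs in Zeppelin or spark-shell, and the two paths are hypothetical placeholders, not taken from the question. Spark picks the gzip codec from the .gz file extension, and the inferSchema option asks Spark to guess column types at the cost of an extra pass over the data; without it every column is read as a string.

```scala
import org.apache.spark.sql.SparkSession

// In Zeppelin or spark-shell a session already exists; getOrCreate() reuses it.
val spark = SparkSession.builder().appName("csv-with-separate-header").getOrCreate()

// Hypothetical locations; substitute your own.
val headerPath = "/data/source/header.csv"
val dataGlob   = "/data/source/parts/*.csv.gz"

// Column names come from the first line of the header file.
val columns = spark.read.option("header", "true").csv(headerPath).columns

// Read every gzipped file matching the glob as headerless CSV.
// inferSchema triggers an extra pass over the data to guess column types.
val raw = spark.read
  .option("header", "false")
  .option("inferSchema", "true")
  .csv(dataGlob)

// toDF rejects a mismatched column count, so check first for a clearer message.
require(
  raw.columns.length == columns.length,
  s"header has ${columns.length} columns but data has ${raw.columns.length}"
)

val df = raw.toDF(columns: _*)
df.printSchema()
```

Because the column names are re-read from the header file on every run, a change in the header between runs is picked up without editing the code. Note that gzip files are not splittable, so each file is read by a single task.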

Can I auto-load csv headers from a separate file for a scala spark window on Zeppelin?
