I have a data source stored as a large number of gzipped CSV files. The header information for this source is in a separate file.
I'd like to load this data into Spark for manipulation. Is there an easy way to get Spark to figure out the schema and load the headers? There are literally hundreds of columns, and they might change between runs, so I would strongly prefer not to do this by hand.
Answer 0 (score: 4)
This is easy to do in Spark. If your header file is header.csv and contains only the header row, first load it with the header option set to true:
val headerCSV = spark.read.format("CSV").option("header","true").load("/home/shivansh/Desktop/header.csv")
Then get the column names as an array:
val columns = headerCSV.columns
Then read the data file, which has no header row, and apply those column names to it:
spark.read.format("CSV").load("/home/shivansh/Desktop/fileWithoutHeader.csv").toDF(columns:_*)
The result is a DataFrame that combines the data with the column names from the header file.
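Putting the steps together, here is a minimal sketch that also covers the gzipped case from the question. It assumes the data files end in .gz (Spark decompresses those transparently based on the extension) and that every data file has the same column order as the header file. The object name, `loadWithHeader`, and both paths are placeholders made up for illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object LoadWithExternalHeader {

  // Hypothetical helper: reads column names from a header-only CSV file
  // and applies them to headerless data files matched by dataGlob.
  def loadWithHeader(spark: SparkSession, headerPath: String, dataGlob: String): DataFrame = {
    // The header file contains only the header row, so this yields an
    // empty DataFrame whose column names come from that row.
    val columns = spark.read.option("header", "true").csv(headerPath).columns

    // The data files have no header row; Spark names the columns _c0, _c1, ...
    val data = spark.read.option("header", "false").csv(dataGlob)

    // Fail early if the header and the data disagree on the column count.
    require(
      data.columns.length == columns.length,
      s"header has ${columns.length} columns, data has ${data.columns.length}"
    )

    // Rename the positional columns to the names from the header file.
    data.toDF(columns: _*)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("load-with-external-header")
      .master("local[*]") // assumption: local run; drop this when submitting to a cluster
      .getOrCreate()

    // Placeholder paths; a glob picks up all gzipped parts in one read.
    val df = loadWithHeader(spark, "/path/to/header.csv", "/path/to/data/*.csv.gz")
    df.printSchema()

    spark.stop()
  }
}
```

Without a schema, every column is read as a string. If you need typed columns, adding `.option("inferSchema", "true")` to the data read makes Spark guess the types, at the cost of an extra pass over the data.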