Save a Spark DataFrame schema to HDFS

Date: 2018-01-13 00:00:11

Tags: hdfs spark-dataframe

For a given DataFrame (df), we get its schema via df.schema, which is a StructType (an array of StructField objects). Can I save this schema to HDFS when running from spark-shell? Also, what is the best format to save the schema in?

2 Answers:

Answer 0: (score: 0)

Yes. Writing the DataFrame as Parquet persists the schema together with the data:

df.write.format("parquet").save("path")   # give an HDFS path

You can read it back from HDFS as well:

sqlContext.read.parquet("path")   # give the HDFS path

Parquet plus compression is a good storage strategy whether the data resides on HDFS or S3. Parquet is a columnar format, so queries that need only a few columns perform well: the remaining columns are never read.
Please also refer to this link: https://stackoverflow.com/questions/34361222/dataframe-to-hdfs-in-spark-scala
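The columnar point can be illustrated with a toy sketch in plain Python (this is not Parquet itself, just the layout idea): when columns are stored separately, reading one column never touches the others.

```python
# Toy illustration of row vs. columnar layout (plain Python, not Parquet).
rows = [
    {"id": 1, "name": "a", "score": 0.5},
    {"id": 2, "name": "b", "score": 0.9},
]

# Row layout: to collect one field we still iterate over whole records.
ids_from_rows = [r["id"] for r in rows]

# Columnar layout: each column is stored contiguously on its own.
columns = {
    "id": [1, 2],
    "name": ["a", "b"],
    "score": [0.5, 0.9],
}
ids_from_columns = columns["id"]  # direct access; "name" and "score" untouched

print(ids_from_rows == ids_from_columns)  # True
```

Parquet applies this layout on disk (plus per-column compression and statistics), which is why column-pruning queries are cheap.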

Answer 1: (score: 0)

You can use treeString:

schema = df._jdf.schema().treeString()

then wrap it in an RDD and use saveAsTextFile:

sc.parallelize([schema]).saveAsTextFile(...)

Or use saveAsPickleFile:

temp_rdd = sc.parallelize([schema])  # wrap in a list; parallelizing a bare string splits it into characters
temp_rdd.coalesce(1).saveAsPickleFile("s3a://path/to/destination_schema.pickle")
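Note that treeString() is meant for human inspection and cannot be parsed back. A more portable option is df.schema.json(), which PySpark's StructType.fromJson can load again. A minimal sketch of that round trip, using a hardcoded sample string in Spark's schema-JSON layout (the two-column schema is hypothetical, since no Spark session is assumed here):

```python
import json

# Sample of the JSON that df.schema.json() emits
# (hypothetical two-column schema in Spark's StructType JSON layout).
schema_json = (
    '{"type":"struct","fields":['
    '{"name":"id","type":"long","nullable":true,"metadata":{}},'
    '{"name":"name","type":"string","nullable":true,"metadata":{}}]}'
)

# In a real job you would obtain and persist it roughly like this:
#   schema_json = df.schema.json()
#   sc.parallelize([schema_json]).coalesce(1).saveAsTextFile("hdfs:///path/schema")
# and restore it with:
#   from pyspark.sql.types import StructType
#   schema = StructType.fromJson(json.loads(schema_json))

parsed = json.loads(schema_json)
field_names = [f["name"] for f in parsed["fields"]]
print(field_names)  # ['id', 'name']
```

Because the JSON form round-trips into a StructType, it is a better on-disk format for schemas than the tree rendering.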