Save a Spark DataFrame schema to HDFS

Date: 2018-01-13 00:00:11

Tags: hdfs spark-dataframe

For a given DataFrame (df), we get its schema via df.schema, which is a StructType (an array of StructField objects). Can I save this schema to HDFS when running from spark-shell? Also, what is the best format to save the schema in?

2 Answers:

Answer 0: (score: 0)

Yes. Writing the DataFrame as Parquet persists the schema together with the data:

df.write.format("parquet").save("path")   # give an HDFS path

You can read it back from HDFS as well:

sqlContext.read.parquet("path")   # give the HDFS path

Parquet plus compression is a good storage strategy whether the data resides on HDFS or S3. Parquet is a columnar format, so queries that need only a few columns perform well: the remaining columns are never read.
Please also refer to this link: https://stackoverflow.com/questions/34361222/dataframe-to-hdfs-in-spark-scala
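The columnar point can be illustrated with a toy sketch in plain Python (this is not Parquet itself, just the layout idea): when columns are stored separately, reading one column never touches the others.

```python
# Toy illustration of row vs. columnar layout (plain Python, not Parquet).
rows = [
    {"id": 1, "name": "a", "score": 0.5},
    {"id": 2, "name": "b", "score": 0.9},
]

# Row layout: to collect one field we still iterate over whole records.
ids_from_rows = [r["id"] for r in rows]

# Columnar layout: each column is stored contiguously on its own.
columns = {
    "id": [1, 2],
    "name": ["a", "b"],
    "score": [0.5, 0.9],
}
ids_from_columns = columns["id"]  # direct access; "name" and "score" untouched

print(ids_from_rows == ids_from_columns)  # True
```

Parquet applies this layout on disk (plus per-column compression and statistics), which is why column-pruning queries are cheap.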

Answer 1: (score: 0)

You can use treeString:

schema = df._jdf.schema().treeString()

then wrap it in an RDD and use saveAsTextFile:

sc.parallelize([schema]).saveAsTextFile(...)

Or use saveAsPickleFile:

temp_rdd = sc.parallelize([schema])  # wrap in a list; parallelizing a bare string splits it into characters
temp_rdd.coalesce(1).saveAsPickleFile("s3a://path/to/destination_schema.pickle")
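Note that treeString() is meant for human inspection and cannot be parsed back. A more portable option is df.schema.json(), which PySpark's StructType.fromJson can load again. A minimal sketch of that round trip, using a hardcoded sample string in Spark's schema-JSON layout (the two-column schema is hypothetical, since no Spark session is assumed here):

```python
import json

# Sample of the JSON that df.schema.json() emits
# (hypothetical two-column schema in Spark's StructType JSON layout).
schema_json = (
    '{"type":"struct","fields":['
    '{"name":"id","type":"long","nullable":true,"metadata":{}},'
    '{"name":"name","type":"string","nullable":true,"metadata":{}}]}'
)

# In a real job you would obtain and persist it roughly like this:
#   schema_json = df.schema.json()
#   sc.parallelize([schema_json]).coalesce(1).saveAsTextFile("hdfs:///path/schema")
# and restore it with:
#   from pyspark.sql.types import StructType
#   schema = StructType.fromJson(json.loads(schema_json))

parsed = json.loads(schema_json)
field_names = [f["name"] for f in parsed["fields"]]
print(field_names)  # ['id', 'name']
```

Because the JSON form round-trips into a StructType, it is a better on-disk format for schemas than the tree rendering.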