For a given DataFrame (df), df.schema returns the schema as a StructType (an array of StructFields). When running from spark-shell, can I save this schema to HDFS? And what is the best format to save the schema in?
Answer 0 (score: 0)
Yes, you can save the schema by writing the DataFrame as Parquet, since Parquet stores the schema together with the data:
df.write.format("parquet").save("path")  # give an HDFS path
You can read it back from HDFS as well:
sqlContext.read.parquet("path")  # give the HDFS path
Parquet + compression is the best storage strategy, whether the data resides on S3 or not.
Parquet is a columnar format, so queries perform well because they read only the columns they need instead of scanning all of them.
Please also refer to this link: https://stackoverflow.com/questions/34361222/dataframe-to-hdfs-in-spark-scala
Answer 1 (score: 0)
You can use treeString:
schema = df._jdf.schema().treeString()
then turn it into an RDD and use saveAsTextFile:
sc.parallelize([schema]).saveAsTextFile(...)
Or use saveAsPickleFile:
temp_rdd = sc.parallelize([schema])  # wrap the string in a list so it is stored as a single record
temp_rdd.coalesce(1).saveAsPickleFile("s3a://path/to/destination_schema.pickle")
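One pitfall worth noting: sc.parallelize iterates over its argument, so passing the bare string would distribute it character by character. A plain-Python sketch of why the string must be wrapped in a one-element list (the treeString-style text here is a made-up example):

```python
# A made-up treeString-style schema text, just for illustration.
schema = "root\n |-- id: long (nullable = true)"

# parallelize() iterates its input, so a bare string would be split
# into individual characters...
assert list(schema)[:4] == ["r", "o", "o", "t"]

# ...while a one-element list keeps the whole schema as one record.
assert list([schema]) == [schema]
```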