I need to save a Map (key-value pairs) in a column using Spark. The requirement is that other people may use the data with other tools such as Pig, so a common format is preferable to a specially encoded string for saving the Map. I create the column with the following code:
StructField("cMap", DataTypes.createMapType(StringType, StringType), true) ::
Then, after I create the DataFrame, I get the schema:
|-- cMap: map (nullable = true)
| |-- key: string
| |-- value: string (valueContainsNull = true)
Then I save the DataFrame to JSON:
df.write.json(path)
I find that the JSON output is:
"cMap":{"1":"a","2":"b","3":"c"}
But when I next read it back from the file:
val new_df = sqlContext.read.json(path)
I get the schema:
|-- cMap: struct (nullable = true)
| |-- 1: string
| |-- 2: string
| |-- 3: string
Is there an effective way to save and read a map in JSON without extra processing? (I could save the map as a specially formatted string and decode it, but I don't think it should need to be that complicated.) Thanks.
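One direct workaround, staying with JSON, is to pass an explicit schema to the reader instead of letting Spark infer one. Spark's schema inference has no way to distinguish a JSON object that represents a map from one that represents a struct, which is why the round-trip changes the type. This is a minimal sketch assuming a local `SparkSession` and a placeholder path:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{MapType, StringType, StructField, StructType}

object ReadJsonWithMapSchema {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("readJsonWithMapSchema")
      .getOrCreate()

    // The same schema used when the DataFrame was created
    val mapSchema = StructType(Seq(
      StructField("cMap", MapType(StringType, StringType), nullable = true)
    ))

    // Supplying the schema explicitly skips inference, so cMap
    // stays a map<string,string> instead of becoming a struct
    val new_df = spark.read.schema(mapSchema).json("path")
    new_df.printSchema()

    spark.stop()
  }
}
```

The trade-off is that every reader has to know and supply the schema, which may defeat the goal of a self-describing format for other tools.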
Answer 0 (score: 0)
You can save the table as a Parquet file.
Write:
df.write.parquet("mydf.parquet")
Read:
val new_df = spark.read.parquet("mydf.parquet")
// Encoders for most common types are automatically provided by importing spark.implicits._
import spark.implicits._

val peopleDF = spark.read.json("examples/src/main/resources/people.json")

// DataFrames can be saved as Parquet files, maintaining the schema information
peopleDF.write.parquet("people.parquet")

// Read in the parquet file created above.
// Parquet files are self-describing so the schema is preserved.
// The result of loading a Parquet file is also a DataFrame.
val parquetFileDF = spark.read.parquet("people.parquet")

// Parquet files can also be used to create a temporary view and then used in SQL statements
parquetFileDF.createOrReplaceTempView("parquetFile")
val namesDF = spark.sql("SELECT name FROM parquetFile WHERE age BETWEEN 13 AND 19")
namesDF.map(attributes => "Name: " + attributes(0)).show()
Answer 1 (score: 0)
The Parquet format should solve the problem you are encountering. Parquet stores binary data in a column-oriented way, where the values of each column are organized so that they are all adjacent, enabling better compression.
Just save it to Parquet as follows:
df.write.mode(SaveMode.Overwrite).parquet("path to the output")
and read it back as follows:
val new_df = sqlContext.read.parquet("path to the above output")
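The write-then-read steps above can be sketched end to end. Parquet embeds the schema in the file itself, so the `MapType` survives the round trip without any extra handling. This is a minimal sketch assuming a local `SparkSession` and a temporary output directory:

```scala
import org.apache.spark.sql.{Row, SaveMode, SparkSession}
import org.apache.spark.sql.types.{MapType, StringType, StructField, StructType}

object ParquetMapRoundTrip {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("parquetMapRoundTrip")
      .getOrCreate()

    val schema = StructType(Seq(
      StructField("cMap", MapType(StringType, StringType), nullable = true)
    ))
    val rows = java.util.Arrays.asList(
      Row(Map("1" -> "a", "2" -> "b", "3" -> "c"))
    )
    val df = spark.createDataFrame(rows, schema)

    val out = java.nio.file.Files.createTempDirectory("mapdf").toString + "/out"
    df.write.mode(SaveMode.Overwrite).parquet(out)

    // The schema is stored in the Parquet footer, so no schema
    // needs to be supplied on read and cMap comes back as a map
    val new_df = spark.read.parquet(out)
    new_df.printSchema()

    spark.stop()
  }
}
```

Since Pig can also read Parquet (via `parquet-pig`), this keeps the data usable by the other tools mentioned in the question.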
I hope this helps.