Reading a single-line json in Spark where the column keys are variable

Asked: 2017-05-13 10:22:23

Tags: scala apache-spark hive apache-spark-sql

I have a single-line json file such as the following:


{"Hotel Dream":{"Guests":20,"Address":"14 Naik Street","City":"Manila"},"Serenity Stay":{"Guests":35,"Address":"10 St Marie Road","City":"Manila"}....}

I want to transform the per-hotel columns (Hotel Dream, Serenity Stay, etc.) so that the dataframe ends up with a normalized schema.

If I read the json into the spark context with the following, it produces:

val hotelDF = sqlContext.read.json("file")
hotelDF.printSchema

root
 |-- Hotel Dream: struct (nullable = true)
 |    |-- Address: string (nullable = true)
 |    |-- City: string (nullable = true)
 |    |-- Guests: long (nullable = true)
 |-- Serenity Stay: struct (nullable = true)
 |    |-- Address: string (nullable = true)
 |    |-- City: string (nullable = true)
 |    |-- Guests: long (nullable = true)

I have also tried reading the json as textFile or wholeTextFiles, but since there are no newline characters I cannot process the contents with a map function.

Any input on how to read this kind of data format?

1 Answer:

Answer 0 (score: 0):

The following could be a solution, as far as I understand your question (though it is not a perfect one):

import org.apache.spark.sql.functions.{col, lit}
import sqlContext.implicits._  // needed for toDF

// Seed the result with a dummy row so union can be applied inside the loop
var newDataFrame = Seq(("test", "test", "test", "test")).toDF("Hotel", "Address", "City", "Guests")
// Each top-level field name is a hotel; flatten its struct into one row
for (name <- hotelDF.schema.fieldNames) {
  val tempdf = hotelDF.withColumn("Hotel", lit(name))
    .withColumn("Address", hotelDF(name + ".Address"))
    .withColumn("City", hotelDF(name + ".City"))
    .withColumn("Guests", hotelDF(name + ".Guests"))
  val tdf = tempdf.select("Hotel", "Address", "City", "Guests")
  newDataFrame = newDataFrame.union(tdf)
}
// Drop the dummy seed row
newDataFrame.filter(!(col("Hotel") === "test")).show
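As a variation on the loop above, the dummy seed row can be avoided by building one small DataFrame per field name and folding them together with `reduce(_ union _)`. This is a sketch, not tested against the asker's data; it assumes the same `hotelDF` and the three struct fields shown in the question, and the name `normalized` is mine:

```scala
import org.apache.spark.sql.functions.{col, lit}

// One select per top-level field, then fold them together with union;
// no placeholder row is needed because reduce starts from the first DataFrame.
// Backticks guard against field names containing dots; spaces are fine either way.
val normalized = hotelDF.schema.fieldNames.map { name =>
  hotelDF.select(
    lit(name).as("Hotel"),
    col(s"`$name`.Address").as("Address"),
    col(s"`$name`.City").as("City"),
    col(s"`$name`.Guests").as("Guests"))
}.reduce(_ union _)

normalized.show
```

A side benefit: `Guests` should keep its long type here, whereas unioning against the all-string seed row in the loop version coerces it to string.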