Reading a single-line json in Spark where the column keys are variable

Asked: 2017-05-13 10:22:23

Tags: scala apache-spark hive apache-spark-sql

I have a single-line json file such as the following:


{"Hotel Dream":{"Guests":20,"Address":"14 Naik Street","City":"Manila"},"Serenity Stay":{"Guests":35,"Address":"10 St Marie Road","City":"Manila"}....}

I want to transform the per-hotel columns (Hotel Dream, Serenity Stay, etc.) so that the dataframe ends up with a normalized schema.

If I read the json into the spark context with the following, it produces:

val hotelDF = sqlContext.read.json("file")
hotelDF.printSchema

root
 |-- Hotel Dream: struct (nullable = true)
 |    |-- Address: string (nullable = true)
 |    |-- City: string (nullable = true)
 |    |-- Guests: long (nullable = true)
 |-- Serenity Stay: struct (nullable = true)
 |    |-- Address: string (nullable = true)
 |    |-- City: string (nullable = true)
 |    |-- Guests: long (nullable = true)

I have also tried reading the json as textFile or wholeTextFiles, but since there are no newline characters I cannot process the contents with a map function.

Any input on how to read this kind of data format?

1 Answer:

Answer 0 (score: 0):

The following could be a solution, as far as I understand your question (though it is not a perfect one):

import org.apache.spark.sql.functions.{col, lit}
import sqlContext.implicits._  // needed for toDF

// Seed the result with a dummy row so union can be applied inside the loop
var newDataFrame = Seq(("test", "test", "test", "test")).toDF("Hotel", "Address", "City", "Guests")
// Each top-level field name is a hotel; flatten its struct into one row
for (name <- hotelDF.schema.fieldNames) {
  val tempdf = hotelDF.withColumn("Hotel", lit(name))
    .withColumn("Address", hotelDF(name + ".Address"))
    .withColumn("City", hotelDF(name + ".City"))
    .withColumn("Guests", hotelDF(name + ".Guests"))
  val tdf = tempdf.select("Hotel", "Address", "City", "Guests")
  newDataFrame = newDataFrame.union(tdf)
}
// Drop the dummy seed row
newDataFrame.filter(!(col("Hotel") === "test")).show
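As a variation on the loop above, the dummy seed row can be avoided by building one small DataFrame per field name and folding them together with `reduce(_ union _)`. This is a sketch, not tested against the asker's data; it assumes the same `hotelDF` and the three struct fields shown in the question, and the name `normalized` is mine:

```scala
import org.apache.spark.sql.functions.{col, lit}

// One select per top-level field, then fold them together with union;
// no placeholder row is needed because reduce starts from the first DataFrame.
// Backticks guard against field names containing dots; spaces are fine either way.
val normalized = hotelDF.schema.fieldNames.map { name =>
  hotelDF.select(
    lit(name).as("Hotel"),
    col(s"`$name`.Address").as("Address"),
    col(s"`$name`.City").as("City"),
    col(s"`$name`.Guests").as("Guests"))
}.reduce(_ union _)

normalized.show
```

A side benefit: `Guests` should keep its long type here, whereas unioning against the all-string seed row in the loop version coerces it to string.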