Parsing hierarchical JSON into a DataFrame in Spark

Date: 2016-11-16 04:19:47

Tags: apache-spark dataframe rdd

I have a JSON file stored on HDFS that I am trying to read into my Spark context. The format of the JSON file is as follows:

root
 |-- Request: struct (nullable = true)
 |    |-- FxRatesList: struct (nullable = true)
 |    |    |-- FxRatesContract: array (nullable = true)
 |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |-- Currency: string (nullable = true)
 |    |    |    |    |-- FxRate: string (nullable = true)
 |    |-- TrancheList: struct (nullable = true)
 |    |    |-- Tranche: array (nullable = true)
 |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |-- Currency: string (nullable = true)
 |    |    |    |    |-- OwnedAmt: string (nullable = true)
 |    |    |    |    |-- Id: string (nullable = true)
 |    |-- baseCurrency: string (nullable = true)
 |    |-- isExcludeDeals: string (nullable = true)

The schema above is what printSchema shows me. I would like to end up with output like the following table:

Id      OwnedAmt      Currency
123     26500000      USD
456     41000000      USD

What would be the best way to create a DataFrame/RDD from the TrancheList part of the JSON so that it gives me a distinct list of Ids with their OwnedAmt and Currency, as in the table above?


Any help would be great. Thanks!

2 Answers:

Answer 0 (score: 0)

You should be able to access columns in the DataFrame hierarchy using dot notation.

In this example, the query would look something like this:

// Spark 2.0+; on Spark 1.6, use inputdf.registerTempTable("inputdf") instead
inputdf.createOrReplaceTempView("inputdf")

// Note: because Tranche is an array, each selected column comes back as an
// array of values per input row, not as one row per Tranche element.
spark.sql("select Request.TrancheList.Tranche.Id, Request.TrancheList.Tranche.OwnedAmt, Request.TrancheList.Tranche.Currency from inputdf")

Answer 1 (score: 0)

Here is another way of getting the same data:

import org.apache.spark.sql.functions.explode

val inputdf = spark.read.json("hdfs://localhost/user/xyz/request.json").select("Request.TrancheList.Tranche")
// explode turns the Tranche array into one row per array element
val dataDF = inputdf.select(explode(inputdf("Tranche"))).toDF("Tranche").select("Tranche.Id", "Tranche.OwnedAmt", "Tranche.Currency")
dataDF.show
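To see why explode is needed here, a minimal plain-Scala sketch (no Spark involved) of its semantics may help: the Tranche array column becomes one output row per array element, after which the struct fields can be projected out. The case-class names mirror the schema from the question; the sample values are the ones shown in the question's table and are illustrative only.

```scala
// Illustrative model of one Tranche element from the schema in the question.
case class Tranche(Id: String, OwnedAmt: String, Currency: String)

// Sample data matching the question's expected table (hypothetical values).
val trancheArray = Seq(
  Tranche("123", "26500000", "USD"),
  Tranche("456", "41000000", "USD")
)

// explode(inputdf("Tranche")) conceptually iterates the array, yielding one
// row per element; the trailing select projects the struct's fields.
val rows = trancheArray.map(t => (t.Id, t.OwnedAmt, t.Currency))
rows.foreach(println)
```

If duplicate Ids could occur in the data, deduplicating the exploded DataFrame (for example with `dataDF.dropDuplicates("Id")`) would give the distinct list of Ids the question asks for.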