I have a JSON file stored in HDFS that I am trying to read into my Spark context. The schema of the JSON file is as follows:
root
|-- Request: struct (nullable = true)
| |-- FxRatesList: struct (nullable = true)
| | |-- FxRatesContract: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- Currency: string (nullable = true)
| | | | |-- FxRate: string (nullable = true)
| |-- TrancheList: struct (nullable = true)
| | |-- Tranche: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- Currency: string (nullable = true)
| | | | |-- OwnedAmt: string (nullable = true)
| | | | |-- Id: string (nullable = true)
| |-- baseCurrency: string (nullable = true)
| |-- isExcludeDeals: string (nullable = true)
printSchema shows me the output above. What would be the best way to create a DataFrame/RDD from the TrancheList part of the JSON, so that it gives me a unique list of Ids together with their OwnedAmt and Currency, like the table below?

Id  OwnedAmt  Currency
123 26500000  USD
456 41000000  USD
Any help would be great. Thanks.
Answer 0 (score: 0)
You should be able to access the columns in the DataFrame hierarchy using dot notation. In this example, the query would look something like:
// Spark 2.0 example; use registerTempTable for Spark 1.6
inputdf.createOrReplaceTempView("inputdf")
// Note: because Tranche is an array of structs, each selected column comes
// back as an array; use explode to get one row per Tranche element.
spark.sql("select Request.TrancheList.Tranche.Id, Request.TrancheList.Tranche.OwnedAmt, Request.TrancheList.Tranche.Currency from inputdf")
Answer 1 (score: 0)
Here is another way to get at this data.
import org.apache.spark.sql.functions.explode

// Read the file and keep only the nested Tranche array
val inputdf = spark.read.json("hdfs://localhost/user/xyz/request.json").select("Request.TrancheList.Tranche")
// explode turns the array into one row per Tranche element, then project the fields
val dataDF = inputdf.select(explode(inputdf("Tranche"))).toDF("Tranche").select("Tranche.Id", "Tranche.OwnedAmt", "Tranche.Currency")
dataDF.show
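For reference, the same flattening logic can be sketched in plain Python over the raw JSON, independent of Spark. The document structure follows the schema in the question; the FxRate and baseCurrency values are invented placeholders, and the Tranche rows reuse the values from the question's target table.

```python
import json

# Sample document matching the schema in the question (FxRate, baseCurrency
# and isExcludeDeals values are made up for illustration).
raw = """
{
  "Request": {
    "FxRatesList": {
      "FxRatesContract": [
        {"Currency": "USD", "FxRate": "1.0"}
      ]
    },
    "TrancheList": {
      "Tranche": [
        {"Currency": "USD", "OwnedAmt": "26500000", "Id": "123"},
        {"Currency": "USD", "OwnedAmt": "41000000", "Id": "456"}
      ]
    },
    "baseCurrency": "USD",
    "isExcludeDeals": "false"
  }
}
"""

doc = json.loads(raw)

# Walk Request -> TrancheList -> Tranche and emit one (Id, OwnedAmt, Currency)
# tuple per array element -- the same shape the explode()-based query yields.
rows = [
    (t["Id"], t["OwnedAmt"], t["Currency"])
    for t in doc["Request"]["TrancheList"]["Tranche"]
]

for r in rows:
    print(r)
```

Each element of the `Tranche` array becomes one output row, which is exactly what `explode` does on the Spark side.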