I'm new to this and am trying to solve the following problem. Any help is greatly appreciated.
I have the following JSON:
{
  "index": "identity",
  "type": "identity",
  "id": "100000",
  "source": {
    "link_data": {
      "source_Id": "0011245"
    },
    "attribute_data": {
      "first": {
        "val": [
          true
        ],
        "updated_at": "2011"
      },
      "second": {
        "val": [
          true
        ],
        "updated_at": "2010"
      }
    }
  }
}
The attributes under "attribute_data" can vary; for example, there could be another attribute such as "third", as in the variant sketched below.
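In that case the "attribute_data" object might look like this (an illustrative variant; the "third" values here are made up, not from the original data):

"attribute_data": {
  "first":  { "val": [true], "updated_at": "2011" },
  "second": { "val": [true], "updated_at": "2010" },
  "third":  { "val": [true], "updated_at": "2009" }
}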
I want the result in the following format:
_index  _type  _id       source_Id  attribute_data  val   updated_at
ID      ID     randomid  00000      first           true  2000-08-08T07:51:14Z
ID      ID     randomid  00000      second          true  2010-08-08T07:51:14Z
I tried the following:
val df = spark.read.json("sample.json")
val res = df.select("index", "id", "type",
  "source.attribute_data.first.updated_at",
  "source.attribute_data.first.val",
  "source.link_data.source_id")
It just adds new columns instead of new rows, like this:
index     id      type      updated_at  val     source_id
identity  100000  identity  2011        [true]  0011245
Answer 0 (score: 0)
Try the following:
import org.apache.spark.sql.functions._
import spark.implicits._

val df = spark.read.json("sample.json")
// explode turns each element of a "val" array into its own row
df.select($"id", $"index", $"source.link_data.source_Id".as("source_Id"),
    $"source.attribute_data.first.val".as("first"),
    explode($"source.attribute_data.second.val").as("second"), $"type")
  .select($"id", $"index", $"source_Id", $"second", explode($"first"), $"type")
  .show
Answer 1 (score: 0)
Here is how you can solve your problem. Feel free to ask if anything needs explaining:
import spark.implicits._
val data = spark.read.json("sample.json")
// source.attribute_data.* expands each attribute (first, second, ...) into its own struct column
val readJsonDf = data.select($"index", $"type", $"id",
  $"source.link_data.source_id".as("source_id"), $"source.attribute_data.*")
readJsonDf.show()
Initial output:
+--------+--------+------+---------+--------------------+--------------------+
| index| type| id|source_id| first| second|
+--------+--------+------+---------+--------------------+--------------------+
|identity|identity|100000| 0011245|[2011,WrappedArra...|[2010,WrappedArra...|
+--------+--------+------+---------+--------------------+--------------------+
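At this point first and second are struct columns with identical layouts, which is what makes the generic transposition below possible. Roughly, printSchema() should show (a sketch of the inferred schema, not output from the original post):

readJsonDf.printSchema()
// root
//  |-- index: string (nullable = true)
//  |-- type: string (nullable = true)
//  |-- id: string (nullable = true)
//  |-- source_id: string (nullable = true)
//  |-- first: struct (nullable = true)
//  |    |-- updated_at: string (nullable = true)
//  |    |-- val: array (nullable = true)
//  |    |    |-- element: boolean (containsNull = true)
//  |-- second: struct (nullable = true)
//  |    |-- updated_at: string (nullable = true)
//  |    |-- val: array (nullable = true)
//  |    |    |-- element: boolean (containsNull = true)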
Then I used the following code to do the transformation dynamically:
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._

def transposeColumnstoRows(df: DataFrame, constantCols: Seq[String]): DataFrame = {
  // split the schema into the columns to keep as-is and the columns to transpose
  val (cols, types) = df.dtypes.filter { case (c, _) => !constantCols.contains(c) }.unzip
  // check that the columns to be transposed into rows all have the same structure
  require(types.distinct.size == 1, s"${types.distinct.toString}.length != 1")
  // build one (columnKey, value) struct per column, collect them in an array, and explode it into rows
  val keyColsWIthValues = explode(array(
    cols.map(c => struct(lit(c).alias("columnKey"), col(c).alias("value"))): _*
  ))
  df.select(constantCols.map(col(_)) :+ keyColsWIthValues.alias("keyColsWIthValues"): _*)
}
val newDf = transposeColumnstoRows(readJsonDf, Seq("index", "type", "id", "source_id"))
val requiredDf = newDf.select($"index", $"type", $"id", $"source_id",
  $"keyColsWIthValues.columnKey".as("attribute_data"),
  $"keyColsWIthValues.value.updated_at".as("updated_at"),
  $"keyColsWIthValues.value.val".as("val"))
requiredDf.show()
Final output:
+--------+--------+------+---------+--------------+----------+------+
|   index|    type|    id|source_id|attribute_data|updated_at|   val|
+--------+--------+------+---------+--------------+----------+------+
|identity|identity|100000|  0011245|         first|      2011|[true]|
|identity|identity|100000|  0011245|        second|      2010|[true]|
+--------+--------+------+---------+--------------+----------+------+
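If the explode(array(struct(...))) trick is hard to follow, here is a minimal, self-contained illustration on a made-up two-column DataFrame (a sketch; the toy data is not from the question):

import org.apache.spark.sql.functions._
import spark.implicits._

// one id column to keep, two columns (c1, c2) to transpose into rows
val toy = Seq((1, "a", "b")).toDF("id", "c1", "c2")
toy.select($"id",
    // one (columnKey, value) struct per column, exploded into one row each
    explode(array(
      struct(lit("c1").alias("columnKey"), $"c1".alias("value")),
      struct(lit("c2").alias("columnKey"), $"c2".alias("value"))
    )).alias("kv"))
  .select($"id", $"kv.columnKey", $"kv.value")
  .show()
// +--+---------+-----+
// |id|columnKey|value|
// +--+---------+-----+
// | 1|       c1|    a|
// | 1|       c2|    b|
// +--+---------+-----+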
Hope this solves your problem!