I have a column in a DataFrame that contains nested JSON in string format:
val df=Seq(("""{"-1":{"-1":[ 7420,0,20,22,0,0]}}""" ), ("""{"-1":{"-1":[1006,2,18,10,0,0]}}"""), ("""{"-1":{"-1":[6414,0,17,11,0,0]}}""")).toDF("column1")
+-------------------------------------+
| column1|
+-------------------------------------+
|{"-1":{"-1":[7420, 0, 20, 22, 0, 0]}}|
|{"-1":{"-1":[1006, 2, 18, 10, 0, 0]}}|
|{"-1":{"-1":[6414, 0, 17, 11, 0, 0]}}|
+-------------------------------------+
I want to get a DataFrame that looks like this:
+----+----+----+----+----+----+----+----+
|col1|col2|col3|col4|col5|col6|col7|col8|
+----+----+----+----+----+----+----+----+
| -1| -1|7420| 0| 20| 22| 0| 0|
| -1| -1|1006| 2| 18| 10| 0| 0|
| -1| -1|6414| 0| 17| 11| 0| 0|
+----+----+----+----+----+----+----+----+
I first applied get_json_object, which gave me:
val df1 = df.select(get_json_object($"column1", "$.-1"))
+------------------------------+
| column1|
+------------------------------+
|{"-1":[7420, 0, 20, 22, 0, 0]}|
|{"-1":[1006, 2, 18, 10, 0, 0]}|
|{"-1":[6414, 0, 17, 11, 0, 0]}|
+------------------------------+
So I lost the first element.
I then tried to convert the remaining elements into the format I want:
val schema = new StructType()
.add("-1",
MapType(
StringType,
new StructType()
.add("a1", StringType)
.add("a2", StringType)
.add("a3", StringType)
.add("a4", StringType)
.add("a5", StringType)
.add("a6", StringType)
.add("a7", StringType)
.add("a8", StringType)
.add("a9", StringType)
.add("a10", StringType)
.add("a11", StringType)
.add("a11", StringType)))
df1.select(from_json($"new2", schema ))
But it returns a one-column DataFrame containing only nulls.
Answer 0 (score: 0)
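The JSON data you provided seems invalid.
You can convert the column to an RDD of strings, replace every ", [, ], { and } with the empty string, and replace every : with a comma. That gives you a comma-separated string, which you can split and convert back to a DataFrame as follows: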
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._
//data as you provided
val df = Seq(
("""{"-1":{"-1":[ 7420,0,20,22,0,0]}}"""),
("""{"-1":{"-1":[1006,2,18,10,0,0]}}"""),
("""{"-1":{"-1":[6414,0,17,11,0,0]}}""")
).toDF("column1")
//create a schema
val schema = new StructType()
.add("col1", StringType)
.add("col2", StringType)
.add("col3", StringType)
.add("col4", StringType)
.add("col5", StringType)
.add("col6", StringType)
.add("col7", StringType)
.add("col8", StringType)
/*.add("a9", StringType)
.add("a10", StringType)
.add("a11", StringType)
.add("a11", StringType)*/
//convert to rdd and replace using regex
val df2 = df.rdd.map(_.getString(0))
.map(_.replaceAll("[\"|\\[|\\]|{|}]", "").replace(":", ","))
.map(_.split(","))
.map(x => (x(0), x(1), x(2), x(3), x(4), x(5), x(6), x(7)))
.toDF(schema.fieldNames :_*)
//OR
val rdd = df.rdd.map(_.getString(0))
.map(_.replaceAll("[\"|\\[|\\]|{|}]", "").replace(":", ","))
.map(_.split(","))
.map(x => Row(x(0), x(1), x(2), x(3), x(4), x(5), x(6), x(7)))
val finalDF = spark.sqlContext.createDataFrame(rdd, schema)
df2.show()
//or
finalDF.show()
//both produce the same output
Output:
+----+----+-----+----+----+----+----+----+
|col1|col2|col3 |col4|col5|col6|col7|col8|
+----+----+-----+----+----+----+----+----+
|-1 |-1 | 7420|0 |20 |22 |0 |0 |
|-1 |-1 |1006 |2 |18 |10 |0 |0 |
|-1 |-1 |6414 |0 |17 |11 |0 |0 |
+----+----+-----+----+----+----+----+----+
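The string-replacement step above can be tried outside Spark. The following is a minimal sketch in plain Scala, not the answerer's exact code: the object name `JsonFlattenSketch` and the helper `flatten` are hypothetical, and the extra `trim` is an addition that removes the leading space visible in the first row of the output. It only works under the assumption that keys and values never contain commas, colons, quotes, brackets, or braces.

```scala
object JsonFlattenSketch {
  // Hypothetical helper: strips quotes, brackets and braces, turns ':' into ',',
  // then splits on ',' and trims each field.
  // Assumption: keys and values contain none of those characters themselves.
  def flatten(json: String): Seq[String] =
    json
      .replaceAll("[\"\\[\\]{}]", "") // remove " [ ] { }
      .replace(":", ",")              // key/value separators become field separators
      .split(",")
      .map(_.trim)                    // addition: drop stray whitespace such as " 7420"
      .toSeq

  def main(args: Array[String]): Unit = {
    val rows = Seq(
      """{"-1":{"-1":[ 7420,0,20,22,0,0]}}""",
      """{"-1":{"-1":[1006,2,18,10,0,0]}}""",
      """{"-1":{"-1":[6414,0,17,11,0,0]}}"""
    )
    // Print each flattened row as pipe-separated fields
    rows.map(flatten).foreach(fields => println(fields.mkString("|")))
  }
}
```

Inside Spark, the same function could replace the `replaceAll`/`replace`/`split` chain in the `map` calls. Because it discards the JSON structure, this approach breaks as soon as the nesting or the array length varies between rows.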
Hope this helps!
Answer 1 (score: 0)
You can simply use the built-in from_json function, with the schema defined as StructType(Seq(StructField("-1", StructType(Seq(StructField("-1", ArrayType(IntegerType)))))))
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
val jsonedDF = df.select(from_json(col("column1"), StructType(Seq(StructField("-1", StructType(Seq(StructField("-1", ArrayType(IntegerType)))))))).as("json"))
jsonedDF.show(false)
// +---------------------------------------+
// |json |
// +---------------------------------------+
// |[[WrappedArray(7420, 0, 20, 22, 0, 0)]]|
// |[[WrappedArray(1006, 2, 18, 10, 0, 0)]]|
// |[[WrappedArray(6414, 0, 17, 11, 0, 0)]]|
// +---------------------------------------+
jsonedDF.printSchema()
// root
// |-- json: struct (nullable = true)
// | |-- -1: struct (nullable = true)
// | | |-- -1: array (nullable = true)
// | | | |-- element: integer (containsNull = true)
After that, just select the appropriate columns and use aliases to give them proper names:
jsonedDF.select(
lit("-1").as("col1"),
lit("-1").as("col2"),
col("json.-1.-1")(0).as("col3"),
col("json.-1.-1")(1).as("col4"),
col("json.-1.-1")(2).as("col5"),
col("json.-1.-1")(3).as("col6"),
col("json.-1.-1")(4).as("col7"),
col("json.-1.-1")(5).as("col8")
).show(false)
This should give you the final DataFrame:
+----+----+----+----+----+----+----+----+
|col1|col2|col3|col4|col5|col6|col7|col8|
+----+----+----+----+----+----+----+----+
|-1 |-1 |7420|0 |20 |22 |0 |0 |
|-1 |-1 |1006|2 |18 |10 |0 |0 |
|-1 |-1 |6414|0 |17 |11 |0 |0 |
+----+----+----+----+----+----+----+----+
I used -1 as a literal because those are the key names in the JSON string and they are always the same.
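If writing one line per array element gets tedious, the column list can be generated instead. This is a sketch under stated assumptions rather than part of the original answer: the object name `FlattenWithFromJson`, the `flatten` helper, and the `arrayLength` parameter are hypothetical, and the array is assumed to have a fixed, known length (6 in the sample data). It uses `getField` and `getItem` in place of the dotted `col("json.-1.-1")` path, which should select the same values.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, from_json, lit}
import org.apache.spark.sql.types.{ArrayType, IntegerType, StructField, StructType}

object FlattenWithFromJson {
  // Same schema as in the answer: {"-1": {"-1": [int, ...]}}
  val schema: StructType = StructType(Seq(
    StructField("-1", StructType(Seq(
      StructField("-1", ArrayType(IntegerType)))))))

  // Hypothetical helper; arrayLength is an assumption about the data (6 in the sample).
  def flatten(df: DataFrame, arrayLength: Int): DataFrame = {
    val values = col("json").getField("-1").getField("-1")
    // The two keys are constant in the sample data, so they are emitted as literals.
    val keyCols = Seq(lit("-1").as("col1"), lit("-1").as("col2"))
    // One column per array position, named col3, col4, ...
    val valueCols = (0 until arrayLength).map(i => values.getItem(i).as(s"col${i + 3}"))
    df.select(from_json(col("column1"), schema).as("json"))
      .select(keyCols ++ valueCols: _*)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("flatten-nested-json")
      .getOrCreate()
    import spark.implicits._

    val df = Seq(
      """{"-1":{"-1":[ 7420,0,20,22,0,0]}}""",
      """{"-1":{"-1":[1006,2,18,10,0,0]}}""",
      """{"-1":{"-1":[6414,0,17,11,0,0]}}"""
    ).toDF("column1")

    flatten(df, arrayLength = 6).show(false)
    spark.stop()
  }
}
```

Unlike the string-replacement approach, the array values stay integers here. Rows whose array is shorter than `arrayLength` would get nulls in the trailing columns.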