Converting a nested JSON string to a DataFrame

Asked: 2018-03-20 17:03:30

Tags: scala apache-spark apache-spark-sql

I have a column in a DataFrame that contains nested JSON as a string:

val df = Seq(
  ("""{"-1":{"-1":[ 7420,0,20,22,0,0]}}"""),
  ("""{"-1":{"-1":[1006,2,18,10,0,0]}}"""),
  ("""{"-1":{"-1":[6414,0,17,11,0,0]}}""")
).toDF("column1")


+-------------------------------------+
|                              column1|           
+-------------------------------------+
|{"-1":{"-1":[7420, 0, 20, 22, 0, 0]}}|
|{"-1":{"-1":[1006, 2, 18, 10, 0, 0]}}|
|{"-1":{"-1":[6414, 0, 17, 11, 0, 0]}}|
+-------------------------------------+

I want to get a data frame that looks like this

+----+----+----+----+----+----+----+----+
|col1|col2|col3|col4|col5|col6|col7|col8|
+----+----+----+----+----+----+----+----+
|  -1|  -1|7420|   0|  20|  22|   0|   0|
|  -1|  -1|1006|   2|  18|  10|   0|   0|
|  -1|  -1|6414|   0|  17|  11|   0|   0|
+----+----+----+----+----+----+----+----+

I first applied get_json_object, which gave me:

val df1 = df.select(get_json_object($"column1", "$.-1"))

+------------------------------+
|                       column1|           
+------------------------------+
|{"-1":[7420, 0, 20, 22, 0, 0]}|
|{"-1":[1006, 2, 18, 10, 0, 0]}|
|{"-1":[6414, 0, 17, 11, 0, 0]}|
+------------------------------+

So I lose the first element.

I tried to cast the remaining element into the format I want:

val schema = new StructType()
  .add("-1",
    MapType(
      StringType,
      new StructType()
        .add("a1", StringType)
        .add("a2", StringType)
        .add("a3", StringType)
        .add("a4", StringType)
        .add("a5", StringType)
        .add("a6", StringType)
        .add("a7", StringType)
        .add("a8", StringType)
        .add("a9", StringType)
        .add("a10", StringType)
        .add("a11", StringType)
        .add("a12", StringType)))

df1.select(from_json($"column1", schema))

But it returned a one-column DataFrame where every value was null.
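As an aside: get_json_object also accepts JSONPath array indices, so when the keys are fixed the six values can be pulled out directly, without any schema. A sketch under that assumption (the names `valueCols` and `direct` are illustrative, not from the original post; `spark.implicits._` is assumed to be imported for the `$` syntax):

```scala
import org.apache.spark.sql.functions._

// Indexed JSONPaths like $.-1.-1[0] address individual array elements.
val valueCols = (0 until 6).map(i =>
  get_json_object($"column1", s"$$.-1.-1[$i]").as(s"col${i + 3}"))

// The keys are constant here, so they can be emitted as literals.
val direct = df.select(
  lit("-1").as("col1") +: lit("-1").as("col2") +: valueCols: _*)
```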

2 Answers:

Answer 0 (score: 0)

The JSON data you provided seems to be invalid.

You can convert to an RDD of strings, strip all of the `"[]{}` characters, and replace `:` with `,` so that you get a comma-separated string, then convert it back to a DataFrame as below:

  //data as you provided 
  val df = Seq(
    ("""{"-1":{"-1":[ 7420,0,20,22,0,0]}}"""),
    ("""{"-1":{"-1":[1006,2,18,10,0,0]}}"""),
    ("""{"-1":{"-1":[6414,0,17,11,0,0]}}""")
  ).toDF("column1")

  //create a schema 
  val schema = new StructType()
    .add("col1", StringType)
    .add("col2", StringType)
    .add("col3", StringType)
    .add("col4", StringType)
    .add("col5", StringType)
    .add("col6", StringType)
    .add("col7", StringType)
    .add("col8", StringType)
    /*.add("a9", StringType)
    .add("a10", StringType)
    .add("a11", StringType)
    .add("a11", StringType)*/

  //convert to rdd and replace using regex 
  val df2 = df.rdd.map(_.getString(0))
    .map(_.replaceAll("[\"|\\[|\\]|{|}]", "").replace(":", ","))
    .map(_.split(","))
    .map(x => (x(0), x(1), x(2), x(3), x(4), x(5), x(6), x(7)))
    .toDF(schema.fieldNames :_*)

OR

val rdd = df.rdd.map(_.getString(0))
    .map(_.replaceAll("[\"|\\[|\\]|{|}]", "").replace(":", ","))
    .map(_.split(","))
    .map(x => Row(x(0), x(1), x(2), x(3), x(4), x(5), x(6), x(7)))

  val finalDF = spark.sqlContext.createDataFrame(rdd, schema)

  df2.show()
  //or 
  finalDF.show()
  //will have a same output

Output:

+----+----+-----+----+----+----+----+----+
|col1|col2|col3 |col4|col5|col6|col7|col8|
+----+----+-----+----+----+----+----+----+
|-1  |-1  | 7420|0   |20  |22  |0   |0   |
|-1  |-1  |1006 |2   |18  |10  |0   |0   |
|-1  |-1  |6414 |0   |17  |11  |0   |0   |
+----+----+-----+----+----+----+----+----+
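The cleanup step itself can be sanity-checked on a plain Scala string, without Spark:

```scala
// The character class ["|\[|\]|{|}] removes quotes, brackets and braces;
// the embedded | characters are just extra class members, not alternation.
val raw = """{"-1":{"-1":[ 7420,0,20,22,0,0]}}"""
val cleaned = raw.replaceAll("[\"|\\[|\\]|{|}]", "").replace(":", ",")
val fields = cleaned.split(",")
// fields == Array("-1", "-1", " 7420", "0", "20", "22", "0", "0")
// Note the leading space on " 7420" -- it explains the " 7420" cell in
// the output above; mapping .trim over the fields would remove it.
println(fields.mkString("|"))   // -1|-1| 7420|0|20|22|0|0
```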

Hope this helps!

Answer 1 (score: 0)

You can simply use the from_json built-in function to convert the JSON string into an actual json object, with the schema defined as StructType(Seq(StructField("-1", StructType(Seq(StructField("-1", ArrayType(IntegerType))))))):

import org.apache.spark.sql.functions._
val jsonedDF = df.select(from_json(col("column1"), StructType(Seq(StructField("-1", StructType(Seq(StructField("-1", ArrayType(IntegerType)))))))).as("json"))
jsonedDF.show(false)
//    +---------------------------------------+
//    |json                                   |
//    +---------------------------------------+
//    |[[WrappedArray(7420, 0, 20, 22, 0, 0)]]|
//    |[[WrappedArray(1006, 2, 18, 10, 0, 0)]]|
//    |[[WrappedArray(6414, 0, 17, 11, 0, 0)]]|
//    +---------------------------------------+
jsonedDF.printSchema()
//    root
//    |-- json: struct (nullable = true)
//    |    |-- -1: struct (nullable = true)
//    |    |    |-- -1: array (nullable = true)
//    |    |    |    |-- element: integer (containsNull = true)

After that, simply select the appropriate columns and use aliases to give them proper names:

jsonedDF.select(
  lit("-1").as("col1"),
  lit("-1").as("col2"),
  col("json.-1.-1")(0).as("col3"),
  col("json.-1.-1")(1).as("col4"),
  col("json.-1.-1")(2).as("col5"),
  col("json.-1.-1")(3).as("col6"),
  col("json.-1.-1")(4).as("col7"),
  col("json.-1.-1")(5).as("col8")
).show(false)

which should give you the final dataframe:

+----+----+----+----+----+----+----+----+
|col1|col2|col3|col4|col5|col6|col7|col8|
+----+----+----+----+----+----+----+----+
|-1  |-1  |7420|0   |20  |22  |0   |0   |
|-1  |-1  |1006|2   |18  |10  |0   |0   |
|-1  |-1  |6414|0   |17  |11  |0   |0   |
+----+----+----+----+----+----+----+----+

I used -1 as literals since they are the key names in the JSON string and are always the same.
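If the keys were not always -1, a MapType schema could recover them from the data instead of hard-coding literals. A sketch of that variant (the names `mapSchema` and `exploded` are illustrative, not from the answer), assuming the same two-level map-of-array shape and an imported `spark.implicits._`:

```scala
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

// Map keys stay dynamic instead of being baked into the schema.
val mapSchema = MapType(StringType, MapType(StringType, ArrayType(IntegerType)))

val exploded = df
  .select(explode(from_json($"column1", mapSchema)))    // -> key, value
  .select($"key".as("col1"), explode($"value"))         // -> col1, key, value
  .select(
    Seq($"col1", $"key".as("col2")) ++
      (0 until 6).map(i => $"value"(i).as(s"col${i + 3}")): _*)
```

explode on a map column yields one row per entry with `key` and `value` columns, which is what lets the outer and inner keys become col1 and col2 here.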