Flatten JSON into a tabular structure using Spark-Scala RDDs only

Date: 2017-05-28 07:00:06

Tags: scala apache-spark rdd

I have nested JSON and want the output in a tabular structure. I am able to parse the JSON values individually, but I am having problems tabularizing them. I can do this easily with a DataFrame, but I want an "RDD only" solution. Any help is much appreciated.

Input JSON:

  { "level":{"productReference":{  

     "prodID":"1234",

     "unitOfMeasure":"EA"

  },

  "states":[  
     {  
        "state":"SELL",
        "effectiveDateTime":"2015-10-09T00:55:23.6345Z",
        "stockQuantity":{  
           "quantity":1400.0,
           "stockKeepingLevel":"A"
        }
     },
     {  
        "state":"HELD",
        "effectiveDateTime":"2015-10-09T00:55:23.6345Z",
        "stockQuantity":{  
           "quantity":800.0,
           "stockKeepingLevel":"B"
        }
     }
  ] }}

Expected output:

(image in the original post showing the expected flat table: one row per state)

I tried the Spark code below, but I get output like the following, and the Row() object is not able to parse it:

079562193,EA,List(SELLABLE,HELD),List(2015-10-09T00:55:23.6345Z,2015-10-09T00:55:23.6345Z),List(1400.0,800.0),List(SINGLE, SINGLE)

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.types._
import org.json4s._
import org.json4s.jackson.JsonMethods._

def main(Args : Array[String]): Unit = {

  val conf = new SparkConf().setAppName("JSON Read and Write using Spark RDD").setMaster("local[1]")
  val sc = new SparkContext(conf)
  val sqlContext = new SQLContext(sc)

  val salesSchema = StructType(Array(
    StructField("prodID", StringType, true),
    StructField("unitOfMeasure", StringType, true),
    StructField("state", StringType, true),
    StructField("effectiveDateTime", StringType, true),
    StructField("quantity", StringType, true),
    StructField("stockKeepingLevel", StringType, true)
  ))

  val ReadAlljsonMessageInFile_RDD = sc.textFile("product_rdd.json")

  val x = ReadAlljsonMessageInFile_RDD.map(eachJsonMessages => {

        parse(eachJsonMessages)

      }).map(insideEachJson=>{
        implicit  val formats = org.json4s.DefaultFormats

       val prodID = (insideEachJson\ "level" \"productReference" \"TPNB").extract[String].toString
       val unitOfMeasure = (insideEachJson\ "level" \ "productReference" \"unitOfMeasure").extract[String].toString

       val state= (insideEachJson \ "level" \"states").extract[List[JValue]].
          map(x=>(x\"state").extract[String]).toString()
       val effectiveDateTime= (insideEachJson \ "level" \"states").extract[List[JValue]].
         map(x=>(x\"effectiveDateTime").extract[String]).toString
      val quantity= (insideEachJson \ "level" \"states").extract[List[JValue]].
         map(x=>(x\"stockQuantity").extract[JValue]).map(x=>(x\"quantity").extract[Double]).
         toString
      val stockKeepingLevel= (insideEachJson \ "level" \"states").extract[List[JValue]].
         map(x=>(x\"stockQuantity").extract[JValue]).map(x=>(x\"stockKeepingLevel").extract[String]).
       toString

      //Row(prodID,unitOfMeasure,state,effectiveDateTime,quantity,stockKeepingLevel)

    println(prodID,unitOfMeasure,state,effectiveDateTime,quantity,stockKeepingLevel)

      }).collect()

    //  sqlContext.createDataFrame(x,salesSchema).show(truncate = false)

}

3 Answers:

Answer 0 (score: 4):

Below is the "DATAFRAME" solution I developed. I am still looking for a complete "RDD only" solution.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.functions.explode

def main (Args : Array[String]):Unit = {

    val conf = new SparkConf().setAppName("JSON Read and Write using Spark DataFrame few more options").setMaster("local[1]")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)

    import sqlContext.implicits._   // enables the $"..." column syntax

    val sourceJsonDF = sqlContext.read.json("product.json")

    val jsonFlatDF_level = sourceJsonDF
      .withColumn("explode_states", explode($"level.states"))
      .withColumn("explode_link", explode($"level._link"))
      .select($"level.productReference.TPNB".as("TPNB"),
        $"level.productReference.unitOfMeasure".as("level_unitOfMeasure"),
        $"level.locationReference.location".as("level_location"),
        $"level.locationReference.type".as("level_type"),
        $"explode_states.state".as("level_state"),
        $"explode_states.effectiveDateTime".as("level_effectiveDateTime"),
        $"explode_states.stockQuantity.quantity".as("level_quantity"),
        $"explode_states.stockQuantity.stockKeepingLevel".as("level_stockKeepingLevel"),
        $"explode_link.rel".as("level_rel"),
        $"explode_link.href".as("level_href"),
        $"explode_link.method".as("level_method"))
    jsonFlatDF_level.show()

  }

Answer 1 (score: 2):

DataFrame and Dataset are much more optimized than RDD, and there are a lot of options to try with them to reach the solution we want.

In my opinion, DataFrame was developed to let developers easily view data in a tabular form and easily implement logic on it. So I always suggest users go with DataFrame or Dataset.

Long story short, I am posting a solution below that uses a DataFrame. Once you have a DataFrame, switching to an RDD is very easy.

The solution you need is below (you will have to find a way to read the JSON file yourself, since here it is done with the JSON string below: that is your task :), good luck). One possible way to read the file is sketched at the end of this answer.

import org.apache.spark.sql.functions._
val json = """  { "level":{"productReference":{

                  "prodID":"1234",

                  "unitOfMeasure":"EA"

               },

               "states":[
                  {
                     "state":"SELL",
                     "effectiveDateTime":"2015-10-09T00:55:23.6345Z",
                     "stockQuantity":{
                        "quantity":1400.0,
                        "stockKeepingLevel":"A"
                     }
                  },
                  {
                     "state":"HELD",
                     "effectiveDateTime":"2015-10-09T00:55:23.6345Z",
                     "stockQuantity":{
                        "quantity":800.0,
                        "stockKeepingLevel":"B"
                     }
                  }
               ] }}"""

val rddJson = sparkContext.parallelize(Seq(json))
var df = sqlContext.read.json(rddJson)
df = df.withColumn("prodID", df("level.productReference.prodID"))
  .withColumn("unitOfMeasure", df("level.productReference.unitOfMeasure"))
  .withColumn("states", explode(df("level.states")))
  .drop("level")
df = df.withColumn("state", df("states.state"))
  .withColumn("effectiveDateTime", df("states.effectiveDateTime"))
  .withColumn("quantity", df("states.stockQuantity.quantity"))
  .withColumn("stockKeepingLevel", df("states.stockQuantity.stockKeepingLevel"))
  .drop("states")
df.show(false)

This will give the following output:

+------+-------------+-----+-------------------------+--------+-----------------+
|prodID|unitOfMeasure|state|effectiveDateTime        |quantity|stockKeepingLevel|
+------+-------------+-----+-------------------------+--------+-----------------+
|1234  |EA           |SELL |2015-10-09T00:55:23.6345Z|1400.0  |A                |
|1234  |EA           |HELD |2015-10-09T00:55:23.6345Z|800.0   |B                |
+------+-------------+-----+-------------------------+--------+-----------------+

Now, you wanted your output as an RDD. Converting the DataFrame to an RDD is just a matter of calling .rdd:

df.rdd.foreach(println)

which will produce the following output:

[1234,EA,SELL,2015-10-09T00:55:23.6345Z,1400.0,A]
[1234,EA,HELD,2015-10-09T00:55:23.6345Z,800.0,B]
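
As for the file-reading task left open above, here is a minimal sketch (an illustration under stated assumptions, not code from the original answer): it assumes the multi-line JSON sits in a single file named product.json, and it reuses the same sparkContext and sqlContext as above. sc.wholeTextFiles returns one (path, content) pair per file, so the whole document arrives as a single string that sqlContext.read.json can parse.

// wholeTextFiles yields (filePath, fileContent) pairs, one element per file,
// so a JSON document spread across many lines stays in one piece
val wholeJsonRDD = sparkContext.wholeTextFiles("product.json").map { case (_, content) => content }
val dfFromFile = sqlContext.read.json(wholeJsonRDD)
dfFromFile.show(false)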

I hope this is helpful.

Answer 2 (score: 0):

Here are two versions of a solution to your question.

Version 1:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.types._
import org.json4s._
import org.json4s.jackson.JsonMethods._

def main(Args : Array[String]): Unit = {

  val conf = new SparkConf().setAppName("JSON Read and Write using Spark RDD").setMaster("local[1]")
  val sc = new SparkContext(conf)
  val sqlContext = new SQLContext(sc)

  val salesSchema = StructType(Array(
    StructField("prodID", StringType, true),
    StructField("unitOfMeasure", StringType, true),
    StructField("state", StringType, true),
    StructField("effectiveDateTime", StringType, true),
    StructField("quantity", StringType, true),
    StructField("stockKeepingLevel", StringType, true)
  ))

  // note: sc.textFile reads line by line, so the whole JSON record must sit on a single line
  val ReadAlljsonMessageInFile_RDD = sc.textFile("product_rdd.json")

  val x = ReadAlljsonMessageInFile_RDD.map(eachJsonMessages => {

    parse(eachJsonMessages)

  }).map(insideEachJson=>{
    implicit  val formats = org.json4s.DefaultFormats

   val prodID = (insideEachJson\ "level" \"productReference" \"prodID").extract[String].toString
   val unitOfMeasure = (insideEachJson\ "level" \ "productReference" \"unitOfMeasure").extract[String].toString

   val state= (insideEachJson \ "level" \"states").extract[List[JValue]].
      map(x=>(x\"state").extract[String]).toString()
   val effectiveDateTime= (insideEachJson \ "level" \"states").extract[List[JValue]].
     map(x=>(x\"effectiveDateTime").extract[String]).toString
  val quantity= (insideEachJson \ "level" \"states").extract[List[JValue]].
     map(x=>(x\"stockQuantity").extract[JValue]).map(x=>(x\"quantity").extract[Double]).
     toString
  val stockKeepingLevel= (insideEachJson \ "level" \"states").extract[List[JValue]].
     map(x=>(x\"stockQuantity").extract[JValue]).map(x=>(x\"stockKeepingLevel").extract[String]).
   toString

  Row(prodID,unitOfMeasure,state,effectiveDateTime,quantity,stockKeepingLevel)

  })

    sqlContext.createDataFrame(x,salesSchema).show(truncate = false)

}

This will give you the following output:

+------+-------------+----------------+----------------------------------------------------------+-------------------+-----------------+
|prodID|unitOfMeasure|state           |effectiveDateTime                                         |quantity           |stockKeepingLevel|
+------+-------------+----------------+----------------------------------------------------------+-------------------+-----------------+
|1234  |EA           |List(SELL, HELD)|List(2015-10-09T00:55:23.6345Z, 2015-10-09T00:55:23.6345Z)|List(1400.0, 800.0)|List(A, B)       |
+------+-------------+----------------+----------------------------------------------------------+-------------------+-----------------+

Version 2:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.types._
import org.json4s._
import org.json4s.jackson.JsonMethods._

def main(Args : Array[String]): Unit = {

  val conf = new SparkConf().setAppName("JSON Read and Write using Spark RDD").setMaster("local[1]")
  val sc = new SparkContext(conf)
  val sqlContext = new SQLContext(sc)

  val salesSchema = StructType(Array(
    StructField("prodID", StringType, true),
    StructField("unitOfMeasure", StringType, true),
    StructField("state", ArrayType(StringType, true), true),
    StructField("effectiveDateTime", ArrayType(StringType, true), true),
    StructField("quantity", ArrayType(DoubleType, true), true),
    StructField("stockKeepingLevel", ArrayType(StringType, true), true)
  ))

  val ReadAlljsonMessageInFile_RDD = sc.textFile("product_rdd.json")    

  val x = ReadAlljsonMessageInFile_RDD.map(eachJsonMessages => {

    parse(eachJsonMessages)

  }).map(insideEachJson=>{
    implicit  val formats = org.json4s.DefaultFormats

   val prodID = (insideEachJson\ "level" \"productReference" \"prodID").extract[String].toString
   val unitOfMeasure = (insideEachJson\ "level" \ "productReference" \"unitOfMeasure").extract[String].toString

   val state= (insideEachJson \ "level" \"states").extract[List[JValue]].
      map(x=>(x\"state").extract[String])
   val effectiveDateTime= (insideEachJson \ "level" \"states").extract[List[JValue]].
     map(x=>(x\"effectiveDateTime").extract[String])
  val quantity= (insideEachJson \ "level" \"states").extract[List[JValue]].
     map(x=>(x\"stockQuantity").extract[JValue]).map(x=>(x\"quantity").extract[Double])
  val stockKeepingLevel= (insideEachJson \ "level" \"states").extract[List[JValue]].
     map(x=>(x\"stockQuantity").extract[JValue]).map(x=>(x\"stockKeepingLevel").extract[String])

  Row(prodID,unitOfMeasure,state,effectiveDateTime,quantity,stockKeepingLevel)

  })


    sqlContext.createDataFrame(x,salesSchema).show(truncate = false)

}

This will give you the following output:

+------+-------------+------------+------------------------------------------------------+---------------+-----------------+
|prodID|unitOfMeasure|state       |effectiveDateTime                                     |quantity       |stockKeepingLevel|
+------+-------------+------------+------------------------------------------------------+---------------+-----------------+
|1234  |EA           |[SELL, HELD]|[2015-10-09T00:55:23.6345Z, 2015-10-09T00:55:23.6345Z]|[1400.0, 800.0]|[A, B]           |
+------+-------------+------------+------------------------------------------------------+---------------+-----------------+

The difference between Version 1 and Version 2 is the schema. In Version 1 you cast every column to a String, whereas in Version 2 the per-state columns are cast to Arrays.
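
If what you ultimately want is the fully flattened table (one row per state, as in the DataFrame answers) while keeping the work in plain RDD operations, here is a minimal sketch along the lines of Version 1. It is an illustration, not code from the original thread, and it assumes the same ReadAlljsonMessageInFile_RDD, the all-String salesSchema from Version 1, and the json4s imports shown above. Instead of collecting the per-state values into Lists, flatMap emits one Row per element of the states array:

val flatRows = ReadAlljsonMessageInFile_RDD.map(eachJsonMessage => parse(eachJsonMessage))
  .flatMap(insideEachJson => {
    implicit val formats = org.json4s.DefaultFormats

    val prodID = (insideEachJson \ "level" \ "productReference" \ "prodID").extract[String]
    val unitOfMeasure = (insideEachJson \ "level" \ "productReference" \ "unitOfMeasure").extract[String]

    // emit one Row per state instead of one Row holding Lists
    (insideEachJson \ "level" \ "states").extract[List[JValue]].map(s =>
      Row(prodID,
        unitOfMeasure,
        (s \ "state").extract[String],
        (s \ "effectiveDateTime").extract[String],
        (s \ "stockQuantity" \ "quantity").extract[Double].toString,
        (s \ "stockQuantity" \ "stockKeepingLevel").extract[String]))
  })

sqlContext.createDataFrame(flatRows, salesSchema).show(truncate = false)

With the sample JSON this prints the same two-row table as the DataFrame solutions (1234 / EA / SELL / ... / 1400.0 / A and 1234 / EA / HELD / ... / 800.0 / B); everything up to the final createDataFrame stays in RDD operations, which is used here only to display the result in tabular form.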