Parse to JSON using Spark

Date: 2018-07-17 09:55:51

Tags: json scala apache-spark

I have retrieved a table from SQL Server that contains 3 million records.

The first 10 records:

+---------+-------------+----------+
|ACCOUNTNO|VEHICLENUMBER|CUSTOMERID|
+---------+-------------+----------+
| 10003014|    MH43AJ411|  20000000|
| 10003014|    MH43AJ411|  20000001|
| 10003015|   MH12GZ3392|  20000002|
| 10003016|    GJ15Z8173|  20000003|
| 10003018|    MH05AM902|  20000004|
| 10003019|   GJ15CD7657|  20001866|
| 10003019|   MH02BY7774|  20000005|
| 10003019|   MH02DG7774|  20000933|
| 10003019|   GJ15CA7387|  20001865|
| 10003019|   GJ15CB9601|  20001557|
+---------+-------------+----------+
only showing top 10 rows

Here ACCOUNTNO is unique; the same ACCOUNTNO may have more than one VEHICLENUMBER, and for each vehicle there may be a unique CUSTOMERID with respect to that VEHICLENUMBER.

I want to export it in JSON format.

Here is the code I wrote to produce the output:

package com.issuer.pack2.spark

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql._

object sqltojson {

  def main(args: Array[String]) {
    System.setProperty("hadoop.home.dir", "C:/winutil/")

    val conf = new SparkConf().setAppName("SQLtoJSON").setMaster("local[*]")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    // JDBC connection string and source table in SQL Server
    val jdbcSqlConnStr = "jdbc:sqlserver://192.168.70.88;databaseName=ISSUER;user=bhaskar;password=welcome123;"
    val jdbcDbTable = "[HISTORY].[TP_CUSTOMER_PREPAIDACCOUNTS]"

    // Load the table into a DataFrame over JDBC
    val jdbcDF = sqlContext.read.format("jdbc").options(Map("url" -> jdbcSqlConnStr, "dbtable" -> jdbcDbTable)).load()
    // jdbcDF.show(10)

    jdbcDF.registerTempTable("tp_customer_account")
    val res01 = sqlContext.sql("SELECT ACCOUNTNO, VEHICLENUMBER, CUSTOMERID FROM tp_customer_account GROUP BY ACCOUNTNO, VEHICLENUMBER, CUSTOMERID ORDER BY ACCOUNTNO")

    // res01.show(10)
    res01.coalesce(1).write.json("D:/res01.json")
  }
}

The output I get:

{"ACCOUNTNO":10003014,"VEHICLENUMBER":"MH43AJ411","CUSTOMERID":20000001}
{"ACCOUNTNO":10003014,"VEHICLENUMBER":"MH43AJ411","CUSTOMERID":20000000}
{"ACCOUNTNO":10003015,"VEHICLENUMBER":"MH12GZ3392","CUSTOMERID":20000002}
{"ACCOUNTNO":10003016,"VEHICLENUMBER":"GJ15Z8173","CUSTOMERID":20000003}
{"ACCOUNTNO":10003018,"VEHICLENUMBER":"MH05AM902","CUSTOMERID":20000004}
{"ACCOUNTNO":10003019,"VEHICLENUMBER":"MH02BY7774","CUSTOMERID":20000005}
{"ACCOUNTNO":10003019,"VEHICLENUMBER":"GJ15CA7387","CUSTOMERID":20001865}
{"ACCOUNTNO":10003019,"VEHICLENUMBER":"GJ15CD7657","CUSTOMERID":20001866}
{"ACCOUNTNO":10003019,"VEHICLENUMBER":"MH02DG7774","CUSTOMERID":20000933}
{"ACCOUNTNO":10003019,"VEHICLENUMBER":"GJ15CB9601","CUSTOMERID":20001557}
{"ACCOUNTNO":10003019,"VEHICLENUMBER":"GJ15CD7387","CUSTOMERID":20029961}
{"ACCOUNTNO":10003019,"VEHICLENUMBER":"GJ15CF7747","CUSTOMERID":20009020}
{"ACCOUNTNO":10003019,"VEHICLENUMBER":"GJ15CB727","CUSTOMERID":20000008}
{"ACCOUNTNO":10003019,"VEHICLENUMBER":"GJ15CA7837","CUSTOMERID":20001223}
{"ACCOUNTNO":10003019,"VEHICLENUMBER":"GJ15CD7477","CUSTOMERID":20001690}
{"ACCOUNTNO":10003020,"VEHICLENUMBER":"MH01AX5658","CUSTOMERID":20000006}
{"ACCOUNTNO":10003021,"VEHICLENUMBER":"GJ15AD727","CUSTOMERID":20000007}
{"ACCOUNTNO":10003023,"VEHICLENUMBER":"GU15PP7567","CUSTOMERID":20000009}
{"ACCOUNTNO":10003024,"VEHICLENUMBER":"GJ15CA7567","CUSTOMERID":20000010}
{"ACCOUNTNO":10003025,"VEHICLENUMBER":"GJ5JB9312","CUSTOMERID":20000011}

But I want the JSON output in the format below. I wrote this JSON by hand for the first three records of the table above (my design may be wrong; I want ACCOUNTNO to be unique).

{
    "ACCOUNTNO":10003014,
    "VEHICLE": [
        { "VEHICLENUMBER":"MH43AJ411", "CUSTOMERID":20000000},
        { "VEHICLENUMBER":"MH43AJ411", "CUSTOMERID":20000001}
    ],
    "ACCOUNTNO":10003015,
    "VEHICLE": [
        { "VEHICLENUMBER":"MH12GZ3392", "CUSTOMERID":20000002}
    ]
}

So, how can I produce this JSON format with Spark code? I need help. Thanks in advance.

1 Answer:

Answer 0 (score: 1)

Scala spark-sql

You can do the following (it is recommended to use createOrReplaceTempView instead of registerTempTable, since registerTempTable is deprecated):

jdbcDF.createOrReplaceTempView("tp_customer_account")
val res01 = sqlContext.sql("SELECT ACCOUNTNO, collect_list(struct(`VEHICLENUMBER`, `CUSTOMERID`)) as VEHICLE FROM tp_customer_account GROUP BY ACCOUNTNO ORDER BY ACCOUNTNO")

res01.coalesce(1).write.json("D:/res01.json")

Here collect_list gathers the (VEHICLENUMBER, CUSTOMERID) structs of each ACCOUNTNO group into an array, which write.json then serializes as the nested VEHICLE array.

You should get the desired output as:

{"ACCOUNTNO":"10003014","VEHICLE":[{"VEHICLENUMBER":"MH43AJ411","CUSTOMERID":"20000000"},{"VEHICLENUMBER":"MH43AJ411","CUSTOMERID":"20000001"}]}
{"ACCOUNTNO":"10003015","VEHICLE":[{"VEHICLENUMBER":"MH12GZ3392","CUSTOMERID":"20000002"}]}
{"ACCOUNTNO":"10003016","VEHICLE":[{"VEHICLENUMBER":"GJ15Z8173","CUSTOMERID":"20000003"}]}
{"ACCOUNTNO":"10003018","VEHICLE":[{"VEHICLENUMBER":"MH05AM902","CUSTOMERID":"20000004"}]}
{"ACCOUNTNO":"10003019","VEHICLE":[{"VEHICLENUMBER":"GJ15CD7657","CUSTOMERID":"20001866"},{"VEHICLENUMBER":"MH02BY7774","CUSTOMERID":"20000005"},{"VEHICLENUMBER":"MH02DG7774","CUSTOMERID":"20000933"},{"VEHICLENUMBER":"GJ15CA7387","CUSTOMERID":"20001865"},{"VEHICLENUMBER":"GJ15CB9601","CUSTOMERID":"20001557"}]}

Scala spark API

Using the Spark Scala API, you can do the same, as sketched below.

You should get the same answer as with the SQL approach.
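To sanity-check the written output, you can read it back with Spark's JSON reader, which infers the nested schema (an illustrative check):

val check = sqlContext.read.json("D:/res01.json")
check.printSchema()
check.show(10, truncate = false)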

I hope this answer helps.