Spark Streaming Scala: combining JSONs of different structures to form a DataFrame

Date: 2017-07-14 10:31:28

Tags: json scala apache-spark

I am trying to process JSON strings from Kinesis. The JSON strings can come in a couple of different forms. From Kinesis, I create a DStream:

val kinesisStream = KinesisUtils.createStream(
  ssc, appName, "Kinesis_Stream", "kinesis.ap-southeast-1.amazonaws.com",
  "region", InitialPositionInStream.LATEST, kinesisCheckpointInterval,
  StorageLevel.MEMORY_AND_DISK_2)

val lines = kinesisStream.map(x => new String(x))

lines.foreachRDD((rdd, time) => {

  val sqlContext = SQLContextSingleton.getInstance(rdd.sparkContext)
  import sqlContext.implicits.StringToColumn

  if (rdd.count() > 0) {
    // Process JSONs here
    // JSON strings here would have either one of the formats below
  }
})

The RDD of strings will contain either of these JSON strings. An array:

[
  {
    "data": {
      "ApplicationVersion": "1.0.3 (65)",
      "ProjectId": 30024,
      "TargetId": "4138",
      "Timestamp": 0
    },
    "host": "host1"
  },
  {
    "data": {
      "ApplicationVersion": "1.0.3 (65)",
      "ProjectId": 30025,
      "TargetId": "4139",
      "Timestamp": 0
    },
    "host": "host1"
  }
]

And some JSON strings are single objects, like below:

{
      "ApplicationVersion": "1.0.3 (65)",
      "ProjectId": 30026,
      "TargetId": "4140",
      "Timestamp": 0
}

I would like to be able to extract the objects from the "data" key if it is the first type of JSON string, combine them with the second type of JSON, and form an RDD/DataFrame. How can I achieve this?

Ultimately I want my DataFrame to look like this:

+------------------+---------+--------+---------+
|ApplicationVersion|ProjectId|TargetId|Timestamp|
+------------------+---------+--------+---------+
|        1.0.3 (65)|    30024|    4138|        0|
|        1.0.3 (65)|    30025|    4139|        0|
|        1.0.3 (65)|    30026|    4140|        0|
+------------------+---------+--------+---------+

Sorry, I am new to Scala and Spark. I have been looking through existing examples but unfortunately could not find a solution.

Many thanks in advance.

2 Answers:

Answer 0 (score: 0)

You can use a union after selecting the data.* columns from the first DataFrame:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()
val sc = spark.sparkContext

// Assuming you store your JSONs in two separate strings `json1` and `json2`
val df1 = spark.read.json(sc.parallelize(Seq(json1)))
val df2 = spark.read.json(sc.parallelize(Seq(json2)))

import spark.implicits._
df1.select($"data.*") // Select only the data columns from the first DataFrame
  .union(df2)         // Union the two DataFrames, as they now share the same structure
  .show()

EDIT [alternative solution link]

After the question was edited, I understood that you need some kind of fallback mechanism when parsing the JSON files. There are more ways to do this with any JSON parsing library - there is a nice solution here with Play that I think already explains how to solve this problem in an elegant way.
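
For illustration, here is a minimal sketch of that fallback idea using play-json (an assumed dependency, not necessarily the linked solution; Data and parseRecords are hypothetical names, not from the question):

// A minimal sketch, assuming play-json is on the classpath.
// `Data` models the flat record; `parseRecords` is a hypothetical helper
// that falls back between the wrapped-array form and the single-object form.
import play.api.libs.json._

case class Data(ApplicationVersion: String, ProjectId: Long, TargetId: String, Timestamp: Long)
implicit val dataReads: Reads[Data] = Json.reads[Data]

def parseRecords(raw: String): Seq[Data] = Json.parse(raw) match {
  case JsArray(items) => // first form: an array of {"data": {...}, "host": ...}
    items.flatMap(item => (item \ "data").validate[Data].asOpt).toSeq
  case obj: JsObject  => // second form: a single flat object
    obj.validate[Data].asOpt.toSeq
  case _ => Seq.empty
}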

Once you have an RDD[Data], where Data is your "variant" type, you can convert it to a DataFrame with rdd.toDF().
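
For example, a hypothetical continuation of the sketch above (rawJsons stands in for the strings pulled off the stream; spark and sc are an active SparkSession and its SparkContext):

// Illustrative only: `Data` and `parseRecords` come from the sketch above.
import spark.implicits._

val rawJsons: Seq[String] = Seq(json1, json2) // the two sample payloads
val recordsRDD = sc.parallelize(rawJsons).flatMap(parseRecords) // RDD[Data]
recordsRDD.toDF().show()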

Hope this helps.

Answer 1 (score: 0)

This example uses json4s:

import org.json4s._
import org.json4s.jackson.JsonMethods._

implicit val formats = DefaultFormats

// json4s coerces the numeric ProjectId in the payloads into a String here
case class jsonschema(ApplicationVersion: String, ProjectId: String, TargetId: String, Timestamp: Int)

val string1 = """
[ {
  "data" : {
    "ApplicationVersion" : "1.0.3 (65)",
    "ProjectId" : 30024,
    "TargetId" : "4138",
    "Timestamp" : 0
  },
  "host" : "host1"
}, {
  "data" : {
    "ApplicationVersion" : "1.0.3 (65)",
    "ProjectId" : 30025,
    "TargetId" : "4139",
    "Timestamp" : 0
  },
  "host" : "host1"
} ]

"""

val string2 = """
[ {
  "ApplicationVersion" : "1.0.3 (65)",
  "ProjectId" : 30025,
  "TargetId" : "4140",
  "Timestamp" : 0
}, {
  "ApplicationVersion" : "1.0.3 (65)",
  "ProjectId" : 30025,
  "TargetId" : "4141",
  "Timestamp" : 0
} ]
"""

val json1 = (parse(string1) \ "data").extract[List[jsonschema]]

val json2 = parse(string2).extract[List[jsonschema]]

val records = json1.union(json2) // a plain Scala List, not an RDD

val df = sqlContext.createDataFrame(records)

df.show


+------------------+---------+--------+---------+
|ApplicationVersion|ProjectId|TargetId|Timestamp|
+------------------+---------+--------+---------+
|        1.0.3 (65)|    30024|    4138|        0|
|        1.0.3 (65)|    30025|    4139|        0|
|        1.0.3 (65)|    30025|    4140|        0|
|        1.0.3 (65)|    30025|    4141|        0|
+------------------+---------+--------+---------+
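
For completeness: the question's second payload is a single object rather than an array, so the parse needs to dispatch on the JSON shape. Below is a hedged sketch of wiring this into the question's DStream, reusing the jsonschema case class and json4s imports above (handleJson is an illustrative name, not from the answer):

// Illustrative only: dispatch on the JSON shape, then build the DataFrame
// inside foreachRDD as in the question.
def handleJson(raw: String): List[jsonschema] = {
  implicit val formats = DefaultFormats
  parse(raw) match {
    case arr: JArray  => (arr \ "data").extract[List[jsonschema]] // wrapped-array form
    case obj: JObject => List(obj.extract[jsonschema])            // single flat object
    case _            => Nil
  }
}

lines.foreachRDD { (rdd, time) =>
  if (rdd.count() > 0) {
    val sqlContext = SQLContextSingleton.getInstance(rdd.sparkContext)
    val df = sqlContext.createDataFrame(rdd.flatMap(handleJson))
    df.show()
  }
}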