I am trying to process JSON strings coming from Kinesis. The JSON strings can arrive in a few different forms. From Kinesis, I create a DStream:
val kinesisStream = KinesisUtils.createStream(
  ssc, appName, "Kinesis_Stream", "kinesis.ap-southeast-1.amazonaws.com",
  "region", InitialPositionInStream.LATEST, kinesisCheckpointInterval, StorageLevel.MEMORY_AND_DISK_2)

val lines = kinesisStream.map(x => new String(x))

lines.foreachRDD((rdd, time) => {
  val sqlContext = SQLContextSingleton.getInstance(rdd.sparkContext)
  import sqlContext.implicits.StringToColumn

  if (rdd.count() > 0) {
    // Process the JSON strings here.
    // Each string has one of the two formats shown below.
  }
})
The strings in the RDD will be in either of these JSON formats. An array:
[
{
"data": {
"ApplicationVersion": "1.0.3 (65)",
"ProjectId": 30024,
"TargetId": "4138",
"Timestamp": 0
},
"host": "host1"
},
{
"data": {
"ApplicationVersion": "1.0.3 (65)",
"ProjectId": 30025,
"TargetId": "4139",
"Timestamp": 0
},
"host": "host1"
}
]
And some JSON strings are a single object, like this:
{
"ApplicationVersion": "1.0.3 (65)",
"ProjectId": 30026,
"TargetId": "4140",
"Timestamp": 0
}
I want to extract the objects under the "data" key when a string is of the first form, combine them with JSON of the second form, and build an RDD/DataFrame from the result. How can I achieve this?
Ultimately I want my DataFrame to look like this:
+------------------+---------+--------+---------+
|ApplicationVersion|ProjectId|TargetId|Timestamp|
+------------------+---------+--------+---------+
| 1.0.3 (65)| 30024| 4138| 0|
| 1.0.3 (65)| 30025| 4139| 0|
| 1.0.3 (65)| 30026| 4140| 0|
+------------------+---------+--------+---------+
Sorry, I am new to Scala and Spark. I have been looking through existing examples but unfortunately could not find a solution.
Many thanks in advance.
Answer 0 (score: 0)
After selecting only the data.* columns from the first Dataframe, you can use a union:
val spark = SparkSession.builder().master("local[*]").getOrCreate()
val sc = spark.sparkContext

// Assuming you store your JSONs in two separate strings `json1` and `json2`
val df1 = spark.read.json(sc.parallelize(Seq(json1)))
val df2 = spark.read.json(sc.parallelize(Seq(json2)))

import spark.implicits._

df1.select($"data.*") // Select only the data columns from the first Dataframe
  .union(df2)         // Union the two Dataframes, since they now have the same structure
  .show()
EDIT [link to an alternative solution]
After the question was edited, I understand that you need some kind of fallback mechanism when parsing the JSON strings. There are more ways to do this with any JSON parsing library - there is a nice solution here with Play that I think already explains how to solve this elegantly.
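As a minimal sketch of such a fallback, assuming json4s (used in the other answer here) and a hypothetical `Event` case class: for each element we look for a "data" key and unwrap it if present, otherwise we take the object as-is, so both formats end up in the same list.

```scala
import org.json4s._
import org.json4s.jackson.JsonMethods._

implicit val formats: Formats = DefaultFormats

// Hypothetical case class for the flattened record
case class Event(ApplicationVersion: String, ProjectId: Int, TargetId: String, Timestamp: Int)

// If the object is wrapped in a "data" key, unwrap it; otherwise use it directly
def unwrap(v: JValue): JValue = v \ "data" match {
  case JNothing => v
  case data     => data
}

// Handles both a JSON array of objects and a single bare object
def parseEvents(raw: String): List[Event] = parse(raw) match {
  case JArray(items) => items.map(i => unwrap(i).extract[Event])
  case obj           => List(unwrap(obj).extract[Event])
}
```

The resulting `List[Event]` can then be turned into a DataFrame as described below.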
If you have an RDD[Data], where Data is your "variant" type, you can convert it to a Dataframe using rdd.toDF().
Hope that helps.
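For completeness, a sketch of that rdd.toDF() conversion, assuming a hypothetical `Data` case class and a local SparkSession (the field names become the column names):

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical case class standing in for the "variant" Data type
case class Data(ApplicationVersion: String, ProjectId: Long, TargetId: String, Timestamp: Long)

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

val rdd = spark.sparkContext.parallelize(Seq(
  Data("1.0.3 (65)", 30024L, "4138", 0L)))

// toDF derives the schema from the case class fields
val df = rdd.toDF()
df.show()
```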
Answer 1 (score: 0)
This example uses json4s:
import org.json4s._
import org.json4s.jackson.JsonMethods._

implicit val format = DefaultFormats

case class jsonschema(ApplicationVersion: String, ProjectId: String, TargetId: String, Timestamp: Int)
val string1 = """
[ {
"data" : {
"ApplicationVersion" : "1.0.3 (65)",
"ProjectId" : 30024,
"TargetId" : "4138",
"Timestamp" : 0
},
"host" : "host1"
}, {
"data" : {
"ApplicationVersion" : "1.0.3 (65)",
"ProjectId" : 30025,
"TargetId" : "4139",
"Timestamp" : 0
},
"host" : "host1"
} ]
"""
val string2 = """
[ {
"ApplicationVersion" : "1.0.3 (65)",
"ProjectId" : 30025,
"TargetId" : "4140",
"Timestamp" : 0
}, {
"ApplicationVersion" : "1.0.3 (65)",
"ProjectId" : 30025,
"TargetId" : "4141",
"Timestamp" : 0
} ]
"""
val json1 = (parse(string1) \ "data").extract[List[jsonschema]]
val json2 = parse(string2).extract[List[jsonschema]]
// Note: json1 and json2 are plain Lists here, not RDDs;
// createDataFrame accepts a Seq of case class instances directly
val combined = json1.union(json2)
val df = sqlContext.createDataFrame(combined)
df.show
+------------------+---------+--------+---------+
|ApplicationVersion|ProjectId|TargetId|Timestamp|
+------------------+---------+--------+---------+
| 1.0.3 (65)| 30024| 4138| 0|
| 1.0.3 (65)| 30025| 4139| 0|
| 1.0.3 (65)| 30025| 4140| 0|
| 1.0.3 (65)| 30025| 4141| 0|
+------------------+---------+--------+---------+