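I have a requirement: whenever a new record arrives, I need to compute the count of specific values for the current day.
The input record looks like this: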
{"Floor_Id" : "Shop Floor 1",
"HaltRecord" : {
"HaltReason" : "Danahydraulic Error",
"Severity" : "Low",
"FaultErrorCategory" : "Docked",
"NonFaultErrorCategory" : null
},
"Description" : "Forklift",
"Category" : {
"Type" : "Halt",
"End_time" : NumberLong(2018-02-13T12:00:01),
"Start_time" : NumberLong(2018-02-13T12:00:00)
},
"Asset_Id" : 123,
"isError" : "y",
"Timestamp": 2018-02-13T12:00:01}
The output response should look like this:
{
"Floor_Id": "Galileo_001",
"Error_Category": [
{
"Category": "Operator Error",
"DataPoints":
{
"NumberOfErrors": 20,
"Date": 2018-02-13
}
},
{
"Category": "Dana Hydraulic Error",
"DataPoints": {
"NumberOfErrors": 15,
"Date": 2018-02-13
}
}
]
}
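For reference, the response shape above could be modelled roughly as follows. This is only a sketch: the case class and method names (`DataPoints`, `ErrorCategory`, `FloorErrors`, `ResponseBuilder.build`) are hypothetical, the fields use camelCase rather than the JSON key names, and ordering the categories by descending count is an assumption based on the sample.

```scala
// Hypothetical model of the desired response; names are illustrative only.
final case class DataPoints(numberOfErrors: Long, date: String)
final case class ErrorCategory(category: String, dataPoints: DataPoints)
final case class FloorErrors(floorId: String, errorCategory: Seq[ErrorCategory])

object ResponseBuilder {
  // Builds the response for one floor and one day from per-reason counts.
  // Categories are ordered by descending count (an assumption, see above).
  def build(floorId: String, date: String, counts: Map[String, Long]): FloorErrors = {
    val categories = counts.toSeq
      .sortBy { case (_, n) => -n }
      .map { case (reason, n) => ErrorCategory(reason, DataPoints(n, date)) }
    FloorErrors(floorId, categories)
  }
}
```

Mapping these fields back to the JSON keys (`Floor_Id`, `Error_Category`, and so on) would be left to whichever JSON serializer is used.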
So far, I am reading records from Kafka, extracting the relevant fields, and applying a filter on the DataFrame.
import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types._
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
// toDF and the $"..." syntax also need `import spark.implicits._` from the active SparkSession.

val stream = KafkaUtils.createDirectStream[String, String](
  ssc,
  PreferConsistent,
  Subscribe[String, String](topics, kafkaParam)
)

// Field names follow the sample record above (End_time, Start_time, isError).
val schema = StructType(Seq(
  StructField("Floor_Id", StringType, true),
  StructField("Category", StructType(Seq(
    StructField("Type", StringType, true),
    StructField("End_time", LongType, true),
    StructField("Start_time", LongType, true))), true),
  StructField("HaltRecord", StructType(Seq(
    StructField("HaltReason", StringType, true),
    StructField("Severity", StringType, true),
    StructField("FaultErrorCategory", StringType, true),
    StructField("NonFaultErrorCategory", StringType, true))), true),
  StructField("Timestamp", StringType, true),
  StructField("Asset_Id", IntegerType, true),
  StructField("Description", StringType, true),
  StructField("isError", StringType, true)
))

stream.foreachRDD { rddRaw =>
  val rdd = rddRaw.map(_.value) // the record value is already a String
  val linesDataFrame = rdd.toDF("value")
  // Parse the JSON payload and pull out the fields needed for filtering
  val result = linesDataFrame.withColumn("value", from_json($"value", schema))
    .select($"value.Floor_Id", $"value.isError", $"value.Category.Type", $"value.Timestamp", $"value.HaltRecord.HaltReason")
  // Keep only halt errors for "Shop Floor 1" and emit (HaltReason, 1) pairs
  val errorResult = result.map(row => (row.getString(0), row.getString(1), row.getString(2), row.getString(3), row.getString(4)))
    .filter(x => x._1 == "Shop Floor 1" && x._2 == "y" && x._3 == "Halt")
    .map(x => (x._5, 1L))
  println(errorResult)
  //println(errorResult.getClass) //Dataset
  result.printSchema()
  errorResult.show()
}
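I need suggestions on how to proceed from here so that I can aggregate the counts, keep updating the state for the current day, and reset the state the next day.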
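To make the "update during the day, reset the next day" requirement concrete, here is a minimal sketch of the state-update logic in plain Scala, independent of Spark. The function has the `(Seq[V], Option[S]) => Option[S]` shape that Spark Streaming's `updateStateByKey` accepts. `DailyCount`, `DailyState`, `dayOf`, and `updateDaily` are hypothetical names, and the day is taken from the record's `Timestamp` on the assumption that it uses the ISO-8601 layout shown in the sample record.

```scala
import java.time.LocalDate

// Hypothetical state type: the running count and the day it belongs to.
final case class DailyCount(date: LocalDate, count: Long)

object DailyState {
  // Extracts the calendar day from a timestamp such as "2018-02-13T12:00:01".
  // Assumes the ISO-8601 layout shown in the sample record.
  def dayOf(timestamp: String): LocalDate = LocalDate.parse(timestamp.take(10))

  // newValues: (day, increment) pairs seen for one key in the current batch.
  // state: the previous state for that key, if any.
  // Counts are kept only for the most recent day seen; older days are dropped,
  // which is what resets the count once a record for a new day arrives.
  def updateDaily(newValues: Seq[(LocalDate, Long)],
                  state: Option[DailyCount]): Option[DailyCount] = {
    if (newValues.isEmpty) state
    else {
      val days = newValues.map(_._1) ++ state.map(_.date).toSeq
      val latestDay = days.maxBy(_.toEpochDay)
      // Carry the old count forward only if it belongs to the latest day
      val carried = state.filter(_.date == latestDay).map(_.count).getOrElse(0L)
      val added = newValues.collect { case (d, n) if d == latestDay => n }.sum
      Some(DailyCount(latestDay, carried + added))
    }
  }
}
```

With a `DStream[(String, (LocalDate, Long))]` keyed by `HaltReason`, this could be passed as `updateStateByKey(DailyState.updateDaily _)`, which requires checkpointing to be enabled on the `StreamingContext`. Note that in this sketch a key's count resets only when a record for a new day arrives for that key; resetting on the wall clock would mean comparing against the current date instead.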