How to do an aggregation for the current day in Spark Streaming

Time: 2019-02-13 14:00:39

Tags: apache-spark spark-streaming

I have a requirement where, whenever a new record arrives, I need to compute the count of a particular value for the current day.

The input record looks like this:

{"Floor_Id" : "Shop Floor 1",
"HaltRecord" : {
    "HaltReason" : "Danahydraulic Error",
    "Severity" : "Low",
    "FaultErrorCategory" : "Docked",
    "NonFaultErrorCategory" : null
},
"Description" : "Forklift",
"Category" : {
    "Type" : "Halt",
    "End_time" : NumberLong(2018-02-13T12:00:01),
    "Start_time" : NumberLong(2018-02-13T12:00:00)
},
"Asset_Id" : 123,
"isError" : "y",
"Timestamp": 2018-02-13T12:00:01}

The output response should look something like:

{
    "Floor_Id": "Galileo_001",
    "Error_Category": [
        {
            "Category": "Operator Error",
            "DataPoints": 
                {
                    "NumberOfErrors": 20,
                    "Date": 2018-02-13
                }
        },
        {
            "Category": "Dana Hydraulic Error",
            "DataPoints": {
                    "NumberOfErrors": 15,
                    "Date": 2018-02-13
                }
        }
    ]
}

So far, I am reading records from Kafka, extracting the appropriate fields, and applying a filter on the dataframe.

val stream = KafkaUtils.createDirectStream[String, String](
      ssc,
      PreferConsistent,
      Subscribe[String, String](topics, kafkaParam)
    )

    val schema = StructType(Seq(
      StructField("Floor_Id", StringType, true),
      StructField("Category", StructType(Seq(
        StructField("Type", StringType, true),
        StructField("End_time", LongType, true),
        StructField("Start_time", LongType, true))), true),
      StructField("HaltRecord", StructType(Seq(
        StructField("HaltReason", StringType, true),
        StructField("Severity", StringType, true),
        StructField("FaultErrorCategory", StringType, true),
        StructField("NonFaultErrorCategory", StringType, true))), true),
      StructField("Timestamp", StringType, true),
      StructField("Asset_Id", IntegerType, true),
      StructField("Description", StringType, true),
      StructField("isError", StringType, true)))

    stream.foreachRDD { rddRaw =>
      val rdd = rddRaw.map(_.value.toString) // or rddRaw.map(_._2) on the older API
      val linesDataFrame = rdd.toDF("value")
      val result = linesDataFrame
        .withColumn("value", from_json($"value", schema))
        .select($"value.Floor_Id", $"value.isError", $"value.Category.Type",
          $"value.Timestamp", $"value.HaltRecord.HaltReason")

      // Keep only halt errors for this floor, as (HaltReason, 1L) pairs.
      val errorResult = result
        .map(row => (row.getString(0), row.getString(1), row.getString(2), row.getString(3), row.getString(4)))
        .filter(x => x._1 == "Shop Floor 1" && x._2 == "y" && x._3 == "Halt")
        .map(x => (x._5, 1L))

      result.printSchema()
      errorResult.show()
    }
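As a first step toward the desired output, the filtered `(HaltReason, 1L)` pairs can be keyed by day, so counts partition naturally by calendar date. Here is a minimal sketch of that keying logic in plain Scala; the helper names `toDayKey` and `countPerDay` are illustrative, not from the original code, and in the streaming job the same logic would run as a `reduceByKey` over the batch:

```scala
// Illustrative helper: key each error by (date, haltReason), assuming an
// ISO-8601 timestamp string whose first 10 characters are the date.
def toDayKey(timestamp: String, haltReason: String): (String, String) =
  (timestamp.take(10), haltReason) // "2018-02-13T12:00:01" -> "2018-02-13"

// Per-batch counts keyed by (date, haltReason); in the job this would be
// a map + reduceByKey over the (timestamp, reason) pairs.
def countPerDay(records: Seq[(String, String)]): Map[(String, String), Long] =
  records
    .map { case (ts, reason) => toDayKey(ts, reason) }
    .groupBy(identity)
    .map { case (k, group) => (k, group.size.toLong) }
```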

I need suggestions on how to proceed from here so that I can aggregate and keep updating the state for the current day, then reset the state the next day.
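One common approach in the DStream API is `updateStateByKey` (or `mapWithState`), with the date embedded in the key: when a new day starts, records simply accumulate under fresh keys, so no explicit reset is needed, and returning `None` for a key drops its state. This is a hedged sketch, not the only way to do it; the function name is illustrative, and stateful transformations require `ssc.checkpoint(...)` to be set:

```scala
// Merge this batch's increments into the running count for one
// ((date, haltReason)) key. Returning None removes the key's state,
// which is how old days' entries can eventually be pruned.
def updateCount(newCounts: Seq[Long], state: Option[Long]): Option[Long] = {
  val updated = state.getOrElse(0L) + newCounts.sum
  if (updated == 0L) None else Some(updated)
}

// Wiring in the job (sketch):
// val keyed: DStream[((String, String), Long)] = ... // ((date, haltReason), 1L)
// val dailyCounts = keyed.updateStateByKey(updateCount)
```

Since yesterday's keys stop receiving updates after midnight, "today's" totals are just the state entries whose date component equals the current date; the stale keys can be cleaned up later by returning `None` for them.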

0 Answers:

No answers yet