Real-time streaming job based on Structured Streaming

Time: 2020-09-07 13:16:17

Tags: apache-spark spark-structured-streaming

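I am using Structured Streaming to read JSON data from Kafka, and each JSON message contains several windows of time-series data. The JSON format is as follows: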

{"id": "fd78sfsdfsd8vs", 
 "item": [{"data_identifier": "algid1_set1_totalcount_lstm",
           "time_series": [{"time": "20200903 00:00:00", "value": 342342.12},
                           {"time": "20200903 00:00:05", "value": 342421.88},
                           {"time": "20200903 00:00:10", "value": 351232.92}]},
          {"data_identifier": "algid2_set2_totalcount_lstm",
           "time_series": [{"time": "20200903 00:00:00", "value": 342342.12},
                           {"time": "20200903 00:00:05", "value": 342421.88},
                           {"time": "20200903 00:00:10", "value": 351232.92}]}
         ]
}

I then process the JSON data into a DataFrame and run anomaly detection on the time-series data in that DataFrame. The DataFrame looks like this:

+--------------+----------------------+-----------------+---------+
|            id|data_identifier_method|             time|    value|
+--------------+----------------------+-----------------+---------+
|fd78sfsdfsd8vs|  algid1_set1_total...|20200903 00:00:00|342342.12|
|fd78sfsdfsd8vs|  algid1_set1_total...|20200903 00:00:05|342421.88|
|fd78sfsdfsd8vs|  algid1_set1_total...|20200903 00:00:10|351232.92|
|fd78sfsdfsd8vs|  algid2_set2_total...|20200903 00:00:00|342342.12|
|fd78sfsdfsd8vs|  algid2_set2_total...|20200903 00:00:05|342421.88|
|fd78sfsdfsd8vs|  algid2_set2_total...|20200903 00:00:10|351232.92|
|fd78sfsdfsd8vs|  algid1_set1_total...|20200903 00:00:00|342342.12|
|fd78sfsdfsd8vs|  algid1_set1_total...|20200903 00:00:05|342421.88|
|fd78sfsdfsd8vs|  algid1_set1_total...|20200903 00:00:10|351232.92|
|fd78sfsdfsd8vs|  algid2_set2_total...|20200903 00:00:00|342342.12|
|fd78sfsdfsd8vs|  algid2_set2_total...|20200903 00:00:05|342421.88|
|fd78sfsdfsd8vs|  algid2_set2_total...|20200903 00:00:10|351232.92|
|fd78sfsdfsd8vs|  algid1_set1_total...|20200903 00:00:00|342342.12|
|fd78sfsdfsd8vs|  algid1_set1_total...|20200903 00:00:05|342421.88|
|fd78sfsdfsd8vs|  algid1_set1_total...|20200903 00:00:10|351232.92|
|fd78sfsdfsd8vs|  algid2_set2_total...|20200903 00:00:00|342342.12|
|fd78sfsdfsd8vs|  algid2_set2_total...|20200903 00:00:05|342421.88|
|fd78sfsdfsd8vs|  algid2_set2_total...|20200903 00:00:10|351232.92|
|fd78sfsdfsd8vs|  algid1_set1_total...|20200903 00:00:00|342342.12|
|fd78sfsdfsd8vs|  algid1_set1_total...|20200903 00:00:05|342421.88|
+--------------+----------------------+-----------------+---------+
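
To make the row shape above concrete, here is a minimal plain-Python sketch of the flattening step, assuming each Kafka record value is one JSON string in the format shown earlier. The function name `flatten_message` is hypothetical and is not part of my job; in Spark the same shape would typically come from `from_json` followed by `explode` on the nested arrays.

```python
import json

def flatten_message(raw):
    """Turn one JSON message into (id, data_identifier, time, value) rows."""
    message = json.loads(raw)
    rows = []
    for item in message["item"]:
        for point in item["time_series"]:
            rows.append((message["id"],
                         item["data_identifier"],
                         point["time"],
                         point["value"]))
    return rows

# One message in the format from the question (two series, three points each).
sample = json.dumps({
    "id": "fd78sfsdfsd8vs",
    "item": [
        {"data_identifier": "algid1_set1_totalcount_lstm",
         "time_series": [{"time": "20200903 00:00:00", "value": 342342.12},
                         {"time": "20200903 00:00:05", "value": 342421.88},
                         {"time": "20200903 00:00:10", "value": 351232.92}]},
        {"data_identifier": "algid2_set2_totalcount_lstm",
         "time_series": [{"time": "20200903 00:00:00", "value": 342342.12},
                         {"time": "20200903 00:00:05", "value": 342421.88},
                         {"time": "20200903 00:00:10", "value": 351232.92}]},
    ],
})

rows = flatten_message(sample)
for row in rows:
    print(row)
```

Each message yields one row per time-series point, so the `id` column is what ties rows back to the message they came from.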

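Given how Structured Streaming works, I would like each JSON message to be processed independently, with no dependence on any other JSON message. Is this idea feasible, and if so, how can it be implemented?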
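
What I mean by "independent" can be sketched in plain Python as follows. This is only an illustration under assumptions: `process_message` and `detect_anomalies` are hypothetical names, the median-deviation rule stands in for whatever detector is actually used, and the second message is made-up test data. In Structured Streaming, a comparable shape might be reached by grouping rows on `id` (for example inside `foreachBatch`), but I have not confirmed that this is the right approach.

```python
import json
from statistics import median

def detect_anomalies(values, threshold=0.02):
    """Hypothetical detector: flag values whose relative deviation
    from the series median exceeds `threshold`."""
    center = median(values)
    scale = abs(center) or 1.0  # avoid division by zero
    return [abs(v - center) / scale > threshold for v in values]

def process_message(raw):
    """Process one JSON message using only the data inside that message."""
    message = json.loads(raw)
    flagged_times = {}
    for item in message["item"]:
        series = item["time_series"]
        flags = detect_anomalies([p["value"] for p in series])
        flagged_times[item["data_identifier"]] = [
            p["time"] for p, is_anomaly in zip(series, flags) if is_anomaly
        ]
    return message["id"], flagged_times

messages = [
    json.dumps({"id": "fd78sfsdfsd8vs", "item": [
        {"data_identifier": "algid1_set1_totalcount_lstm",
         "time_series": [{"time": "20200903 00:00:00", "value": 342342.12},
                         {"time": "20200903 00:00:05", "value": 342421.88},
                         {"time": "20200903 00:00:10", "value": 351232.92}]}]}),
    # Made-up second message, used only to show that results do not mix.
    json.dumps({"id": "hypothetical_second_id", "item": [
        {"data_identifier": "algid1_set1_totalcount_lstm",
         "time_series": [{"time": "20200903 00:00:00", "value": 10.0},
                         {"time": "20200903 00:00:05", "value": 10.1},
                         {"time": "20200903 00:00:10", "value": 10.05}]}]}),
]

# Each message is handled on its own; no state is shared between calls.
outputs = [process_message(m) for m in messages]
for message_id, flagged in outputs:
    print(message_id, flagged)
```

Because `process_message` reads nothing outside its own argument, the result for one `id` cannot be affected by the values carried in another message, which is the property I am asking about.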

0 answers:

No answers