Spark Structured Streaming: parsing JSON data

Date: 2019-12-17 12:15:27

Tags: json apache-spark apache-spark-sql spark-structured-streaming

I am using Structured Streaming to consume data from Kafka. The data in Kafka is in JSON format, and the messages I receive look like this:

JSON data

{"actly_payed":"300.0","total_amount":"2893.0","org_id":"8888","product_id":"4819569","payed_date":"2019-10-31 20:34:04","id":"200946364","order_id":"100233856","product_name":"test product_name"}

Spark code

import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;

// Flatten each JSON message into one "key:value" string per field.
// handleJson is the asker's helper that parses a JSON string into a
// JSONObject (a Map-like object, e.g. fastjson's JSONObject).
Dataset<String> stringDataset = words.flatMap(new FlatMapFunction<String, String>() {
    @Override
    public Iterator<String> call(String s) throws Exception {
        // s is one JSON message like the sample shown above.
        JSONObject jsonObject = handleJson(s);
        List<String> list = new ArrayList<>();
        for (Map.Entry<String, Object> entry : jsonObject.entrySet()) {
            list.add(entry.getKey() + ":" + entry.getValue());
        }
        return list.iterator();
    }
}, Encoders.STRING());

The result of this operation is as follows:

There is only one value column in this DataFrame, and that value is the JSON string:

+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|value                                                                                                                                                                                   |
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|{"actly_payed":"300.0","total_amount":"2893.0","org_id":"8888","product_id":"4819569","payed_date":"2019-10-31 20:34:04","id":"200946364","order_id":"100233856","product_name":"test product_name"}|
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

My question

The value in my Dataset is a JSON string (in key:value form). How can I use spark.sql("select columns from tableName") to query the data? I hope to get your help.

Spark version: 2.3.0

Language used: Java

1 answer:

Answer 0 (score: 0)

You only need to apply the following (a Scala example):

import org.apache.spark.sql.Dataset
import spark.implicits._ // needed for .toDS

// Build a Dataset[String] holding the raw JSON message.
val stringDataset: Dataset[String] = Seq(
  """{"actly_payed":"300.0","total_amount":"2893.0","org_id":"8888","product_id":"4819569","payed_date":"2019-10-31 20:34:04","id":"200946364","order_id":"100233856","product_name":"test product_name"}"""
).toDS

// Let Spark infer the schema directly from the JSON strings.
// (The Dataset[String] overload of spark.read.json supersedes the
// RDD[String] overload, which is deprecated since Spark 2.2.)
val df = spark.read.json(stringDataset)

df.show(false)

+-----------+---------+---------+------+-------------------+----------+-----------------+------------+
|actly_payed|id       |order_id |org_id|payed_date         |product_id|product_name     |total_amount|
+-----------+---------+---------+------+-------------------+----------+-----------------+------------+
|300.0      |200946364|100233856|8888  |2019-10-31 20:34:04|4819569   |test product_name|2893.0      |
+-----------+---------+---------+------+-------------------+----------+-----------------+------------+
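
The snippet above runs as a batch job. On a Structured Streaming source, spark.read.json is not available, and the question also asks how to run spark.sql from Java. Below is a minimal Java sketch, assuming words is the asker's Dataset<String> of Kafka message values, spark is the active SparkSession, and "orders" is an arbitrary view name: parse the JSON against an explicit schema with from_json, flatten the struct, and register a temp view.

import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.from_json;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

// Schema matching the sample JSON; every field arrives as a string.
StructType schema = new StructType()
        .add("actly_payed", DataTypes.StringType)
        .add("total_amount", DataTypes.StringType)
        .add("org_id", DataTypes.StringType)
        .add("product_id", DataTypes.StringType)
        .add("payed_date", DataTypes.StringType)
        .add("id", DataTypes.StringType)
        .add("order_id", DataTypes.StringType)
        .add("product_name", DataTypes.StringType);

// Parse the JSON string held in the single "value" column into a
// struct, then flatten the struct into one column per field.
Dataset<Row> parsed = words
        .select(from_json(col("value"), schema).as("data"))
        .select("data.*");

// Register a temp view so the data can be queried with spark.sql,
// as the question asks ("orders" is a hypothetical view name).
parsed.createOrReplaceTempView("orders");
Dataset<Row> result = spark.sql(
        "select order_id, product_name, total_amount from orders");

Since every field in the sample is a JSON string, the schema uses StringType throughout; numeric columns can be cast afterwards. When the source is streaming, result is itself a streaming Dataset, so it is written out with writeStream rather than show().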