Question

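I'm just getting started with Apache Spark. I have a requirement to convert JSON logs into flattened metrics, which can also be thought of as a simple CSV.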

For example:

{
  "orderId": 1,
  "orderData": {
    "customerId": 123,
    "orders": [
      {
        "itemCount": 2,
        "items": [
          {
            "quantity": 1,
            "price": 315
          },
          {
            "quantity": 2,
            "price": 300
          }
        ]
      }
    ]
  }
}

This can be treated as a single JSON log record, and I want to convert it into:

orderId,customerId,totalValue,units
  1    ,   123    ,   915    ,  3

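I have been going through the Spark SQL documentation and can use it to pull out individual values, e.g. "select orderId, orderData.customerId from Order", but I'm not sure how to get the sum of all prices and units.

What is the best practice for doing this with Apache Spark?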

Answer 1

Try:

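>>> import json
>>> # Import only what is needed; a wildcard import would shadow Python's built-in sum
>>> from pyspark.sql.functions import col, explode, sum as sum_
>>> doc = {"orderData": {"orders": [{"items": [{"quantity": 1, "price": 315}, {"quantity": 2, "price": 300}], "itemCount": 2}], "customerId": 123}, "orderId": 1}
>>> # read.json accepts an RDD of JSON strings, so serialize the dict first
>>> df = sqlContext.read.json(sc.parallelize([json.dumps(doc)]))
>>> # Explode orders, then items, so each row is one item; then aggregate per order
>>> df.select("orderId", "orderData.customerId", explode("orderData.orders").alias("order")) \
... .withColumn("item", explode("order.items")) \
... .groupBy("orderId", "customerId") \
... .agg(sum_(col("item.quantity") * col("item.price")).alias("totalValue"),
...      sum_("item.quantity").alias("units")) \
... .show()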
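
To sanity-check what the explode / groupBy / agg pipeline above is meant to compute, here is a minimal plain-Python sketch of the same aggregation on the sample record, with no Spark involved. The names log_line and flatten_order are made up for illustration, and the sketch assumes every record has the same shape as the example in the question.

```python
import json

# Sample log record from the question, as a single JSON string.
log_line = (
    '{"orderId": 1, "orderData": {"customerId": 123, "orders": ['
    '{"itemCount": 2, "items": ['
    '{"quantity": 1, "price": 315}, {"quantity": 2, "price": 300}]}]}}'
)


def flatten_order(record):
    """Return (orderId, customerId, totalValue, units) for one parsed record."""
    # Plain-Python counterpart of the two explode() calls:
    # one entry per item, across all orders in the record.
    items = [
        item
        for order in record["orderData"]["orders"]
        for item in order["items"]
    ]
    units = sum(item["quantity"] for item in items)
    total_value = sum(item["quantity"] * item["price"] for item in items)
    return (record["orderId"], record["orderData"]["customerId"], total_value, units)


row = flatten_order(json.loads(log_line))
print("orderId,customerId,totalValue,units")
print(",".join(str(value) for value in row))  # prints 1,123,915,3
```

Note that totalValue is the sum of quantity * price per item (1 * 315 + 2 * 300 = 915), which is what the expected output in the question requires; summing price alone would give 615.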

Answer 2

For anyone looking for a Java version of the solution above, see:

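    // Requires: import static org.apache.spark.sql.functions.*;
    SparkSession spark = SparkSession
            .builder()
            .config(conf)   // conf is assumed to be an existing SparkConf
            .getOrCreate();

    // SparkSession can read JSON directly; a separate SQLContext is not needed.
    // By default each line of order.json must hold one complete JSON object.
    Dataset<Row> orders = spark.read().json("order.json");
    Dataset<Row> newOrders = orders.select(
            col("orderId"),
            col("orderData.customerId"),
            explode(col("orderData.orders")).alias("order"))
            .withColumn("item", explode(col("order.items")))
            .groupBy(col("orderId"), col("customerId"))
            // totalValue is quantity * price summed over items (915 for the sample)
            .agg(sum(col("item.quantity").multiply(col("item.price"))).alias("totalValue"),
                 sum(col("item.quantity")).alias("units"));
    newOrders.show();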

Creating aggregated metrics from JSON logs in Apache Spark
