How to pivot a streaming Dataset?

Date: 2017-12-01 13:12:01

Tags: apache-spark spark-structured-streaming apache-spark-2.0

I am trying to pivot a Spark streaming Dataset (Structured Streaming), but I get an AnalysisException (excerpt below).

Could someone confirm that pivoting is indeed not supported in Structured Streaming (Spark 2.0), and perhaps suggest alternative approaches?

  

    Exception in thread "main" org.apache.spark.sql.AnalysisException: Queries with streaming sources must be executed with writeStream.start();;
    kafka
      at org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$.org$apache$spark$sql$catalyst$analysis$UnsupportedOperationChecker$$throwError(UnsupportedOperationChecker.scala:297)
      at org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$$anonfun$checkForBatch$1.apply(UnsupportedOperationChecker.scala:36)
      at org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$$anonfun$checkForBatch$1.apply(UnsupportedOperationChecker.scala:34)
      at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:127)

3 answers:

Answer 0: (score: 3)

tl;dr The pivot aggregation is not supported in Spark Structured Streaming up to and including 2.2.0 (and it does not appear to be supported in 2.3.0-SNAPSHOT either).

I used Spark 2.3.0-SNAPSHOT built today from master.

scala> spark.version
res0: String = 2.3.0-SNAPSHOT

UnsupportedOperationChecker (which you can find in the stack trace) checks whether the (logical plan of a) streaming query uses only supported operations.

When you execute pivot you have to groupBy first, since that is the only interface that gives you access to pivot.
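
For reference, this is how pivot is normally used on a batch Dataset; a minimal sketch with a hypothetical sales Dataset whose columns (city, year, amount) are invented purely for illustration:

    // groupBy returns a RelationalGroupedDataset, the only type that exposes pivot.
    // With no explicit values, pivot first collects the distinct years to know
    // how many output columns to generate.
    val pivoted = sales
      .groupBy("city")
      .pivot("year")
      .sum("amount")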

There are two issues with pivot:

  1. pivot wants to know how many columns to generate values for and hence does a collect, which is not possible with a streaming Dataset.

  2. pivot is actually another aggregation (beside groupBy) that Spark Structured Streaming does not support.

Let's look at issue 1 first, where no columns to pivot on are defined:

    val sq = spark
      .readStream
      .format("rate")
      .load
      .groupBy("value")
      .pivot("timestamp") // <-- pivot with no values
      .count
      .writeStream
      .format("console")
    scala> sq.start
    org.apache.spark.sql.AnalysisException: Queries with streaming sources must be executed with writeStream.start();;
    rate
      at org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$.org$apache$spark$sql$catalyst$analysis$UnsupportedOperationChecker$$throwError(UnsupportedOperationChecker.scala:351)
      at org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$$anonfun$checkForBatch$1.apply(UnsupportedOperationChecker.scala:37)
      at org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$$anonfun$checkForBatch$1.apply(UnsupportedOperationChecker.scala:35)
      at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:127)
      at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:126)
      at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:126)
      at scala.collection.immutable.List.foreach(List.scala:381)
      at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:126)
      at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:126)
      at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:126)
      at scala.collection.immutable.List.foreach(List.scala:381)
      at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:126)
      at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:126)
      at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:126)
      at scala.collection.immutable.List.foreach(List.scala:381)
      at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:126)
      at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:126)
      at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:126)
      at scala.collection.immutable.List.foreach(List.scala:381)
      at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:126)
      at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:126)
      at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:126)
      at scala.collection.immutable.List.foreach(List.scala:381)
      at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:126)
      at org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$.checkForBatch(UnsupportedOperationChecker.scala:35)
      at org.apache.spark.sql.execution.QueryExecution.assertSupported(QueryExecution.scala:64)
      at org.apache.spark.sql.execution.QueryExecution.withCachedData$lzycompute(QueryExecution.scala:75)
      at org.apache.spark.sql.execution.QueryExecution.withCachedData(QueryExecution.scala:73)
      at org.apache.spark.sql.execution.QueryExecution.optimizedPlan$lzycompute(QueryExecution.scala:79)
      at org.apache.spark.sql.execution.QueryExecution.optimizedPlan(QueryExecution.scala:79)
      at org.apache.spark.sql.execution.QueryExecution.sparkPlan$lzycompute(QueryExecution.scala:85)
      at org.apache.spark.sql.execution.QueryExecution.sparkPlan(QueryExecution.scala:81)
      at org.apache.spark.sql.execution.QueryExecution.executedPlan$lzycompute(QueryExecution.scala:90)
      at org.apache.spark.sql.execution.QueryExecution.executedPlan(QueryExecution.scala:90)
      at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3189)
      at org.apache.spark.sql.Dataset.collect(Dataset.scala:2665)
      at org.apache.spark.sql.RelationalGroupedDataset.pivot(RelationalGroupedDataset.scala:327)
      ... 49 elided
    

The last two lines show the issue, i.e. pivot does a collect under the covers, and hence the problem.

The other issue is that even if you specify the values for the columns to pivot on, you then hit the other problem due to multiple aggregations (and you can see that this time the check is actually for streaming, not batch, as it was in the first case).

    val sq = spark
      .readStream
      .format("rate")
      .load
      .groupBy("value")
      .pivot("timestamp", Seq(1)) // <-- pivot with explicit values
      .count
      .writeStream
      .format("console")
    scala> sq.start
    org.apache.spark.sql.AnalysisException: Multiple streaming aggregations are not supported with streaming DataFrames/Datasets;;
    Project [value#128L, __pivot_count(1) AS `count` AS `count(1) AS ``count```#141[0] AS 1#142L]
    +- Aggregate [value#128L], [value#128L, pivotfirst(timestamp#127, count(1) AS `count`#137L, 1000000, 0, 0) AS __pivot_count(1) AS `count` AS `count(1) AS ``count```#141]
       +- Aggregate [value#128L, timestamp#127], [value#128L, timestamp#127, count(1) AS count(1) AS `count`#137L]
          +- StreamingRelation DataSource(org.apache.spark.sql.SparkSession@5dd63368,rate,List(),None,List(),None,Map(),None), rate, [timestamp#127, value#128L]
    
      at org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$.org$apache$spark$sql$catalyst$analysis$UnsupportedOperationChecker$$throwError(UnsupportedOperationChecker.scala:351)
      at org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$.checkForStreaming(UnsupportedOperationChecker.scala:92)
      at org.apache.spark.sql.streaming.StreamingQueryManager.createQuery(StreamingQueryManager.scala:232)
      at org.apache.spark.sql.streaming.StreamingQueryManager.startQuery(StreamingQueryManager.scala:278)
      at org.apache.spark.sql.streaming.DataStreamWriter.start(DataStreamWriter.scala:278)
      ... 49 elided
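
In later Spark versions (2.4 and up), a common workaround is to run the pivot inside foreachBatch, where every micro-batch is a plain batch Dataset and the batch-only collect is no longer a problem; the Java answer below takes exactly this route. A minimal Scala sketch against the rate source (the grouping columns are chosen only for illustration):

    import org.apache.spark.sql.DataFrame

    // Giving the function an explicit type avoids the foreachBatch overload
    // ambiguity that Scala 2.12 lambdas can run into.
    val pivotEachBatch: (DataFrame, Long) => Unit = { (batch, _) =>
      batch
        .groupBy("value")
        .pivot("timestamp")   // an ordinary batch pivot inside the micro-batch
        .count
        .show()
    }

    spark
      .readStream
      .format("rate")
      .load
      .writeStream
      .foreachBatch(pivotEachBatch)
      .start()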
    

Answer 1: (score: 0)

Here is a simple Java example based on Jacek's answer above:

JSON array:

[{
        "customer_id": "d6315a00",
        "product": "Super widget",
        "price": 10,
        "bought_date": "2019-01-01"
    },
    {
        "customer_id": "d6315a00",
        "product": "Super widget",
        "price": 10,
        "bought_date": "2019-01-01"
    },
    {
        "customer_id": "d6315a00",
        "product": "Super widget",
        "price": 10,
        "bought_date": "2019-01-02"
    },
    {
        "customer_id": "d6315a00",
        "product": "Food widget",
        "price": 4,
        "bought_date": "2019-08-20"
    },
    {
        "customer_id": "d6315cd0",
        "product": "Food widget",
        "price": 4,
        "bought_date": "2019-09-19"
    }, {
        "customer_id": "d6315e2e",
        "product": "Bike widget",
        "price": 10,
        "bought_date": "2019-01-01"
    }, {
        "customer_id": "d6315a00",
        "product": "Bike widget",
        "price": 10,
        "bought_date": "2019-03-10"
    },
    {
        "customer_id": "d631614e",
        "product": "Garage widget",
        "price": 4,
        "bought_date": "2019-02-15"
    }
]
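
For the example to produce any output, this array has to be published to the utilization topic on localhost:9092 (both names come from the Java code below). One possible way, sketched with Spark's batch Kafka sink and the spark-sql-kafka-0-10 package on the classpath (any other Kafka producer works just as well):

    // One-off helper that pushes the JSON array above to Kafka as a single message.
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("PublishTestData").getOrCreate()
    import spark.implicits._

    // Shortened to the first record here; use the full array shown above.
    val jsonArray = """[{"customer_id":"d6315a00","product":"Super widget","price":10,"bought_date":"2019-01-01"}]"""

    Seq(jsonArray).toDF("value")            // the Kafka sink expects a value column
      .write
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("topic", "utilization")
      .save()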

Java code:

package io.centilliard;

import static org.apache.spark.sql.functions.explode;
import static org.apache.spark.sql.functions.from_json;

import org.apache.spark.sql.AnalysisException;
import org.apache.spark.sql.Column;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.streaming.DataStreamWriter;
import org.apache.spark.sql.streaming.StreamingQuery;
import org.apache.spark.sql.streaming.StreamingQueryException;
import org.apache.spark.sql.types.ArrayType;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.Metadata;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

import scala.Function2;
import scala.runtime.BoxedUnit;

public class Pivot {

    public static void main(String[] args) throws StreamingQueryException, AnalysisException {

        StructType schema = new StructType(new StructField[]{
                new StructField("customer_id", DataTypes.StringType, false, Metadata.empty()),  
                new StructField("product", DataTypes.StringType, false, Metadata.empty()),          
                new StructField("price", DataTypes.IntegerType, false, Metadata.empty()),               
                new StructField("bought_date", DataTypes.StringType, false, Metadata.empty())
            });

        ArrayType  arrayType = new ArrayType(schema, false);

        SparkSession spark = SparkSession
                .builder()
                .appName("SimpleExample")
                .getOrCreate();

        // Create a DataSet representing the stream of input lines from Kafka
        Dataset<Row> dataset = spark
                        .readStream()
                        .format("kafka")                
                        .option("kafka.bootstrap.servers", "localhost:9092")
                        .option("subscribe", "utilization")
                        .load()
                        .selectExpr("CAST(value AS STRING) as json");

        // Parse each Kafka value as a JSON array of purchases and explode it into rows
        Column col = new Column("json");
        Column data = from_json(col, arrayType).as("data");
        Column explode = explode(data);
        Dataset<Row> customers = dataset.select(explode).select("col.*");

        // Each micro-batch handed to foreachBatch is a regular (non-streaming)
        // Dataset, so the batch-only pivot can be applied to it directly.
        DataStreamWriter<Row> dataStreamWriter = customers.writeStream();

        StreamingQuery dataStream = dataStreamWriter.foreachBatch(new Function2<Dataset<Row>, Object, BoxedUnit>() {

            @Override
            public BoxedUnit apply(Dataset<Row> dataset, Object batchId) {

                // Plain batch pivot: one column per product, summing the prices
                dataset
                .groupBy("customer_id", "product", "bought_date")
                .pivot("product")
                .sum("price")
                .orderBy("customer_id")
                .show();

                return BoxedUnit.UNIT;
            }
        })
        .start();

        dataStream.awaitTermination();
    }

}

Output:

+-----------+-------------+-----------+-----------+-----------+-------------+------------+
|customer_id|      product|bought_date|Bike widget|Food widget|Garage widget|Super widget|
+-----------+-------------+-----------+-----------+-----------+-------------+------------+
|   d6315a00|  Bike widget| 2019-03-10|         20|       null|         null|        null|
|   d6315a00| Super widget| 2019-01-02|       null|       null|         null|          20|
|   d6315a00| Super widget| 2019-01-01|       null|       null|         null|          40|
|   d6315a00|  Food widget| 2019-08-20|       null|          8|         null|        null|
|   d6315cd0|  Food widget| 2019-09-19|       null|          8|         null|        null|
|   d6315e2e|  Bike widget| 2019-01-01|         20|       null|         null|        null|
|   d631614e|Garage widget| 2019-02-15|       null|       null|            8|        null|
+-----------+-------------+-----------+-----------+-----------+-------------+------------+

Answer 2: (score: 0)

In most cases you can use conditional aggregation as a workaround. The equivalent of

df.groupBy("timestamp").
   pivot("name", Seq("banana", "peach")).
   sum("value")

is the following conditional aggregation (imports added for completeness; $ assumes spark.implicits._ is in scope, as in spark-shell):

import org.apache.spark.sql.functions.{lit, sum, when}
import org.apache.spark.sql.types.IntegerType

df.filter($"name".isin(Seq("banana", "peach"): _*)).
   groupBy("timestamp").
   agg(
     sum(when($"name".equalTo("banana"), $"value").
         otherwise(lit(null))).
         cast(IntegerType).alias("banana"),
     sum(when($"name".equalTo("peach"), $"value").
         otherwise(lit(null))).
         cast(IntegerType).alias("peach")
   )
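
Since this is a single groupBy/agg, it passes the streaming checks and can be started as a regular streaming query. A minimal sketch, assuming the aggregated streaming DataFrame from above is called pivoted and the console sink is only a placeholder:

    // pivoted is the conditionally aggregated streaming DataFrame from above
    val query = pivoted
      .writeStream
      .outputMode("complete")   // a single streaming aggregation, so complete mode is allowed
      .format("console")
      .start()

    query.awaitTermination()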