I am trying to pivot a streaming Spark Dataset (Structured Streaming), but I get an AnalysisException (excerpt below).
Could someone confirm that pivot is indeed not supported in Structured Streaming (Spark 2.0), and perhaps suggest alternative approaches?
Exception in thread "main" org.apache.spark.sql.AnalysisException: Queries with streaming sources must be executed with writeStream.start();;
kafka
at org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$.org$apache$spark$sql$catalyst$analysis$UnsupportedOperationChecker$$throwError(UnsupportedOperationChecker.scala:297)
at org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$$anonfun$checkForBatch$1.apply(UnsupportedOperationChecker.scala:36)
at org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$$anonfun$checkForBatch$1.apply(UnsupportedOperationChecker.scala:34)
at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:127)
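For context, a minimal sketch of the kind of query I am attempting (the Kafka topic, bootstrap server and column names below are placeholders, not my actual ones):

// Hypothetical reproduction -- topic, server and column names are placeholders
val events = spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "events")
  .load
  .selectExpr("CAST(value AS STRING) as name", "timestamp")

events
  .groupBy("name")
  .pivot("timestamp") // <-- throws the AnalysisException above on a streaming Dataset
  .count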
Answer 0: (score: 3)
The pivot aggregation is not supported by Spark Structured Streaming up to and including 2.2.0 (and, it seems, not in 2.3.0-SNAPSHOT either).
I used Spark 2.3.0-SNAPSHOT built from master today:
scala> spark.version
res0: String = 2.3.0-SNAPSHOT
UnsupportedOperationChecker (which you can find in the stack trace) checks whether the (logical plan of a) streaming query uses supported operations only.
When you execute pivot you have to groupBy first, since that is the only interface that gives you pivot.
There are two problems with pivot:
1. pivot wants to know how many columns to generate values for and hence does a collect, which is not possible with a streaming Dataset.
2. pivot is actually another aggregation (beside groupBy) that Spark Structured Streaming does not support.
Let's look at problem 1, with no columns to pivot on defined.
val sq = spark
  .readStream
  .format("rate")
  .load
  .groupBy("value")
  .pivot("timestamp") // <-- pivot with no values
  .count
  .writeStream
  .format("console")
scala> sq.start
org.apache.spark.sql.AnalysisException: Queries with streaming sources must be executed with writeStream.start();;
rate
at org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$.org$apache$spark$sql$catalyst$analysis$UnsupportedOperationChecker$$throwError(UnsupportedOperationChecker.scala:351)
at org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$$anonfun$checkForBatch$1.apply(UnsupportedOperationChecker.scala:37)
at org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$$anonfun$checkForBatch$1.apply(UnsupportedOperationChecker.scala:35)
at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:127)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:126)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:126)
at scala.collection.immutable.List.foreach(List.scala:381)
at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:126)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:126)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:126)
at scala.collection.immutable.List.foreach(List.scala:381)
at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:126)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:126)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:126)
at scala.collection.immutable.List.foreach(List.scala:381)
at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:126)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:126)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:126)
at scala.collection.immutable.List.foreach(List.scala:381)
at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:126)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:126)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:126)
at scala.collection.immutable.List.foreach(List.scala:381)
at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:126)
at org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$.checkForBatch(UnsupportedOperationChecker.scala:35)
at org.apache.spark.sql.execution.QueryExecution.assertSupported(QueryExecution.scala:64)
at org.apache.spark.sql.execution.QueryExecution.withCachedData$lzycompute(QueryExecution.scala:75)
at org.apache.spark.sql.execution.QueryExecution.withCachedData(QueryExecution.scala:73)
at org.apache.spark.sql.execution.QueryExecution.optimizedPlan$lzycompute(QueryExecution.scala:79)
at org.apache.spark.sql.execution.QueryExecution.optimizedPlan(QueryExecution.scala:79)
at org.apache.spark.sql.execution.QueryExecution.sparkPlan$lzycompute(QueryExecution.scala:85)
at org.apache.spark.sql.execution.QueryExecution.sparkPlan(QueryExecution.scala:81)
at org.apache.spark.sql.execution.QueryExecution.executedPlan$lzycompute(QueryExecution.scala:90)
at org.apache.spark.sql.execution.QueryExecution.executedPlan(QueryExecution.scala:90)
at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3189)
at org.apache.spark.sql.Dataset.collect(Dataset.scala:2665)
at org.apache.spark.sql.RelationalGroupedDataset.pivot(RelationalGroupedDataset.scala:327)
... 49 elided
The last two lines show the issue: pivot does a collect under the covers, and hence the problem.
The other issue is that even if you specify the values of the columns to pivot on, you then hit the other problem due to multiple aggregations (and you can see that this time the check is for streaming, not batch, as it was in the first case).
val sq = spark
  .readStream
  .format("rate")
  .load
  .groupBy("value")
  .pivot("timestamp", Seq(1)) // <-- pivot with explicit values
  .count
  .writeStream
  .format("console")
scala> sq.start
org.apache.spark.sql.AnalysisException: Multiple streaming aggregations are not supported with streaming DataFrames/Datasets;;
Project [value#128L, __pivot_count(1) AS `count` AS `count(1) AS ``count```#141[0] AS 1#142L]
+- Aggregate [value#128L], [value#128L, pivotfirst(timestamp#127, count(1) AS `count`#137L, 1000000, 0, 0) AS __pivot_count(1) AS `count` AS `count(1) AS ``count```#141]
+- Aggregate [value#128L, timestamp#127], [value#128L, timestamp#127, count(1) AS count(1) AS `count`#137L]
+- StreamingRelation DataSource(org.apache.spark.sql.SparkSession@5dd63368,rate,List(),None,List(),None,Map(),None), rate, [timestamp#127, value#128L]
at org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$.org$apache$spark$sql$catalyst$analysis$UnsupportedOperationChecker$$throwError(UnsupportedOperationChecker.scala:351)
at org.apache.spark.sql.catalyst.analysis.UnsupportedOperationChecker$.checkForStreaming(UnsupportedOperationChecker.scala:92)
at org.apache.spark.sql.streaming.StreamingQueryManager.createQuery(StreamingQueryManager.scala:232)
at org.apache.spark.sql.streaming.StreamingQueryManager.startQuery(StreamingQueryManager.scala:278)
at org.apache.spark.sql.streaming.DataStreamWriter.start(DataStreamWriter.scala:278)
... 49 elided
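For contrast, the same groupBy / pivot / count works fine on a batch Dataset, where Spark is free to collect the distinct pivot values up front. A small sketch with made-up data:

// Batch pivot for contrast -- the sample rows are made up for illustration
import spark.implicits._

val batch = Seq(
  (0L, "2017-01-01 00:00:00"),
  (0L, "2017-01-01 00:00:01"),
  (1L, "2017-01-01 00:00:00")
).toDF("value", "timestamp")

batch
  .groupBy("value")
  .pivot("timestamp") // fine here: Spark can collect the distinct timestamps
  .count
  .show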
Answer 1: (score: 0)
Here is a simple Java example based on Jacek's answer above:
JSON array:
[{
"customer_id": "d6315a00",
"product": "Super widget",
"price": 10,
"bought_date": "2019-01-01"
},
{
"customer_id": "d6315a00",
"product": "Super widget",
"price": 10,
"bought_date": "2019-01-01"
},
{
"customer_id": "d6315a00",
"product": "Super widget",
"price": 10,
"bought_date": "2019-01-02"
},
{
"customer_id": "d6315a00",
"product": "Food widget",
"price": 4,
"bought_date": "2019-08-20"
},
{
"customer_id": "d6315cd0",
"product": "Food widget",
"price": 4,
"bought_date": "2019-09-19"
}, {
"customer_id": "d6315e2e",
"product": "Bike widget",
"price": 10,
"bought_date": "2019-01-01"
}, {
"customer_id": "d6315a00",
"product": "Bike widget",
"price": 10,
"bought_date": "2019-03-10"
},
{
"customer_id": "d631614e",
"product": "Garage widget",
"price": 4,
"bought_date": "2019-02-15"
}
]
Java code:
package io.centilliard;

import static org.apache.spark.sql.functions.explode;
import static org.apache.spark.sql.functions.from_json;

import org.apache.spark.sql.AnalysisException;
import org.apache.spark.sql.Column;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.streaming.DataStreamWriter;
import org.apache.spark.sql.streaming.StreamingQuery;
import org.apache.spark.sql.streaming.StreamingQueryException;
import org.apache.spark.sql.types.ArrayType;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.Metadata;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

import scala.Function2;
import scala.runtime.BoxedUnit;

public class Pivot {

    public static void main(String[] args) throws StreamingQueryException, AnalysisException {

        // Schema of a single purchase record inside the JSON array
        StructType schema = new StructType(new StructField[]{
                new StructField("customer_id", DataTypes.StringType, false, Metadata.empty()),
                new StructField("product", DataTypes.StringType, false, Metadata.empty()),
                new StructField("price", DataTypes.IntegerType, false, Metadata.empty()),
                new StructField("bought_date", DataTypes.StringType, false, Metadata.empty())
        });
        ArrayType arrayType = new ArrayType(schema, false);

        SparkSession spark = SparkSession
                .builder()
                .appName("SimpleExample")
                .getOrCreate();

        // Create a Dataset representing the stream of input lines from Kafka
        Dataset<Row> dataset = spark
                .readStream()
                .format("kafka")
                .option("kafka.bootstrap.servers", "localhost:9092")
                .option("subscribe", "utilization")
                .load()
                .selectExpr("CAST(value AS STRING) as json");

        // Parse the JSON array and explode it into one row per purchase
        Column col = new Column("json");
        Column data = from_json(col, arrayType).as("data");
        Column explode = explode(data);
        Dataset<Row> customers = dataset.select(explode).select("col.*");

        // pivot is not supported on the streaming Dataset itself, but each micro-batch
        // handed to foreachBatch is a regular (batch) Dataset, so pivot works there.
        DataStreamWriter<Row> dataStreamWriter = customers.writeStream();

        StreamingQuery dataStream = dataStreamWriter.foreachBatch(new Function2<Dataset<Row>, Object, BoxedUnit>() {
            @Override
            public BoxedUnit apply(Dataset<Row> batch, Object batchId) {
                batch
                        .groupBy("customer_id", "product", "bought_date")
                        .pivot("product")
                        .sum("price")
                        .orderBy("customer_id")
                        .show();
                return BoxedUnit.UNIT;
            }
        })
        .start();

        dataStream.awaitTermination();
    }
}
Output:
+-----------+-------------+-----------+-----------+-----------+-------------+------------+
|customer_id| product|bought_date|Bike widget|Food widget|Garage widget|Super widget|
+-----------+-------------+-----------+-----------+-----------+-------------+------------+
| d6315a00| Bike widget| 2019-03-10| 20| null| null| null|
| d6315a00| Super widget| 2019-01-02| null| null| null| 20|
| d6315a00| Super widget| 2019-01-01| null| null| null| 40|
| d6315a00| Food widget| 2019-08-20| null| 8| null| null|
| d6315cd0| Food widget| 2019-09-19| null| 8| null| null|
| d6315e2e| Bike widget| 2019-01-01| 20| null| null| null|
| d631614e|Garage widget| 2019-02-15| null| null| 8| null|
+-----------+-------------+-----------+-----------+-----------+-------------+------------+
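For reference, a rough Scala equivalent of the foreachBatch approach (a sketch only, Spark 2.4+; the topic name, schema and column names are copied from the Java example above):

// Scala sketch of the same foreachBatch approach (Spark 2.4+)
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{explode, from_json}
import org.apache.spark.sql.types._
import spark.implicits._

val purchaseSchema = ArrayType(new StructType()
  .add("customer_id", StringType)
  .add("product", StringType)
  .add("price", IntegerType)
  .add("bought_date", StringType))

val customers = spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "utilization")
  .load
  .selectExpr("CAST(value AS STRING) as json")
  .select(explode(from_json($"json", purchaseSchema)).as("col"))
  .select("col.*")

// Each micro-batch handed to foreachBatch is a regular (batch) DataFrame, so pivot works there
def showPivoted(batch: DataFrame, batchId: Long): Unit = {
  batch
    .groupBy("customer_id", "product", "bought_date")
    .pivot("product")
    .sum("price")
    .orderBy("customer_id")
    .show()
}

customers.writeStream
  .foreachBatch(showPivoted _)
  .start()
  .awaitTermination()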
Answer 2: (score: 0)
In most cases you can use conditional aggregation as a workaround. The equivalent of
df.groupBy("timestamp").
  pivot("name", Seq("banana", "peach")).
  sum("value")
is
// Uses a real null (not the string "null"), so the sums skip the other rows.
// Assumes: import org.apache.spark.sql.functions.{sum, when, lit}
//          import org.apache.spark.sql.types.IntegerType
df.filter($"name".isin(Seq("banana", "peach"): _*)).
  groupBy("timestamp").
  agg(
    sum(when($"name".equalTo("banana"), $"value").
      otherwise(lit(null))).
      cast(IntegerType).alias("banana"),
    sum(when($"name".equalTo("peach"), $"value").
      otherwise(lit(null))).
      cast(IntegerType).alias("peach")
  )
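Because this is a single aggregation, it also runs as a streaming query. A quick sketch; the rate source and the synthetic name column are made up purely for illustration:

// Streaming sketch of the conditional-aggregation workaround
import org.apache.spark.sql.functions.{sum, when}
import org.apache.spark.sql.types.IntegerType
import spark.implicits._

val stream = spark.readStream
  .format("rate")
  .load
  .withColumn("name", when($"value" % 2 === 0, "banana").otherwise("peach"))

val pivoted = stream
  .filter($"name".isin("banana", "peach"))
  .groupBy("timestamp")
  .agg(
    sum(when($"name" === "banana", $"value")).cast(IntegerType).alias("banana"),
    sum(when($"name" === "peach", $"value")).cast(IntegerType).alias("peach")
  )

// One aggregation only, so UnsupportedOperationChecker accepts it
pivoted.writeStream
  .format("console")
  .outputMode("complete")
  .start()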