I have two versions of Spark code. The first uses Structured Streaming with a Kafka source:
dfStream.printSchema()
//root
//|-- dt: string (nullable = true)
//|-- ip: string (nullable = true)
//|-- device: string (nullable = true)
val dfWindowed = dfStream
.groupBy($"ip")
.agg(concat_ws(",", collect_list($"device")).alias("devices"))
.writeStream
.outputMode("complete")
.format("memory")
.start()
The second reads from files, but the data is exactly the same as above:
logDF.printSchema()
//root
//|-- dt: string (nullable = true)
//|-- ip: string (nullable = true)
//|-- device: string (nullable = true)
logDF.repartition(32)
.groupBy("ip")
.agg(concat_ws(",", collect_list($"device")).alias("devices"))
The problem is that while the second version runs fine, the first one keeps failing with the following error:
at org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runBatches(StreamExecution.scala:284)
at org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:177)
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 9 in stage 1.0 failed 4 times, most recent failure: Lost task 9.3 in stage 1.0 (TID 28, c3-hadoop-prc-st3417.bj, executor 3): java.lang.RuntimeException: Collect cannot be used in partial aggregations.
It is a long stack trace, but the error boils down to this:
java.lang.RuntimeException: Collect cannot be used in partial aggregations.
I have found several related SO questions, but no solution so far. Any advice would be greatly appreciated.
Answer 0 (score: 0)
I think you could work around this with a groupByKey -> reduceGroups chain, for example:
case class Data(ip: Int, column1: String, column2: String)
import spark.implicits._
val path = "/tmp/spark-streaming/test-data"
Seq(
  (1, "val1", "field1"),
  (1, "val2", "field2"),
  (1, "val3", "field3"),
  (1, "val4", "field4"),
  (2, "val1", "field1"),
  (3, "val1", "field1"),
  (4, "val1", "field1"),
  (4, "val2", "field2")
).toDF("ip", "column1", "column2").write.mode("overwrite").parquet(path)
spark.read.parquet(path).printSchema()
spark.read.parquet(path).show(false)
spark.sql("SET spark.sql.streaming.schemaInference=true")
val stream = spark.readStream.parquet(path).as[Data]
val result = stream
  .groupByKey(_.ip)
  .reduceGroups { (l, r) =>
    l.copy(
      column1 = l.column1.concat(",").concat(r.column1),
      column2 = l.column2.concat(",").concat(r.column2))
  }
  .map(_._2)
result.printSchema()
result.writeStream
.option("checkpointLocation", "/tmp/spark-streaming-checkpoint-test")
.option("truncate", "false")
.format("console")
.outputMode("update")
.start()
.awaitTermination(300000)
Seq(
  (1, "val5", "field5"),
  (2, "val2", "field2"),
  (3, "val2", "field2"),
  (4, "val3", "field3")
).toDF("ip", "column1", "column2").write.mode("append").parquet(path)
This produces output along the lines of:
+---+-------------------+---------------------------+
|ip |column1 |column2 |
+---+-------------------+---------------------------+
|1 |val1,val2,val3,val4|field1,field2,field3,field4|
|3 |val1 |field1 |
|4 |val1,val2 |field1,field2 |
|2 |val1 |field1 |
+---+-------------------+---------------------------+
Note: as of 2.3.1, complete output mode does not support this aggregation, which is why the example above uses update mode.
Hope this helps!
Answer 1 (score: 0)
I ended up writing a UDAF, as suggested here.
import org.apache.spark.sql.Row
import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
import org.apache.spark.sql.types._

class ConcatString extends UserDefinedAggregateFunction {
// These are the input fields for your aggregate function.
override def inputSchema: org.apache.spark.sql.types.StructType =
StructType(StructField("value", StringType) :: Nil)
// These are the internal fields you keep for computing your aggregate.
override def bufferSchema: StructType = StructType(
StructField("concated", StringType) :: Nil)
// This is the output type of your aggregation function.
override def dataType: DataType = StringType
override def deterministic: Boolean = true
// This is the initial value for your buffer schema.
override def initialize(buffer: MutableAggregationBuffer): Unit = {
buffer(0) = "-1"
}
// This is how to update your buffer schema given an input.
override def update(buffer: MutableAggregationBuffer, input: Row): Unit = {
buffer(0) = (buffer.getAs[String](0) + ",,," + input.getAs[String](0))
.stripPrefix("-1,,,")
}
// This is how to merge two objects with the bufferSchema type.
override def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit = {
buffer1(0) = (buffer1.getAs[String](0) + ",,," + buffer2.getAs[String](0))
.stripPrefix("-1,,,")
.stripSuffix(",,,-1")
}
// This is where you output the final value, given the final value of your bufferSchema.
override def evaluate(buffer: Row): Any = {
buffer.getString(0)
}
}
Note: the delimiter is ',,,'. The odd-looking "-1" initialization and the stripPrefix/stripSuffix calls that follow are my crude workaround for the buffer's initial value otherwise getting concatenated into the result.
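For comparison, here is a minimal sketch (my illustration, not part of the original answer) of how the same three methods could be written without the "-1" sentinel, assuming the buffer starts out as an empty string:

// Hypothetical replacements for initialize/update/merge in the ConcatString class above:
// start from an empty buffer and only add the ",,," delimiter when both sides hold data,
// so no sentinel stripping is needed.
override def initialize(buffer: MutableAggregationBuffer): Unit = {
  buffer(0) = ""
}

override def update(buffer: MutableAggregationBuffer, input: Row): Unit = {
  val acc = buffer.getAs[String](0)
  buffer(0) = if (acc.isEmpty) input.getAs[String](0) else acc + ",,," + input.getAs[String](0)
}

override def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit = {
  val left = buffer1.getAs[String](0)
  val right = buffer2.getAs[String](0)
  buffer1(0) =
    if (left.isEmpty) right
    else if (right.isEmpty) left
    else left + ",,," + right
}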
It is used like this:
val udafConcatCol = new ConcatString
val dfWindowed = dfStream
.groupBy($"ip")
.agg(udafConcatCol(col("device")).as("devices"))
....
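For completeness, a sketch of how the query might be wired up from there; the console sink, complete output mode, and checkpoint path are my assumptions, not from the original post:

// Illustrative continuation of the truncated snippet above; the sink, output mode
// and checkpoint path are assumptions.
val query = dfWindowed
  .writeStream
  .outputMode("complete")
  .format("console")
  .option("checkpointLocation", "/tmp/spark-streaming-checkpoint-udaf")
  .start()

query.awaitTermination()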