I want to count the records coming from a streaming source (Kafka) and send that count to some sink.
I have this code:
event_stream.printSchema()

val columnNames = Seq("timestamp", "topic", "value")
var counter = event_stream
  .select(columnNames.head, columnNames.tail: _*)
  .withWatermark("timestamp", "5 minutes")
  .groupBy(
    window($"timestamp", "10 minutes", "5 minutes"),
    $"value")
  .count
  .drop("value")
  .drop("window")
  .withColumnRenamed("count", "kafka.count")
  .withColumnRenamed("topic", "kafka.topic")

counter.printSchema
This is the output I get:
root
|-- key: binary (nullable = true)
|-- value: binary (nullable = true)
|-- topic: string (nullable = true)
|-- partition: integer (nullable = true)
|-- offset: long (nullable = true)
|-- timestamp: timestamp (nullable = true)
|-- timestampType: integer (nullable = true)
root
|-- kafka.count: long (nullable = false)
Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot resolve '`timestamp`' given input columns: [kafka.count];;
'EventTimeWatermark 'timestamp, interval 2 minutes
+- AnalysisBarrier
+- Project [count#30L AS kafka.count#38L]
+- Project [count#30L]
+- Project [window#25-T300000ms, count#30L]
+- Aggregate [window#31-T300000ms, value#8], [window#31-T300000ms AS window#25-T300000ms, value#8, count(1) AS count#30L]
+- Filter ((timestamp#12-T300000ms >= window#31-T300000ms.start) && (timestamp#12-T300000ms < window#31-T300000ms.end))
+- Expand [List(named_struct(start, precisetimestampconversion(((((CASE WHEN (cast(CEIL((cast((precisetimestampconversion(timestamp#12-T300000ms, TimestampType, LongType) - 0) as double) / cast(300000000 as double))) as double) = (cast((precisetimestampconversion(timestamp#12-T300000ms, TimestampType, LongType) - 0) as double) / cast(300000000 as double))) THEN (CEIL((cast((precisetimestampconversion(timestamp#12-T300000ms, TimestampType, LongType) - 0) as double) / cast(300000000 as double))) + cast(1 as bigint)) ELSE CEIL((cast((precisetimestampconversion(timestamp#12-T300000ms, TimestampType, LongType) - 0) as double) / cast(300000000 as double))) END + cast(0 as bigint)) - cast(2 as bigint)) * 300000000) + 0), LongType, TimestampType), end, precisetimestampconversion((((((CASE WHEN (cast(CEIL((cast((precisetimestampconversion(timestamp#12-T300000ms, TimestampType, LongType) - 0) as double) / cast(300000000 as double))) as double) = (cast((precisetimestampconversion(timestamp#12-T300000ms, TimestampType, LongType) - 0) as double) / cast(300000000 as double))) THEN (CEIL((cast((precisetimestampconversion(timestamp#12-T300000ms, TimestampType, LongType) - 0) as double) / cast(300000000 as double))) + cast(1 as bigint)) ELSE CEIL((cast((precisetimestampconversion(timestamp#12-T300000ms, TimestampType, LongType) - 0) as double) / cast(300000000 as double))) END + cast(0 as bigint)) - cast(2 as bigint)) * 300000000) + 0) + 600000000), LongType, TimestampType)), timestamp#12-T300000ms, topic#9, value#8), List(named_struct(start, precisetimestampconversion(((((CASE WHEN (cast(CEIL((cast((precisetimestampconversion(timestamp#12-T300000ms, TimestampType, LongType) - 0) as double) / cast(300000000 as double))) as double) = (cast((precisetimestampconversion(timestamp#12-T300000ms, TimestampType, LongType) - 0) as double) / cast(300000000 as double))) THEN (CEIL((cast((precisetimestampconversion(timestamp#12-T300000ms, TimestampType, LongType) - 0) as double) / 
cast(300000000 as double))) + cast(1 as bigint)) ELSE CEIL((cast((precisetimestampconversion(timestamp#12-T300000ms, TimestampType, LongType) - 0) as double) / cast(300000000 as double))) END + cast(1 as bigint)) - cast(2 as bigint)) * 300000000) + 0), LongType, TimestampType), end, precisetimestampconversion((((((CASE WHEN (cast(CEIL((cast((precisetimestampconversion(timestamp#12-T300000ms, TimestampType, LongType) - 0) as double) / cast(300000000 as double))) as double) = (cast((precisetimestampconversion(timestamp#12-T300000ms, TimestampType, LongType) - 0) as double) / cast(300000000 as double))) THEN (CEIL((cast((precisetimestampconversion(timestamp#12-T300000ms, TimestampType, LongType) - 0) as double) / cast(300000000 as double))) + cast(1 as bigint)) ELSE CEIL((cast((precisetimestampconversion(timestamp#12-T300000ms, TimestampType, LongType) - 0) as double) / cast(300000000 as double))) END + cast(1 as bigint)) - cast(2 as bigint)) * 300000000) + 0) + 600000000), LongType, TimestampType)), timestamp#12-T300000ms, topic#9, value#8)], [window#31-T300000ms, timestamp#12-T300000ms, topic#9, value#8]
+- EventTimeWatermark timestamp#12: timestamp, interval 5 minutes
+- Project [timestamp#12, topic#9, value#8]
+- StreamingRelationV2 org.apache.spark.sql.kafka010.KafkaSourceProvider@5edc3e29, kafka, Map(startingOffsets -> latest, failOnDataLoss -> false, subscribe -> events-identification-carrier, kafka.bootstrap.servers -> svc-kafka-pre-c1-01.jamba.net:9092), [key#7, value#8, topic#9, partition#10, offset#11L, timestamp#12, timestampType#13], StreamingRelation DataSource(org.apache.spark.sql.SparkSession@89caf47,kafka,List(),None,List(),None,Map(startingOffsets -> latest, failOnDataLoss -> false, subscribe -> events-identification-carrier, kafka.bootstrap.servers -> svc-kafka-pre-c1-01.jamba.net:9092),None), kafka, [key#0, value#1, topic#2, partition#3, offset#4L, timestamp#5, timestampType#6]
I added the select because timestamp and topic were not showing up in the schema, and they still aren't. I don't know why. I have a few questions.
Thanks
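For reference, here is a variant I am considering (a sketch only, assuming the same `event_stream` Kafka source and an active `spark` session as above). Note two things: `topic` disappears after the aggregation because it is not part of the `groupBy`, so it has to be a grouping key to survive; and column names containing dots (like `kafka.count`) are parsed by Spark SQL as struct-field access unless backtick-quoted, which can make them unresolvable downstream, so underscores are safer:

```scala
import org.apache.spark.sql.functions.window
import spark.implicits._

// Sketch: keep "topic" as a grouping key so it survives the aggregation,
// and avoid dots in output column names (Spark SQL reads a.b as struct access).
val counter = event_stream
  .select($"timestamp", $"topic", $"value")
  .withWatermark("timestamp", "5 minutes")              // watermark once, before the aggregation
  .groupBy(
    window($"timestamp", "10 minutes", "5 minutes"),
    $"topic")
  .count()
  .drop("window")
  .withColumnRenamed("count", "kafka_count")
  .withColumnRenamed("topic", "kafka_topic")
```

Also, the `EventTimeWatermark 'timestamp, interval 2 minutes` line in the exception suggests a second `withWatermark("timestamp", ...)` is being applied somewhere downstream (e.g. before `writeStream`), at a point where the frame only contains `kafka.count`, which is exactly why `timestamp` cannot be resolved there.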