Unable to compute the count of a streaming DataFrame with Spark Structured Streaming

Asked: 2018-07-09 16:07:52

Tags: apache-spark dataframe apache-kafka spark-structured-streaming

I want to get the count of records from a streaming source (Kafka) and send it to some sink.

I have this code:

event_stream.printSchema()

val columnNames = Seq("timestamp", "topic", "value")
var counter = event_stream
  .select(columnNames.head, columnNames.tail: _*)
  .withWatermark("timestamp", "5 minutes")
  .groupBy(
   window($"timestamp", "10 minutes", "5 minutes"),
   $"value")
  .count
  .drop("value")
  .drop("window")
  .withColumnRenamed("count", "kafka.count")
  .withColumnRenamed("topic", "kafka.topic")

counter.printSchema
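
As context for why `topic` disappears: a streaming aggregation only emits the grouping columns plus the aggregates, so any column not listed in `groupBy` (here, `topic`) is dropped. A minimal sketch of a variant that keeps `topic` in the output (untested against a live Kafka source; `event_stream` is assumed to be the Kafka source DataFrame from above):

```scala
import org.apache.spark.sql.functions.window

// Sketch: group by topic (in addition to the time window) so it survives
// the aggregation. Also avoid dots in column names such as "kafka.count":
// Spark interprets a dot as a struct-field accessor unless the name is
// escaped with backticks, which makes downstream references fragile.
val counter = event_stream
  .withWatermark("timestamp", "5 minutes")
  .groupBy(
    window($"timestamp", "10 minutes", "5 minutes"),
    $"topic")
  .count()
  .withColumnRenamed("count", "kafka_count")
```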

This is the output I get:

root
 |-- key: binary (nullable = true)
 |-- value: binary (nullable = true)
 |-- topic: string (nullable = true)
 |-- partition: integer (nullable = true)
 |-- offset: long (nullable = true)
 |-- timestamp: timestamp (nullable = true)
 |-- timestampType: integer (nullable = true)

root
 |-- kafka.count: long (nullable = false)
Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot resolve '`timestamp`' given input columns: [kafka.count];;
'EventTimeWatermark 'timestamp, interval 2 minutes
+- AnalysisBarrier
      +- Project [count#30L AS kafka.count#38L]
         +- Project [count#30L]
            +- Project [window#25-T300000ms, count#30L]
               +- Aggregate [window#31-T300000ms, value#8], [window#31-T300000ms AS window#25-T300000ms, value#8, count(1) AS count#30L]
                  +- Filter ((timestamp#12-T300000ms >= window#31-T300000ms.start) && (timestamp#12-T300000ms < window#31-T300000ms.end))
                     +- Expand [List(named_struct(start, precisetimestampconversion(((((CASE WHEN (cast(CEIL((cast((precisetimestampconversion(timestamp#12-T300000ms, TimestampType, LongType) - 0) as double) / cast(300000000 as double))) as double) = (cast((precisetimestampconversion(timestamp#12-T300000ms, TimestampType, LongType) - 0) as double) / cast(300000000 as double))) THEN (CEIL((cast((precisetimestampconversion(timestamp#12-T300000ms, TimestampType, LongType) - 0) as double) / cast(300000000 as double))) + cast(1 as bigint)) ELSE CEIL((cast((precisetimestampconversion(timestamp#12-T300000ms, TimestampType, LongType) - 0) as double) / cast(300000000 as double))) END + cast(0 as bigint)) - cast(2 as bigint)) * 300000000) + 0), LongType, TimestampType), end, precisetimestampconversion((((((CASE WHEN (cast(CEIL((cast((precisetimestampconversion(timestamp#12-T300000ms, TimestampType, LongType) - 0) as double) / cast(300000000 as double))) as double) = (cast((precisetimestampconversion(timestamp#12-T300000ms, TimestampType, LongType) - 0) as double) / cast(300000000 as double))) THEN (CEIL((cast((precisetimestampconversion(timestamp#12-T300000ms, TimestampType, LongType) - 0) as double) / cast(300000000 as double))) + cast(1 as bigint)) ELSE CEIL((cast((precisetimestampconversion(timestamp#12-T300000ms, TimestampType, LongType) - 0) as double) / cast(300000000 as double))) END + cast(0 as bigint)) - cast(2 as bigint)) * 300000000) + 0) + 600000000), LongType, TimestampType)), timestamp#12-T300000ms, topic#9, value#8), List(named_struct(start, precisetimestampconversion(((((CASE WHEN (cast(CEIL((cast((precisetimestampconversion(timestamp#12-T300000ms, TimestampType, LongType) - 0) as double) / cast(300000000 as double))) as double) = (cast((precisetimestampconversion(timestamp#12-T300000ms, TimestampType, LongType) - 0) as double) / cast(300000000 as double))) THEN (CEIL((cast((precisetimestampconversion(timestamp#12-T300000ms, TimestampType, LongType) - 0) as 
double) / cast(300000000 as double))) + cast(1 as bigint)) ELSE CEIL((cast((precisetimestampconversion(timestamp#12-T300000ms, TimestampType, LongType) - 0) as double) / cast(300000000 as double))) END + cast(1 as bigint)) - cast(2 as bigint)) * 300000000) + 0), LongType, TimestampType), end, precisetimestampconversion((((((CASE WHEN (cast(CEIL((cast((precisetimestampconversion(timestamp#12-T300000ms, TimestampType, LongType) - 0) as double) / cast(300000000 as double))) as double) = (cast((precisetimestampconversion(timestamp#12-T300000ms, TimestampType, LongType) - 0) as double) / cast(300000000 as double))) THEN (CEIL((cast((precisetimestampconversion(timestamp#12-T300000ms, TimestampType, LongType) - 0) as double) / cast(300000000 as double))) + cast(1 as bigint)) ELSE CEIL((cast((precisetimestampconversion(timestamp#12-T300000ms, TimestampType, LongType) - 0) as double) / cast(300000000 as double))) END + cast(1 as bigint)) - cast(2 as bigint)) * 300000000) + 0) + 600000000), LongType, TimestampType)), timestamp#12-T300000ms, topic#9, value#8)], [window#31-T300000ms, timestamp#12-T300000ms, topic#9, value#8]
                        +- EventTimeWatermark timestamp#12: timestamp, interval 5 minutes
                           +- Project [timestamp#12, topic#9, value#8]
                              +- StreamingRelationV2 org.apache.spark.sql.kafka010.KafkaSourceProvider@5edc3e29, kafka, Map(startingOffsets -> latest, failOnDataLoss -> false, subscribe -> events-identification-carrier, kafka.bootstrap.servers -> svc-kafka-pre-c1-01.jamba.net:9092), [key#7, value#8, topic#9, partition#10, offset#11L, timestamp#12, timestampType#13], StreamingRelation DataSource(org.apache.spark.sql.SparkSession@89caf47,kafka,List(),None,List(),None,Map(startingOffsets -> latest, failOnDataLoss -> false, subscribe -> events-identification-carrier, kafka.bootstrap.servers -> svc-kafka-pre-c1-01.jamba.net:9092),None), kafka, [key#0, value#1, topic#2, partition#3, offset#4L, timestamp#5, timestampType#6]
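
The first lines of the plan point at the cause: the topmost node, `'EventTimeWatermark 'timestamp, interval 2 minutes`, sits *above* the projection whose only column is `kafka.count`. That suggests a second watermark is being applied to `counter` somewhere downstream (in code not shown in the post), at a point where `timestamp` no longer exists:

```scala
// Hypothetical reconstruction of the failing call, inferred from the plan.
// By this point the aggregation and renames have reduced the schema to a
// single column, `kafka.count`, so the watermark column cannot resolve:
counter.withWatermark("timestamp", "2 minutes")
// => AnalysisException: cannot resolve '`timestamp`'
//    given input columns: [kafka.count]

// withWatermark must be called while the event-time column is still
// present, i.e. on `event_stream` before the groupBy, as in the code above.
```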

I added the `select` because `timestamp` and `topic` were not showing up in the schema. And they still aren't, and I don't know why. I have a few questions:

  1. Why doesn't this work — where did I go wrong?
  2. Why is `topic` not shown in the second schema?
  3. Do I need to use a window to get the count?
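
On question 3: a window is not required for a streaming count. A plain aggregation works, provided the sink's output mode supports unbounded aggregates (`update` or `complete`); `append` mode with an aggregation requires a watermark and a window. A hedged sketch (names assumed, not from the post):

```scala
// Sketch: windowless running count per topic. Since the aggregate is
// unbounded, the query must use "update" or "complete" output mode.
val totals = event_stream
  .groupBy($"topic")
  .count()

val query = totals.writeStream
  .outputMode("update") // emit only rows whose count changed this trigger
  .format("console")
  .start()
```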

Thanks

0 Answers:

No answers yet