Question

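I have a Spark Streaming application that processes a stream of website click events. Each event has a property containing a GUID that identifies the user session the event belongs to.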

My application uses windowing to count the number of events that occurred in each session:

def countEvents(kafkaStream: DStream[(String, Event)]): DStream[(String, Session)] = {

  // Get a list of the session GUIDs from the events
  val sessionGuids = kafkaStream
    .map(_._2)
    .map(_.getSessionGuid)

  // Count up the GUIDs over our sliding window
  val sessionGuidCountsInWindow = sessionGuids.countByValueAndWindow(Seconds(60), Seconds(1))

  // Create new session objects with the event count
  sessionGuidCountsInWindow
    .map {
      case (guidS, eventCount) =>
        guidS -> new Session().setGuid(guidS).setEventCount(eventCount)
    }
}

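My understanding is that the countByValueAndWindow function only counts the values in the DStream on which it is called. In other words, in the code above, the call to countByValueAndWindow should return event counts only for the session GUIDs in the sessionGuids DStream on which we call it.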

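But I am observing something different: the call to countByValueAndWindow returns counts for session GUIDs that are not in sessionGuids. It appears to be returning counts for session GUIDs that were processed in previous batches. Am I just misunderstanding how this function works? I haven't been able to find any useful documentation online.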

Answer 1

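A colleague who is more familiar with Spark than I am helped me with this. Apparently I was misunderstanding how the countByValueAndWindow function works. I thought it would only return counts for the values in the DStream on which you call it, that is, the values in the current batch. In fact, it returns counts for all values across the entire window, including values that only appeared in earlier batches. To solve my problem, I simply perform a join between the input DStream and the DStream produced by the countByValueAndWindow operation. That way, I only get results for values in the input DStream.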
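For anyone who wants to see the join spelled out, here is a minimal sketch of one way it could look. It is not my exact production code. The helper name countsForCurrentBatch is made up for illustration, and the sketch assumes the batch interval is 1 second so that the input DStream and the windowed DStream have the same slide duration (a DStream join requires that). As far as I know, countByValueAndWindow is built on an incremental reduce with an inverse function, so a checkpoint directory also has to be set on the StreamingContext.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.Seconds
import org.apache.spark.streaming.dstream.DStream

// Hypothetical helper: counts events per session GUID over the 60-second window,
// but keeps only the GUIDs that occur in the current batch.
def countsForCurrentBatch(sessionGuids: DStream[String]): DStream[(String, Long)] = {

  // Counts for every GUID seen anywhere in the last 60 seconds
  val windowCounts: DStream[(String, Long)] =
    sessionGuids.countByValueAndWindow(Seconds(60), Seconds(1))

  // Distinct GUIDs of the current batch, keyed so they can be joined
  val currentGuids: DStream[(String, Unit)] =
    sessionGuids
      .transform((rdd: RDD[String]) => rdd.distinct())
      .map(guid => (guid, ()))

  // The inner join drops GUIDs that are in the window but not in the current batch
  currentGuids
    .join(windowCounts)
    .map { case (guid, (_, count)) => guid -> count }
}
```

In countEvents from the question, the result of this helper would take the place of sessionGuidCountsInWindow before the final map that builds the Session objects. The distinct step is there so that a GUID appearing several times in one batch does not produce duplicate rows after the join.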

How does Spark Streaming's countByValueAndWindow work?
