Question

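My goal is to read streaming data from a stream (AWS Kinesis in my case) and then query it. The problem is that at every batch interval I want to query the last 5 minutes of data. I found that data can be kept in the stream for a period of time using the StreamingContext.remember(Duration duration) method. Zeppelin's Spark interpreter creates the SparkSession automatically, and I don't know how to configure a StreamingContext. Here is what I've done: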

val df = spark
  .readStream
  .format("kinesis")
  .option("streams", "test")
  .option("endpointUrl", "kinesis.us-west-2.amazonaws.com")
  .option("initialPositionInStream", "latest")
  .option("format", "csv")
  .schema(/* schema definition */)
  .load

So far so good. Then, as far as I understand, the streaming context is started when the write stream is set up and started:

df.writeStream
  .format(/* output source */)
  .outputMode("complete")
  .start()

But with only a SparkSession, I don't know how to run a query over the last X minutes of data. Any suggestions?
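
For illustration, one way this kind of "last 5 minutes" query might be expressed with only a SparkSession is a sliding event-time window in Structured Streaming, rather than StreamingContext.remember. The sketch below is not a confirmed solution for the Kinesis setup above. It assumes Spark's built-in "rate" test source as a stand-in for the Kinesis stream, and the names events, lastFiveMinutes, and last_five_minutes are hypothetical, as is the choice of the in-memory sink.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, window}

// In Zeppelin the `spark` session already exists; getOrCreate() reuses it.
val spark = SparkSession.builder().appName("last-five-minutes").getOrCreate()

// Assumption: the built-in "rate" source stands in for the Kinesis stream.
// It emits rows with a `timestamp` column and a `value` column.
val events = spark.readStream
  .format("rate")
  .option("rowsPerSecond", "10")
  .load()

// Sliding window: each window spans 5 minutes and a new window starts every minute.
// The result has one row per window with the number of events in it.
val lastFiveMinutes = events
  .groupBy(window(col("timestamp"), "5 minutes", "1 minute"))
  .count()

// Assumption: the in-memory sink and the table name are for illustration only.
val query = lastFiveMinutes.writeStream
  .format("memory")
  .queryName("last_five_minutes")
  .outputMode("complete")
  .start()
```

While the query is running, the result can be inspected from another paragraph with spark.table("last_five_minutes"). In complete mode Spark retains every window it has seen, so the result table keeps growing over time.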

Configuring StreamingContext in Apache Zeppelin

0 answers: