Question

I am streaming data into a topic in Google PubSub. I can view that data with some simple Python code:

...
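def callback(message):
    # message.data is bytes, so decode it before concatenating it with str
    print(datetime.now().strftime("%Y-%m-%d %H:%M:%S.%f") + ": message = '" + message.data.decode("utf-8") + "'")
    # Acknowledge the message so Pub/Sub does not redeliver it
    message.ack()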

future = subscriber.subscribe(subscription_name, callback)
future.result()

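The Python code above receives data from the Google PubSub topic (with subscriber subscriber_name) and writes it to the terminal as expected. I would like to stream the same data from the topic into PySpark (an RDD or a DataFrame) so that I can apply other streaming transformations, such as windowing and aggregation in PySpark, as described here: https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html.

That link documents reading from other streaming sources (e.g. Kafka), but not Google PubSub. Is it possible to stream from Google PubSub into PySpark?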

Answer 1

You can use Apache Beam: https://beam.apache.org/

Apache Beam has Python support for Cloud Pub/Sub: https://beam.apache.org/documentation/io/built-in/

There is a Python SDK: https://beam.apache.org/documentation/sdks/python/

It also supports Spark as a runner: https://beam.apache.org/documentation/runners/capability-matrix/
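
As a rough illustration of the Beam route, the sketch below reads from a Pub/Sub subscription, groups the messages into one-minute fixed windows, and counts identical payloads per window. It is only a sketch under assumptions: the helper names `parse_message` and `build_pipeline` are made up for this example, the subscription path is a placeholder you would supply, and the default runner here is the local DirectRunner. Whether a particular runner (Spark included) supports unbounded Pub/Sub reads from the Python SDK should be checked against the capability matrix linked above.

```python
def parse_message(data: bytes) -> tuple:
    """Decode a Pub/Sub payload (bytes) into a (text, 1) pair for counting."""
    return (data.decode("utf-8").strip(), 1)


def build_pipeline(subscription: str, runner: str = "DirectRunner"):
    """Build (but do not run) a streaming pipeline that counts messages per window.

    `subscription` is assumed to be a full path of the form
    "projects/<project>/subscriptions/<subscription>".
    """
    # Imported here so that parse_message can be used without Beam installed.
    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions
    from apache_beam.transforms import window

    options = PipelineOptions(["--runner=" + runner, "--streaming"])
    pipeline = beam.Pipeline(options=options)
    (
        pipeline
        # ReadFromPubSub yields each message payload as bytes.
        | "Read" >> beam.io.ReadFromPubSub(subscription=subscription)
        | "Parse" >> beam.Map(parse_message)
        # Fixed (tumbling) windows of 60 seconds.
        | "Window" >> beam.WindowInto(window.FixedWindows(60))
        # Sum the 1s per distinct payload within each window.
        | "Count" >> beam.CombinePerKey(sum)
        | "Print" >> beam.Map(print)
    )
    return pipeline
```

Calling `build_pipeline("projects/<project>/subscriptions/<subscription>").run().wait_until_finish()` would start the pipeline; it needs the `apache-beam[gcp]` package and valid Google Cloud credentials.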

Answer 2

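You can use Apache Bahir, which provides extensions for Apache Spark, including a connector for Google Cloud Pub/Sub.

You can find an example from Google Cloud Platform that uses Spark on Kubernetes to compute word counts from a data stream received from a Google Cloud PubSub topic and writes the results to a Google Cloud Storage (GCS) bucket.

There is another example that uses DStream to deploy an Apache Spark streaming application on Cloud Dataproc and process messages from Cloud Pub/Sub.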

How to stream data from a Google PubSub topic into PySpark (on Google Cloud)
