Error parsing avro streaming data from kafka in pyspark 3.0.1

Date: 2021-02-20 00:01:16

Tags: pyspark apache-kafka avro

I am trying to consume avro messages from kafka and process them into a delta table in append mode.

I am using pyspark 3.0.1 with delta 0.6.0 on google dataproc 1.5.
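For reference, a minimal sketch of how such a job is usually launched: the Kafka source and spark-avro are external modules that must be supplied explicitly, here assuming Scala 2.12 artifacts matching Spark 3.0.1 and the delta-core version quoted above (the script name is a placeholder):

spark-submit \
  --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.0.1,org.apache.spark:spark-avro_2.12:3.0.1,io.delta:delta-core_2.12:0.6.0 \
  my_streaming_job.py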

The code that consumes data from the kafka topic is:

import os
from pyspark.sql.avro.functions import from_avro

# Read the Avro schema the producer used to encode the message values
MYDIR = os.path.dirname("<directory>")
jsonFormatSchema = open(os.path.join(MYDIR, 'schema_file.avsc'), "r").read()

# Stream from the Kafka topic and decode the binary value column with from_avro
df = spark \
  .readStream \
  .format("kafka") \
  .option("kafka.bootstrap.servers", "<broker_ip>") \
  .option("subscribe", "<topic_name>") \
  .option("startingOffsets", "earliest") \
  .load() \
  .select(from_avro("value", jsonFormatSchema).alias("element"))
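For completeness, a minimal sketch of the append-mode write into the delta table described above; the output and checkpoint paths are placeholders, not taken from the post:

query = df.writeStream \
  .format("delta") \
  .outputMode("append") \
  .option("checkpointLocation", "<checkpoint_dir>") \
  .start("<delta_table_path>")

query.awaitTermination()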

The error message I get is:

Traceback (most recent call last):
  File "<stdin>", line 8, in <module>
  File "/usr/local/lib/python2.7/dist-packages/pyspark/sql/dataframe.py", line 1421, in select
    jdf = self._jdf.select(self._jcols(*cols))
  File "/usr/local/lib/python2.7/dist-packages/pyspark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1305, in __call__
  File "/usr/local/lib/python2.7/dist-packages/pyspark/sql/utils.py", line 128, in deco
    return f(*a, **kw)
  File "/usr/local/lib/python2.7/dist-packages/pyspark/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: <exception str() failed>

0 Answers