I am trying to use a pandas GROUPED_MAP UDF, following the instructions at https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions.pandas_udf.
.withWatermark("kafka_ts", "3 minutes") \
.groupBy(
window("kafka_ts", "3 minutes"),
"grouping_key"
) \
.apply(my_udf_func) \
This fails at runtime with the following error:
ERROR 2020-03-02 06:42:27,241 36498 org.apache.spark.util.Utils [Executor task launch worker for task 171] Aborting task
org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "/usr/local/spark/python/lib/pyspark.zip/pyspark/worker.py", line 377, in main
process()
File "/usr/local/spark/python/lib/pyspark.zip/pyspark/worker.py", line 372, in process
serializer.dump_stream(func(split_index, iterator), outfile)
File "/usr/local/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 286, in dump_stream
for series in iterator:
File "/usr/local/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 304, in load_stream
yield [self.arrow_to_pandas(c) for c in pa.Table.from_batches([batch]).itercolumns()]
File "/usr/local/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 304, in <listcomp>
yield [self.arrow_to_pandas(c) for c in pa.Table.from_batches([batch]).itercolumns()]
File "/usr/local/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 274, in arrow_to_pandas
s = _check_series_convert_date(s, from_arrow_type(arrow_column.type))
File "/usr/local/spark/python/lib/pyspark.zip/pyspark/sql/types.py", line 1672, in from_arrow_type
raise TypeError("Unsupported type in conversion from Arrow: " + str(at))
TypeError: Unsupported type in conversion from Arrow: struct<start: timestamp[us, tz=Etc/UTC], end: timestamp[us, tz=Etc/UTC]>
at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:456)
at org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$1.read(ArrowPythonRunner.scala:172)
at org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$1.read(ArrowPythonRunner.scala:122)
at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:410)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
at org.apache.spark.sql.execution.datasources.v2.DataWritingSparkTask$$anonfun$run$3.apply(WriteToDataSourceV2Exec.scala:117)
at org.apache.spark.sql.execution.datasources.v2.DataWritingSparkTask$$anonfun$run$3.apply(WriteToDataSourceV2Exec.scala:116)
at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1394)
at org.apache.spark.sql.execution.datasources.v2.DataWritingSparkTask$.run(WriteToDataSourceV2Exec.scala:146)
at org.apache.spark.sql.execution.datasources.v2.WriteToDataSourceV2Exec$$anonfun$doExecute$2.apply(WriteToDataSourceV2Exec.scala:67)
at org.apache.spark.sql.execution.datasources.v2.WriteToDataSourceV2Exec$$anonfun$doExecute$2.apply(WriteToDataSourceV2Exec.scala:66)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:123)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
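The error seems to come from the window struct, struct<start: timestamp[us, tz=Etc/UTC], end: timestamp[us, tz=Etc/UTC]>, being passed to Arrow for conversion, which does not support that type. However, disabling Arrow as shown below does not seem to disable the Arrow-based conversion: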
spark = SparkSession.builder \
.master('local[*]') \
.config('spark.executor.memory', '2g') \
.config('spark.driver.memory','8g') \
.config('spark.sql.execution.arrow.enabled', False) \
.getOrCreate()
# nor this
spark.conf.set("spark.sql.execution.arrow.enabled", False)
I am working in a Jupyter notebook environment. Does anyone know what is going on here?
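For reference, the struct named in the error is the value that window() produces: a start timestamp and an end timestamp. Below is a minimal sketch, in plain Python rather than Spark, of how the same 3-minute tumbling window can be written as two flat timestamp values instead of one nested struct. The helper name tumbling_window_bounds is hypothetical, and aligning windows to the Unix epoch is an assumption about tumbling-window semantics, not something stated above. Whether passing flat start/end columns to the UDF avoids the error on this Spark version has not been tested here.

```python
from datetime import datetime, timedelta, timezone

# Assumption: tumbling windows are aligned to the Unix epoch (UTC).
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def tumbling_window_bounds(ts, width=timedelta(minutes=3)):
    """Return (start, end) of the epoch-aligned tumbling window containing ts.

    Hypothetical helper for illustration only; it is not part of PySpark.
    """
    # Time elapsed since the start of the current window.
    offset = (ts - EPOCH) % width
    start = ts - offset
    return start, start + width


# Example timestamp taken from the log line in the error output.
kafka_ts = datetime(2020, 3, 2, 6, 42, 27, tzinfo=timezone.utc)
window_start, window_end = tumbling_window_bounds(kafka_ts)
print(window_start.isoformat(), window_end.isoformat())
```

The two values returned here correspond to the start and end fields of the struct in the traceback, only as separate plain timestamps.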