Pandas UDF with PySpark 2.4

Date: 2020-11-05 23:22:06

Tags: pandas pyspark apache-spark-sql pyspark-dataframes

I am trying to run a pandas_udf with PySpark 2.4, pyarrow 0.15.0, and pandas 0.24.2, following the Spark documentation below, but I get an error when the pandas_udf is invoked.

https://spark.apache.org/docs/2.4.0/sql-pyspark-pandas-with-arrow.html

import pandas as pd
from pyspark.sql.functions import col, pandas_udf
from pyspark.sql.types import LongType

# Declare the function and create the UDF
def multiply_func(a, b):
    return a * b

multiply = pandas_udf(multiply_func, returnType=LongType())

# The function for a pandas_udf should be able to execute with local Pandas data
x = pd.Series([1, 2, 3])

# Create a Spark DataFrame, 'spark' is an existing SparkSession
df = spark.createDataFrame(pd.DataFrame(x, columns=["x"]))

# Execute function as a Spark vectorized UDF
df.select(multiply(col("x"), col("x"))).show()
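With a compatible pyarrow version, the linked documentation shows this example printing:

# +-------------------+
# |multiply_func(x, x)|
# +-------------------+
# |                  1|
# |                  4|
# |                  9|
# +-------------------+

Instead, the show() call fails with the following error: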




Py4JJavaError: An error occurred while calling o64.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost, executor driver): java.lang.IllegalArgumentException
        at java.nio.ByteBuffer.allocate(ByteBuffer.java:334)
        at org.apache.arrow.vector.ipc.message.MessageSerializer.readMessage(MessageSerializer.java:543)

spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")
spark.conf.set("spark.sql.execution.arrow.pyspark.fallback.enabled", "true")

I have tried setting the above properties, but I still see the same error. Can anyone please help me resolve this issue?


1 Answer:

Answer 0 (score: 1):

You can set ARROW_PRE_0_15_IPC_FORMAT=1 in $SPARK_HOME/conf/spark-env.sh. pyarrow 0.15.0 changed the default Arrow IPC binary format, and the Arrow Java library bundled with Spark 2.4 cannot read the new format, which is what surfaces as the java.lang.IllegalArgumentException in MessageSerializer.readMessage. The environment variable tells pyarrow to keep writing the legacy format.
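A minimal sketch of the spark-env.sh entry (the variable has to reach the executors' Python workers, not just the driver):

# $SPARK_HOME/conf/spark-env.sh
# Make pyarrow >= 0.15 emit the pre-0.15 Arrow stream format that Spark 2.4's Java side can read
export ARROW_PRE_0_15_IPC_FORMAT=1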

This issue is documented in the Spark documentation under the compatibility setting for PyArrow >= 0.15.0 with Spark 2.3.x and 2.4.x.
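If editing spark-env.sh is not practical (for example on a managed cluster), here is a sketch of the same workaround applied from the application itself; the use of spark.executorEnv.* and the early os.environ assignment are assumptions about your deployment, not part of the original answer:

import os

# Driver side: set before any Arrow serialization takes place
os.environ["ARROW_PRE_0_15_IPC_FORMAT"] = "1"

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    # Executor side: spark.executorEnv.<NAME> forwards an environment variable to executor processes
    .config("spark.executorEnv.ARROW_PRE_0_15_IPC_FORMAT", "1")
    .getOrCreate()
)

Separately, the spark.sql.execution.arrow.pyspark.* keys set in the question are the Spark 3.0 names; in Spark 2.4 the corresponding flags are spark.sql.execution.arrow.enabled and spark.sql.execution.arrow.fallback.enabled, and in any case they do not fix the IPC format mismatch.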