I'm trying to get Apache Arrow enabled for conversion to pandas. I'm using:

pyspark 2.4.4, pyarrow 0.15.0, pandas 0.25.1, numpy 1.17.2

Here is the sample code:
import pandas as pd

spark.conf.set("spark.sql.execution.arrow.enabled", "true")
x = pd.Series([1, 2, 3])
df = spark.createDataFrame(pd.DataFrame(x, columns=["x"]))
And I got this warning message:
c:\users\administrator\appdata\local\programs\python\python37\lib\site-packages\pyspark\sql\session.py:714: UserWarning: createDataFrame attempted Arrow optimization because 'spark.sql.execution.arrow.enabled' is set to true; however, failed by the reason below:
An error occurred while calling z:org.apache.spark.sql.api.python.PythonSQLUtils.readArrowStreamFromFile.
: java.lang.IllegalArgumentException
at java.nio.ByteBuffer.allocate(ByteBuffer.java:334)
at org.apache.arrow.vector.ipc.message.MessageSerializer.readMessage(MessageSerializer.java:543)
at org.apache.spark.sql.execution.arrow.ArrowConverters$$anon$3.readNextBatch(ArrowConverters.scala:243)
at org.apache.spark.sql.execution.arrow.ArrowConverters$$anon$3.<init>(ArrowConverters.scala:229)
at org.apache.spark.sql.execution.arrow.ArrowConverters$.getBatchesFromStream(ArrowConverters.scala:228)
at org.apache.spark.sql.execution.arrow.ArrowConverters$$anonfun$readArrowStreamFromFile$2.apply(ArrowConverters.scala:216)
at org.apache.spark.sql.execution.arrow.ArrowConverters$$anonfun$readArrowStreamFromFile$2.apply(ArrowConverters.scala:214)
at org.apache.spark.util.Utils$.tryWithResource(Utils.scala:2543)
at org.apache.spark.sql.execution.arrow.ArrowConverters$.readArrowStreamFromFile(ArrowConverters.scala:214)
at org.apache.spark.sql.api.python.PythonSQLUtils$.readArrowStreamFromFile(PythonSQLUtils.scala:46)
at org.apache.spark.sql.api.python.PythonSQLUtils.readArrowStreamFromFile(PythonSQLUtils.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)
Attempting non-optimization as 'spark.sql.execution.arrow.fallback.enabled' is set to true.
warnings.warn(msg)
Answer 0 (score: 3)
We made a change in 0.15.0 that makes the default behavior of pyarrow incompatible with older versions of Arrow in Java, and your Spark environment appears to be using that older version.

Your options are to downgrade to pyarrow < 0.15.0 for now, or to set this environment variable so that pyarrow writes the legacy (pre-0.15) IPC stream format:

ARROW_PRE_0_15_IPC_FORMAT=1
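
As a rough sketch of how that flag might be applied (this is not from the original answer; the app name is a placeholder): set it in the driver's environment before the SparkSession starts, and propagate it to executor-side Python workers via Spark's documented spark.executorEnv.* setting so pandas UDFs see it too.

import os

# Set the flag before the SparkSession (and its Python workers) start,
# so pyarrow on the driver writes the legacy pre-0.15 IPC format.
os.environ["ARROW_PRE_0_15_IPC_FORMAT"] = "1"

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("arrow-compat-demo")  # placeholder name
    .config("spark.sql.execution.arrow.enabled", "true")
    # Propagate the same variable to executor-side Python workers.
    .config("spark.executorEnv.ARROW_PRE_0_15_IPC_FORMAT", "1")
    .getOrCreate()
)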
Answer 1 (score: 0)
For using pyarrow==0.15 and calling my pandas UDFs in a Spark 2.4.4 cluster, I struggled to get the ARROW_PRE_0_15_IPC_FORMAT=1 flag set successfully, as mentioned above. I tried setting the flag (1) via export on the command line on the head node, (2) via spark-env.sh and yarn-env.sh on all nodes in the cluster, and (3) in the pyspark code itself in my script on the head node. For unknown reasons, none of these actually set the flag inside the udf.

The simplest solution I found was to set it inside the udf itself:
from pyspark.sql.functions import pandas_udf, PandasUDFType

@pandas_udf("integer", PandasUDFType.SCALAR)
def foo(*args):
    # Set the flag in the Python worker that executes the UDF, so
    # pyarrow there writes the legacy pre-0.15 IPC format.
    import os
    os.environ["ARROW_PRE_0_15_IPC_FORMAT"] = "1"
    # ...
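
A hypothetical usage sketch (not from the original answer), assuming foo's elided body returns an integer pandas Series of the same length as its input:

# Apply the UDF to an integer column "x" of a Spark DataFrame.
df = spark.createDataFrame([(1,), (2,), (3,)], ["x"])
df.select(foo("x").alias("y")).show()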
Hopefully this saves someone else a few hours.