如何在Pyspark中启用Apache Arrow

时间:2019-10-07 11:58:13

标签: pandas pyspark pyarrow

我正在尝试使Apache Arrow能够转换为熊猫。我正在使用:

pyspark 2.4.4 罂粟0.15.0 熊猫0.25.1 numpy 1.17.2

这是示例代码

spark.conf.set("spark.sql.execution.arrow.enabled", "true")
x = pd.Series([1, 2, 3])
df = spark.createDataFrame(pd.DataFrame(x, columns=["x"]))

我收到了此警告消息

c:\users\administrator\appdata\local\programs\python\python37\lib\site-packages\pyspark\sql\session.py:714: UserWarning: createDataFrame attempted Arrow optimization because 'spark.sql.execution.arrow.enabled' is set to true; however, failed by the reason below:
  An error occurred while calling z:org.apache.spark.sql.api.python.PythonSQLUtils.readArrowStreamFromFile.
: java.lang.IllegalArgumentException
    at java.nio.ByteBuffer.allocate(ByteBuffer.java:334)
    at org.apache.arrow.vector.ipc.message.MessageSerializer.readMessage(MessageSerializer.java:543)
    at org.apache.spark.sql.execution.arrow.ArrowConverters$$anon$3.readNextBatch(ArrowConverters.scala:243)
    at org.apache.spark.sql.execution.arrow.ArrowConverters$$anon$3.<init>(ArrowConverters.scala:229)
    at org.apache.spark.sql.execution.arrow.ArrowConverters$.getBatchesFromStream(ArrowConverters.scala:228)
    at org.apache.spark.sql.execution.arrow.ArrowConverters$$anonfun$readArrowStreamFromFile$2.apply(ArrowConverters.scala:216)
    at org.apache.spark.sql.execution.arrow.ArrowConverters$$anonfun$readArrowStreamFromFile$2.apply(ArrowConverters.scala:214)
    at org.apache.spark.util.Utils$.tryWithResource(Utils.scala:2543)
    at org.apache.spark.sql.execution.arrow.ArrowConverters$.readArrowStreamFromFile(ArrowConverters.scala:214)
    at org.apache.spark.sql.api.python.PythonSQLUtils$.readArrowStreamFromFile(PythonSQLUtils.scala:46)
    at org.apache.spark.sql.api.python.PythonSQLUtils.readArrowStreamFromFile(PythonSQLUtils.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:282)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:238)
    at java.lang.Thread.run(Thread.java:748)

Attempting non-optimization as 'spark.sql.execution.arrow.fallback.enabled' is set to true.
  warnings.warn(msg)

2 个答案:

答案 0 :(得分:3)

我们在0.15.0中进行了更改,使pyarrow的默认行为与Java中的较旧版本的Arrow不兼容-您的Spark环境似乎正在使用较旧的版本。

您的选择是

  • 从使用Python的位置设置环境变量ARROW_PRE_0_15_IPC_FORMAT=1
  • 现在降级到pyarrow <0.15.0。

答案 1 :(得分:0)

用于使用pyarrow==0.15在Spark 2.4.4集群中调用我的熊猫UDF。我为成功设置如上所述的ARROW_PRE_0_15_IPC_FORMAT=1标志而感到困惑。

我在头节点上通过export在命令行中设置标记,在群集中所有节点上通过spark-env.shyarn-env.sh在命令行中设置标记,并且(3) )在头节点上我脚本中的pyspark代码本身中。由于未知原因,这些方法都无法在udf中实际设置此标志。

我找到的最简单的解决方案是在udf内部称为

    @pandas_udf("integer", PandasUDFType.SCALAR)
    def foo(*args):
        import os
        os.environ["ARROW_PRE_0_15_IPC_FORMAT"] = "1"
        #...

希望这可以为其他人节省几个小时。