Question

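When I try to convert a PySpark DataFrame to a pandas DataFrame with Arrow enabled, only about half of the rows are converted. The PySpark DataFrame contains about 170,000 rows.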

>> spark.conf.set("spark.sql.execution.arrow.enabled", "true")
>> result_pdf = train_set.select("*").toPandas()
>> result_pdf returns only 65000 rows.

I tried installing and updating pyarrow with the following commands:

>> conda install -c conda-forge pyarrow
>> pip install pyarrow
>> pip install pyspark[sql]

and then ran

>> spark.conf.set("spark.sql.execution.arrow.enabled", "true")
>> result_pdf = train_set.select("*").toPandas()

Every time I run the conversion, I get the following warning message:

C:\Users\MUM1342.conda\envs\snakes\lib\site-packages\pyarrow__init__.py:152: UserWarning: pyarrow.open_stream is deprecated, please use pyarrow.ipc.open_stream
  warnings.warn("pyarrow.open_stream is deprecated, please use "
C:\Users\MUM1342.conda\envs\snakes\lib\site-packages\pyspark\sql\dataframe.py:2138: UserWarning: toPandas attempted Arrow optimization because 'spark.sql.execution.arrow.enabled' is set to true, but has reached the error below and can not continue. Note that 'spark.sql.execution.arrow.fallback.enabled' does not have an effect on failures in the middle of computation.

Actual output:

> train_set.count
> 170256
> result_pdf.shape
> 6500

Expected output:

> train_set.count
> 170256
> result_pdf.shape
> 170256

Answer 1

Please try the following:

Enable Arrow-based columnar data transfers:

spark.conf.set("spark.sql.execution.arrow.enabled", "true")
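
Since the warning in the question says the Arrow path hit an error partway through, it may help to compare the row count before and after the conversion rather than trusting the result. The following is only a minimal sketch: `to_pandas_checked` is a hypothetical helper name, and retrying with Arrow disabled is an assumption about what might work around the problem, not a confirmed fix. On Spark 3.x the setting is named `spark.sql.execution.arrow.pyspark.enabled`, so the key is a parameter.

```python
def to_pandas_checked(spark, sdf, arrow_key="spark.sql.execution.arrow.enabled"):
    """Convert a Spark DataFrame to pandas and verify that no rows were lost.

    Hypothetical helper for illustration; `spark` is a SparkSession and
    `sdf` is a Spark DataFrame.
    """
    expected = sdf.count()  # row count on the Spark side

    # First attempt: Arrow-based transfer, as in the question.
    spark.conf.set(arrow_key, "true")
    pdf = sdf.toPandas()

    if len(pdf) != expected:
        # Assumption: the plain (non-Arrow) path is slower but may return
        # every row when the Arrow path fails in the middle of computation.
        spark.conf.set(arrow_key, "false")
        pdf = sdf.toPandas()

    if len(pdf) != expected:
        raise RuntimeError(
            "toPandas returned %d rows, expected %d" % (len(pdf), expected)
        )
    return pdf
```

With the names from the question, this would be called as `result_pdf = to_pandas_checked(spark, train_set)`. If the retry without Arrow returns all rows, that points at the pyarrow/Spark combination rather than at the data.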
