Slow filtering of pyspark dataframes

Posted: 2018-12-12 07:24:22

Tags: python pandas pyspark pyspark-sql

I have a question about the difference in execution time when filtering pandas and pyspark dataframes:

import time
import numpy as np
import pandas as pd
from random import shuffle

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

df = pd.DataFrame(np.random.randint(1000000, size=400000).reshape(-1, 2))
list_filter = list(range(10000))
shuffle(list_filter)

# pandas is fast 
t0 = time.time()
df_filtered = df[df[0].isin(list_filter)]
print(time.time() - t0)
# 0.0072

df_spark = spark.createDataFrame(df)

# pyspark is slow
t0 = time.time()
df_spark_filtered = df_spark[df_spark[0].isin(list_filter)]
print(time.time() - t0)
# 3.1232

If I increase the length of list_filter to 10000, the execution times are 0.01353 and 17.6768 seconds. The pandas implementation of isin seems to be computationally efficient. Can you explain why filtering a pyspark dataframe is so slow, and how I can perform such filtering fast?

2 Answers:

Answer 0 (score: 3)

You should use a join instead of a filter with an isin clause to speed up the filter operation in pyspark:

import time
import numpy as np
import pandas as pd
from random import shuffle
import pyspark.sql.functions as F

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

df = pd.DataFrame(np.random.randint(1000000, size=400000).reshape(-1, 2))

df_spark = spark.createDataFrame(df)

list_filter = list(range(10000))
list_filter_df = spark.createDataFrame([[x] for x in list_filter], df_spark.columns[:1])
shuffle(list_filter)

# pandas is fast because everything is in memory
t0 = time.time()
df_filtered = df[df[0].isin(list_filter)]
print(time.time() - t0)
# 0.0227580165863
# 0.0127580165863

# pyspark has more overhead, but a broadcast join makes it fast compared to isin with a plain list
t0 = time.time()
df_spark_filtered = df_spark.join(F.broadcast(list_filter_df), df_spark.columns[:1])
print(time.time() - t0)
# 0.0571971035004
# 0.0471971035004
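
Note that both the isin filter in the question and the broadcast join above are lazy transformations, so these timings mostly measure how long Spark takes to build the query plan. A minimal sketch for timing the actual work, assuming you want to compare end-to-end execution, is to trigger an action (count() is just one convenient choice):

t0 = time.time()
# an action forces Spark to actually execute the filtering job
print(df_spark_filtered.count())
print(time.time() - t0)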

Answer 1 (score: 1)

Spark is designed to work with huge amounts of data. If the data fits in a pandas dataframe, pandas will always be faster. The point is that for huge data pandas will fail, while Spark will get the job done (faster than MapReduce, for example).

Spark is usually slower in cases like this because it needs to build a DAG of the operations to perform, like an execution plan, and tries to optimize it.
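
If you want to see the plan Spark builds, DataFrame.explain() prints it. A quick sketch, assuming the df_spark_filtered from the answer above is still in scope:

# prints the physical plan; for the join-based filter it should include
# a broadcast hash join rather than a full shuffle
df_spark_filtered.explain()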

So you should only consider using Spark when the data is really big; otherwise pandas will be faster.

You can check this article for a comparison of pandas and Spark speed: pandas is always faster until the data is too big for it to handle.