I have a question about the time difference when filtering pandas and PySpark DataFrames:
import time
import numpy as np
import pandas as pd
from random import shuffle
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = pd.DataFrame(np.random.randint(1000000, size=400000).reshape(-1, 2))
list_filter = list(range(10000))
shuffle(list_filter)

# pandas is fast
t0 = time.time()
df_filtered = df[df[0].isin(list_filter)]
print(time.time() - t0)
# 0.0072

df_spark = spark.createDataFrame(df)

# pyspark is slow
t0 = time.time()
df_spark_filtered = df_spark[df_spark[0].isin(list_filter)]
print(time.time() - t0)
# 3.1232
If I increase the length of list_filter to 100000, the execution times become 0.01353 and 17.6768 seconds. The pandas implementation of isin seems to be computationally much more efficient. Can you explain why filtering a PySpark DataFrame is so slow, and how I can perform this kind of filtering fast?
Answer 0 (score: 3)
You need to use a join instead of a filter with an isin clause to speed up the filter operation in PySpark:
import time
import numpy as np
import pandas as pd
from random import shuffle
import pyspark.sql.functions as F
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = pd.DataFrame(np.random.randint(1000000, size=400000).reshape(-1, 2))
df_spark = spark.createDataFrame(df)

list_filter = list(range(10000))
list_filter_df = spark.createDataFrame([[x] for x in list_filter], df_spark.columns[:1])
shuffle(list_filter)
# pandas is fast because everything is in memory
t0 = time.time()
df_filtered = df[df[0].isin(list_filter)]
print(time.time() - t0)
# 0.0127580165863

# pyspark has more overhead, but a broadcast join makes it fast compared to isin with a long list
t0 = time.time()
df_spark_filtered = df_spark.join(F.broadcast(list_filter_df), df_spark.columns[:1])
print(time.time() - t0)
# 0.0471971035004
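Note that both Spark timings above mostly measure how long it takes to build the query plan, because Spark transformations are lazy: isin with a 10000-element list is slow largely because every element gets embedded into the plan as a literal, while the broadcast join keeps the plan small. To time the actual filtering you can force an action such as count(). A minimal sketch, reusing df_spark_filtered from the snippet above:

# Trigger an action so the timing covers executing the broadcast join,
# not just constructing the query plan.
t0 = time.time()
n_rows = df_spark_filtered.count()
print(n_rows, time.time() - t0)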
Answer 1 (score: 1)
Spark is designed for massive data. If the data fits in a pandas DataFrame, pandas will be faster instead. The point is that pandas will fail on really big data, while Spark will finish the job (faster than MapReduce, for example).
Spark is usually slower in cases like this because it needs to build a DAG of the operations to perform, i.e. an execution plan, and try to optimize it.
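For example, you can inspect the plan Spark builds before any data is processed. A minimal sketch, reusing df_spark and list_filter from the question; explain(True) prints the logical plans and the physical plan:

# Show the parsed, analyzed and optimized logical plans plus the physical plan
# that Spark builds for this filter, without executing it.
df_spark[df_spark[0].isin(list_filter)].explain(True)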
So you should consider using Spark only when the data is really big; otherwise pandas will be faster.
You can check this article for a comparison between pandas and Spark speed: pandas is always faster, until the data gets too big for it to handle.