I'm running some computations on a small standalone Spark cluster. When I write the code using the Python DataFrame API, it executes much more slowly than when I write the same thing in SQL. I'm curious why that is. I suspect the order of operations may differ between the two versions? ...And if I used the Scala API instead, would that bring my performance close to the SQL version?
Python:
from pyspark.sql import Window, functions
from pyspark.sql.functions import date_format, from_unixtime

df = sc.read.parquet('/data/spark-parquet/table')  # sc is my SparkSession

count_window = Window.partitionBy('col1', 'col2', 'col4', 'col5')
sum_window = Window.partitionBy('col1', 'col2')

# Derive day-of-week (col4) and hour (col5) from the epoch timestamp in col3,
# then attach the two window counts and deduplicate.
(df.select('col1', 'col2',
           date_format(from_unixtime('col3'), 'u').alias('col4'),
           date_format(from_unixtime('col3'), 'H').alias('col5'))
   .withColumn('col6', functions.count('col2').over(count_window))
   .withColumn('col7', functions.count('col2').over(sum_window))  # SQL version uses sum(1) here; equivalent when col2 is never null
   .select('col1', 'col2', 'col4', 'col5', 'col6', 'col7')
   .distinct()
   .show())
SQL:
df = sc.read.parquet('/data/spark-parquet/table')
df.createOrReplaceTempView('data')

# Same computation in SQL: the inner query v derives col4/col5,
# x adds the window aggregates, and DISTINCT deduplicates the result.
sc.sql("""
select distinct
       x.col1, x.col2, x.col4, x.col5, x.col6, x.col7
from (select v.col1, v.col2, v.col4, v.col5,
             count(1) over(partition by v.col1, v.col2, v.col4, v.col5) as col6,
             sum(1) over(partition by v.col1, v.col2) as col7
      from (select col1,
                   col2,
                   date_format(from_unixtime(col3), 'u') as col4,
                   date_format(from_unixtime(col3), 'H') as col5
            from data) v) x
""").show()
The Python version runs in about 4.5 minutes, while the SQL version finishes in 55 seconds. The dataset is 67 million rows, and the Parquet data has roughly 50 partitions.
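
For what it's worth, the partition count can be verified by reading it off the underlying RDD (a quick sketch, assuming nothing repartitions in between):

# Check how many partitions Spark reads the Parquet table into.
df = sc.read.parquet('/data/spark-parquet/table')
print(df.rdd.getNumPartitions())   # reports ~50 for this table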
Thanks!