Optimizing Spark DataFrame partitioning

Posted: 2019-04-30 11:42:23

Tags: apache-spark memory-management pyspark partitioning

This is a follow-up to this post, after which I still have further questions: https://stackoverflow.com/a/39398750/5060792

I have a 4-node cluster (1 driver, 3 workers), where each worker has 16 cores and 62 GB of RAM, and the driver has 8 cores and 12 GB of RAM.

So, following the partitioning "rule of thumb", the number of partitions should be (number of worker nodes * executors per worker node * cores per executor) * 3 or 4. With dynamic allocation I am not sure exactly how many executors are launched on each node, but assuming 3 executors per worker node with 5 cores each, that gives 3 * 3 * 5 * 4 = 180. So should roughly 180 partitions be close to optimal?
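
For reference, here is that back-of-the-envelope calculation as a minimal sketch; the 3 executors per worker and 5 cores per executor are my assumptions, since dynamic allocation hides the real numbers:

# Rule-of-thumb partition count; executor and core counts are assumed, not measured.
n_workers = 3          # worker nodes in the cluster
execs_per_worker = 3   # assumption: executors launched per worker node
cores_per_exec = 5     # assumption: cores per executor
factor = 4             # the "3 or 4" multiplier from the rule of thumb

suggested_partitions = n_workers * execs_per_worker * cores_per_exec * factor
print(suggested_partitions)  # 180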

Given the reproducible code below (where df is a 125,000-row DataFrame with a string column 'text'): with dynamic allocation, Spark puts the imported DataFrame into a single partition.

Before the .repartition(180), df's count() takes about 8 to 10 seconds; after it, about 1 to 2 seconds. The .rdd map function addArrays, on the other hand, takes about 8 to 10 seconds before the .repartition(180) and about 150 seconds after.
Note: I am stuck on Spark 2.2.0, so the Spark SQL array functions are not available to me.
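
(A small wall-clock helper like the sketch below is enough to reproduce these rough timings; it is not part of the pipeline itself.)

import time

def timed(label, fn):
    """Run fn() and print its wall-clock duration in seconds."""
    t0 = time.time()
    result = fn()
    print('{}: {:.1f}s'.format(label, time.time() - t0))
    return result

# usage, e.g.:
# timed('ngrams.count', ngrams.count)
# timed('rdd.map(addArrays).toDF', lambda: ngrams.rdd.map(addArrays)
#       .toDF(colsNotNGrams + ['ngrams_1to3', 'ngrams_1to5']))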

Running .repartition(1) after that does not speed addArrays back up; it consistently takes about 2.5 minutes. However, rebuilding the ngrams df from scratch, where Spark again puts everything into one partition, brings it back down to only a few seconds.

In short: count() gets faster, .rdd.map() gets slower.

I can reproduce these scenarios over and over. Repartitioning either before or after applying any of the functions does not change the timings by any appreciable amount.
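
Since dynamic allocation makes the executor count hard to pin down, the Executors tab of the Spark UI shows what is actually running; as a programmatic sketch, the snippet below goes through the JVM SparkContext handle (an internal, unsupported API, and I am assuming it behaves this way on 2.2.0):

# Count block managers registered with the JVM SparkContext; this includes
# the driver, so subtract 1 to approximate the number of live executors.
# Internal API -- not guaranteed to be stable across Spark versions.
n_registered = sc._jsc.sc().getExecutorMemoryStatus().size()
print('approx. executors: {}'.format(n_registered - 1))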

import pyspark
from pyspark.sql import SparkSession
import pyspark.sql.functions as F
import pyspark.sql.types as T
from pyspark.ml.feature import NGram
from pyspark.ml import Pipeline

spark = (
    SparkSession.builder.master('yarn').appName('test')
    .config('spark.kryoserializer.buffer.max', '1g')
    .config('spark.sql.cbo.enabled', True)
    .config('spark.sql.cbo.joinReorder.enabled', True)
    .config('spark.yarn.executor.memoryOverhead', '2g')
    .config('spark.driver.maxResultSize', '2g')
    .config("spark.port.maxRetries", 100)
    .config('spark.dynamicAllocation.enabled', 'true')
    .config('spark.dynamicAllocation.executorIdleTimeout', '60')
    .config('spark.dynamicAllocation.maxExecutors', '56')
    .config('spark.dynamicAllocation.minExecutors', '0')
    .config('spark.dynamicAllocation.schedulerBacklogTimeout', '1')
    .getOrCreate()
)

sc = spark.sparkContext

sc.defaultParallelism
## my defaultParallelism is 2

placeholder = (
    r"Lorem ipsum dolor sit amet, consectetur adipiscing elit, "
    r"sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. "
    r"Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris "
    r"nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in "
    r"reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla "
    r"pariatur. Excepteur sint occaecat cupidatat non proident, sunt in "
    r"culpa qui officia deserunt mollit anim id est laborum."
)

df = (
    spark.range(0, 250000, 1)
    .withColumn('rand1', (F.rand(seed=12345) * 50).cast(T.IntegerType()))
    .withColumn('text', F.lit(placeholder))
    .withColumn('text', F.expr("substring_index(text, ' ', rand1)" ))
    .withColumn('text', F.split(F.col('text'), ' '))
    .select('text')
)

## Saving and reloading puts into 1 partition on my cluster.
df.write.parquet("df.parquet", mode='overwrite')
df = spark.read.parquet("df.parquet")

!hdfs dfs -du -h
## 1.4 M    4.3 M    df.parquet

ngram01 = NGram(n=1, inputCol="text", outputCol="ngrams01")
ngram02 = NGram(n=2, inputCol="text", outputCol="ngrams02")
ngram03 = NGram(n=3, inputCol="text", outputCol="ngrams03")
ngram04 = NGram(n=4, inputCol="text", outputCol="ngrams04")
ngram05 = NGram(n=5, inputCol="text", outputCol="ngrams05")

ngram_pipeline = (
    Pipeline()
    .setStages([ngram01, ngram02, ngram03, ngram04, ngram05])
)

ngrams = (
    ngram_pipeline
    .fit(df)
    .transform(df)
)

'''RDD Function to combine single-ngram Arrays.'''
colsNotNGrams = [c for c in ngrams.columns if 'ngrams' not in c]
colsNotNGramsTpls = ['(row.{},)'.format(c) for c in ngrams.columns if 'ngrams' not in c]
rddColTupls = ' + '.join(colsNotNGramsTpls)
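# For this DataFrame only 'text' survives the filter, so rddColTupls is the
# string '(row.text,)', which addArrays below eval()s into a one-element tuple.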

def addArrays(row):
    return (
        eval( rddColTupls )
        + (row.ngrams01 + row.ngrams02 + row.ngrams03,) 
        + (row.ngrams01 + row.ngrams02 + row.ngrams03 + row.ngrams04 + row.ngrams05,)
    ) 
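
# addArrays therefore maps each Row to a 3-tuple:
#   (row.text, ngrams01 + ngrams02 + ngrams03, ngrams01 + ... + ngrams05)
# which the .toDF() calls below name ['text', 'ngrams_1to3', 'ngrams_1to5'].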


''' timings before repartitioning '''
ngrams.rdd.getNumPartitions()
# output is 1

ngrams.count()
# takes 8 to 10 seconds

ngrams2 = (
    ngrams
    .rdd.map(addArrays)
    .toDF(colsNotNGrams + ['ngrams_1to3', 'ngrams_1to5'])
)
## takes 8 to 10 seconds

''' timings after repartitioning '''
ngrams = ngrams.repartition(180)
ngrams.rdd.getNumPartitions()
# output is 180

ngrams2 = (
    ngrams
    .rdd.map(addArrays)
    .toDF(colsNotNGrams + ['ngrams_1to3', 'ngrams_1to5'])
)
## now takes 2.5 minutes 

## HOWEVER,
ngrams.count()
# now takes 1 to 2 seconds

''' timings after repartitioning again does not help '''
ngrams = ngrams.repartition(1)
ngrams.rdd.getNumPartitions()
# output is 1

ngrams2 = (
    ngrams
    .rdd.map(addArrays)
    .toDF(colsNotNGrams + ['ngrams_1to3', 'ngrams_1to5'])
)
## still takes 2.5 minutes 

0 Answers