Spark Dataframe native performance vs Pyspark RDD map on simple string split operation

Date: 2016-08-31 17:06:06

Tags: performance apache-spark pyspark apache-spark-sql spark-dataframe

I don't expect the following code to benefit from the Dataframe Catalyst query optimizer, but I do expect there to be a performance difference between the Scala/native performance of string split and the Python performance. However, my performance results are disappointing, as the native Dataframe API appears to be slower.

My test is as follows:

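def get_df(spark):
    # The reader call and input path are cut off in the post; spark.read.csv(...) is assumed.
    return spark.read.csv(...,
                          inferSchema='true', header='true')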

def upsize_df(df, exponent=10):
    for i in range(exponent):
        df = df.unionAll(df)
    return df

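def rdd_ver(df):
    # The middle of this body is cut off in the post; the rdd.map/lambda/toDF
    # calls are an assumed reconstruction that mirrors the split in df_ver.
    df = df.rdd.map(lambda row: row + tuple(row['order_id'].split('-'))).toDF(
                            df.columns + ['psrid', 'eoid'])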

def df_ver(df):
    split_col = pyspark.sql.functions.split(df['order_id'], '-')
    df = df.withColumn('psrid', split_col.getItem(0))
    df = df.withColumn('eoid', split_col.getItem(1))
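
Both versions are meant to produce the same per-row result: split order_id on '-' and append the two pieces as psrid and eoid. A minimal plain-Python sketch of that per-row step, with no Spark involved, might look like the following. The Order tuple, its amount field, the helper name and the sample values are hypothetical stand-ins; only the order_id column and the '-' separator come from the post.

```python
from collections import namedtuple

# Hypothetical stand-in for a Spark Row; only 'order_id' comes from the post.
Order = namedtuple("Order", ["order_id", "amount"])


def add_split_columns(row):
    """Append the two '-'-separated parts of order_id to the row's values."""
    return tuple(row) + tuple(row.order_id.split("-"))


sample = Order(order_id="1234-5678", amount=10)  # made-up values
print(add_split_columns(sample))
```

The RDD version runs this kind of function in Python worker processes for every row, while the DataFrame version expresses the same split as a column expression.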

Cluster/YARN details:

  • Spark 2.0 on AWS
  • 6 executors
  • 2 cores per executor

Test procedure:

  • Create new PySpark shell in IPython
  • Get dataframe of toy-sized dataset (1000 rows)
  • repartition Dataframe to 12 partitions
  • upsize_df with unionAll, to get to 1 million rows
  • run df.count() to force execution of repartition and upsize_df
  • finally, run %time rdd_ver(df) or %time df_ver(df)
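
The row count after the upsizing step follows from simple doubling: each df.unionAll(df) doubles the rows, so 1000 rows become 1000 × 2^10 = 1,024,000. The sketch below does that arithmetic in plain Python. It also tracks the partition count under an assumption that is not stated in the question and is worth verifying with df.rdd.getNumPartitions(): that a union keeps the partitions of both inputs, so the 12 partitions would double along with the rows. The function name is hypothetical.

```python
def upsized_counts(rows, partitions, exponent=10):
    """Return (rows, partitions) after `exponent` self-unions."""
    for _ in range(exponent):
        rows += rows              # df.unionAll(df) doubles the row count
        partitions += partitions  # ASSUMPTION: a union keeps the partitions of both inputs
    return rows, partitions


rows, partitions = upsized_counts(rows=1000, partitions=12)
print(rows, partitions)
```

If that assumption holds, the DataFrame being timed has far more than 12 partitions, which would be worth ruling out before comparing the two versions.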

My results so far have been surprising and disappointing. Here is a sampling of the results I've received, in seconds:

rdd_ver: 14.5, 22.4, 13.1, 24.7, 17.8 --- mean: 18.5

df_ver: 30.5, 26.9, 32.0, 29.7, 39.8 --- mean: 31.8
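
As a quick check of those figures, the means (and the run-to-run spread, which is sizeable for only five samples) can be recomputed with the standard library. This is only a sketch over the ten timings listed above; nothing else is measured here.

```python
from statistics import mean, stdev

# Timings in seconds, copied from the results above.
rdd_times = [14.5, 22.4, 13.1, 24.7, 17.8]
df_times = [30.5, 26.9, 32.0, 29.7, 39.8]

for name, times in (("rdd_ver", rdd_times), ("df_ver", df_times)):
    print(f"{name}: mean={mean(times):.1f}s stdev={stdev(times):.1f}s")

# Ratio of the two means, i.e. how much slower df_ver was on average.
print(f"df_ver / rdd_ver = {mean(df_times) / mean(rdd_times):.2f}")
```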

I'd appreciate any thoughts, either on the test procedure itself (the operation itself is derived from some production code) or on the poor performance of the Dataframe version.


The Spark Web UI indicates that my jobs are not being scheduled/submitted promptly. I am not sure how reliable the Web UI's information is, but the 'Submitted' time displayed on the active job in this screenshot is over a minute after I initially hit 'enter' in the active PySpark session to kick off %time df_ver(df).

[Screenshot: Active Spark Jobs]

Furthermore, it seems that none of the 6 executors are doing anything. They've all apparently been killed by Spark since I wasn't actively doing anything in the Spark session for more than a few seconds. It seems like the entire job is being run by the driver node, but I can't confirm that since I don't know the Spark Web UI well enough.
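
Executors being released after a short idle period is the behaviour of Spark's dynamic allocation when it is enabled, so one way to test that theory is to inspect the session's spark.dynamicAllocation.* settings. The sketch below is a small, hypothetical helper that filters those keys out of (key, value) pairs; in a live session the pairs would come from spark.sparkContext.getConf().getAll(), and the example input here is made up.

```python
# Settings that control whether idle executors are released.
DYNAMIC_ALLOCATION_KEYS = (
    "spark.dynamicAllocation.enabled",
    "spark.dynamicAllocation.executorIdleTimeout",
    "spark.dynamicAllocation.minExecutors",
)


def dynamic_allocation_settings(conf_pairs):
    """Pick the dynamic-allocation settings out of (key, value) pairs.

    Keys that are not set explicitly are reported as None, meaning the
    Spark default applies.
    """
    conf = dict(conf_pairs)
    return {key: conf.get(key) for key in DYNAMIC_ALLOCATION_KEYS}


# Made-up example input; in a live session the pairs would come from
# spark.sparkContext.getConf().getAll().
example = [("spark.dynamicAllocation.enabled", "true"),
           ("spark.executor.cores", "2")]
settings = dynamic_allocation_settings(example)
print(settings)
```

If dynamic allocation turns out to be enabled, part of each timing may be spent waiting for executors to be requested again rather than on the split itself.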


1 answer:

Answer 0 (score: 0)

Why do you think Scala should be faster? Python string operations are very fast:


In [58]: %time "this is my string".split()
CPU times: user 5 µs, sys: 1 µs, total: 6 µs
Wall time: 7.87 µs


bash-3.2$ echo '
object TimeSplit {
  def main(args: Array[String]): Unit = {
    val now = System.nanoTime
    val split = "this is my string".split(" ")
    val diff = System.nanoTime - now
    println("%d microseconds".format(diff/1000))
  }
}' > timesplit.scala

bash-3.2$ scalac timesplit.scala
bash-3.2$ scala TimeSplit
21 microseconds
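
A single timed call, as in both measurements above, mostly captures timer overhead and one-off startup costs, so the two numbers are hard to compare directly. A slightly more robust sketch for the Python side uses the standard timeit module to average over many calls; the iteration count is an arbitrary choice.

```python
import timeit

# Run the split many times so that timer overhead and one-off costs are
# averaged out instead of dominating a single measurement.
iterations = 100000
total_seconds = timeit.timeit('"this is my string".split()', number=iterations)
per_call_us = total_seconds / iterations * 1e6
print("%.3f microseconds per split" % per_call_us)
```

A comparable Scala measurement would likewise need to loop over many calls after a warm-up phase before the per-call figures mean much.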