Join of DataFrames creates CartesianProduct in the physical plan on Spark 1.5.2

Time: 2015-12-18 18:32:09

Tags: join apache-spark dataframe apache-spark-sql cartesian

I am running into performance issues when joining DataFrames created from Avro files using the spark-avro library.

The DataFrames are created from 120K Avro files with a total size of about 1.5 TB. Both DataFrames are huge, with billions of records.

The join of these two DataFrames runs forever. The job runs on a YARN cluster with 300 executors, each with 4 executor cores and 8 GB of memory.

Any insight into this join would be helpful. I have posted the explain plan below. I notice a CartesianProduct in the physical plan and am wondering whether that is causing the performance issue.

Below are the logical and physical plans. (Due to the confidential nature of the data, I cannot post any column names or file names here.)

  == Optimized Logical Plan ==
Limit 21
 Join Inner, [ Join Conditions ]
  Join Inner, [ Join Conditions ]
   Project [ List of columns ]
    Relation [ List of columns ] AvroRelation[ fileName1 ] -- large file - .5 billion records
   InMemoryRelation  [List of columns ], true, 10000, StorageLevel(true, true, false, true, 1), (Repartition 1, false), None
  Project [ List of Columns ]
   Relation[ List of Columns] AvroRelation[ filename2 ] -- another large file - 800 million records

== Physical Plan ==
Limit 21
 Filter (filter conditions)
  CartesianProduct
   Filter (more filter conditions)
    CartesianProduct
     Project (selecting a few columns and applying a UDF to one column)
      Scan AvroRelation[avro file][ columns in Avro File ]
     InMemoryColumnarTableScan [List of columns ], true, 10000, StorageLevel(true, true, false, true, 1), (Repartition 1, false), None)
   Project [ List of Columns ]
    Scan AvroRelation[Avro File][List of Columns]

Code Generation: true

The code is shown below.

val customerDateFormat = new SimpleDateFormat("yyyy/MM/dd");

val dates = new RetailDates()
val dataStructures = new DataStructures()

// Reading CSV Format input files -- retailDates
// This DF has 75 records
val retailDatesWithSchema = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("delimiter", ",")
  .schema(dates.retailDatesSchema)
  .load(datesFile)
  .coalesce(1)
  .cache()

// Create UDF to convert String to Date
val dateUDF: (String => java.sql.Date) = (dateString: String) => new java.sql.Date(customerDateFormat.parse(dateString).getTime())
val stringToDateUDF = udf(dateUDF)

// Reading Avro Format Input Files
// This DF has 500 million records
val userInputDf = sqlContext.read.avro("customerLocation")
val userDf = userInputDf.withColumn("CAL_DT", stringToDateUDF(col("CAL_DT"))).select(
                      "CAL_DT","USER_ID","USER_CNTRY_ID"
                    )

val userDimDf = sqlContext.read.avro(userDimFiles).select("USER_ID","USER_CNTRY_ID","PRIMARY_USER_ID") // This DF has 800 million records

val retailDatesWithSchemaBroadcast = sc.broadcast(retailDatesWithSchema)
val userDimDfBroadcast = sc.broadcast(userDimDf)

val userAndRetailDates = userDf
  .join((retailDatesWithSchemaBroadcast.value).as("retailDates"),
  userDf("CAL_DT") between($"retailDates.WEEK_BEGIN_DATE", $"retailDates.WEEK_END_DATE")
  , "inner")



val userAndRetailDatesAndUserDim = userAndRetailDates
  .join((userDimDfBroadcast.value)
    .withColumnRenamed("USER_ID", "USER_DIM_USER_ID")
    .withColumnRenamed("USER_CNTRY_ID","USER_DIM_COUNTRY_ID")
    .as("userdim")
    , userAndRetailDates("USER_ID") <=> $"userdim.USER_DIM_USER_ID"
      && userAndRetailDates("USER_CNTRY_ID") <=> $"userdim.USER_DIM_COUNTRY_ID"
    , "inner")

userAndRetailDatesAndUserDim.show()

Thanks, Prasad.

1 Answer:

Answer 0 (score: 0)

There is not much to go on here (even if your data and even the column/table names are confidential, it would be useful to see some code that shows what you are trying to achieve), but the CartesianProduct is definitely a problem. O(N^2) is something you really want to avoid on large datasets, and in this particular case it hits all the weak spots in Spark.

Generally speaking, if a join is expanded to an explicit Cartesian product or an equivalent operation, it means the join expression is not based on equality and therefore cannot be optimized with a shuffle-based (or broadcast + hash) join (SortMergeJoin, HashJoin).
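
For illustration, here is a minimal, self-contained sketch (the tiny DataFrames and the column names id, dt, begin, end are made up for the example) showing how the shape of the join condition drives the physical plan on Spark 1.5.x:

import sqlContext.implicits._

val left  = Seq((1, "2015-01-05"), (2, "2015-01-12")).toDF("id", "dt")
val right = Seq(("2015-01-01", "2015-01-07")).toDF("begin", "end")

// Range predicate: no equality, so the plan falls back to CartesianProduct
left.join(right, $"dt" between($"begin", $"end")).explain()

// Equality predicate: Spark can pick a hash- or sort-merge-based join instead
left.join(right, $"dt" === $"begin").explain()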

Edit:

In your case, the following condition is most likely the problem:

userDf("CAL_DT") between($"retailDates.WEEK_BEGIN_DATE", $"retailDates.WEEK_END_DATE")

It would be better to compute WEEK_BEGIN_DATE on userDf and join directly:

$"userDf.WEEK_BEGIN_DATE" === $"retailDates.WEEK_BEGIN_DATE"
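
A hedged sketch of that rewrite is below. It assumes each retail week is a contiguous seven-day range and that weeks begin on Monday, so WEEK_BEGIN_DATE can be derived from CAL_DT with next_day/date_sub; adjust the derivation to however RetailDates actually defines its weeks.

import org.apache.spark.sql.functions.{date_sub, next_day}

// Derive the week start from CAL_DT (Monday-based weeks assumed), then equi-join
val userWithWeek = userDf
  .withColumn("WEEK_BEGIN_DATE", date_sub(next_day($"CAL_DT", "Mon"), 7))
  .as("userDf")

val userAndRetailDates = userWithWeek
  .join(retailDatesWithSchemaBroadcast.value.as("retailDates"),
    $"userDf.WEEK_BEGIN_DATE" === $"retailDates.WEEK_BEGIN_DATE",
    "inner")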

Another small improvement is to parse the dates without using a UDF, for example with the unix_timestamp function.
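
For example, the earlier withColumn step with the UDF could be replaced along these lines (a sketch, assuming CAL_DT arrives as a "yyyy/MM/dd" string, matching the SimpleDateFormat above):

import org.apache.spark.sql.functions.{col, to_date, unix_timestamp}

val userDf = userInputDf
  .withColumn("CAL_DT", to_date(unix_timestamp(col("CAL_DT"), "yyyy/MM/dd").cast("timestamp")))
  .select("CAL_DT", "USER_ID", "USER_CNTRY_ID")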

Edit:

Another issue, pointed out by rchukh, is that <=> in Spark <= 1.6 is expanded to a Cartesian product - SPARK-11111.
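
If USER_ID and USER_CNTRY_ID are never null, a possible workaround (sketched here, not verified against your data) is to use plain === instead of the null-safe <=>, which keeps the second join an equi-join on Spark <= 1.6; note that === and <=> only differ when the join keys contain nulls.

// Same join as in the question, with === in place of <=>
val userAndRetailDatesAndUserDim = userAndRetailDates
  .join(userDimDfBroadcast.value
      .withColumnRenamed("USER_ID", "USER_DIM_USER_ID")
      .withColumnRenamed("USER_CNTRY_ID", "USER_DIM_COUNTRY_ID")
      .as("userdim"),
    userAndRetailDates("USER_ID") === $"userdim.USER_DIM_USER_ID" &&
      userAndRetailDates("USER_CNTRY_ID") === $"userdim.USER_DIM_COUNTRY_ID",
    "inner")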