I have a Spark application that joins 11 tables: essentially, it denormalises a fact table by joining it with all of its dimension tables. The joins happen in Spark. All of the tables live in TiDB, and the job reads them over JDBC connections.
Currently each batch takes 15 minutes, even though the tables only hold roughly 10,000 to 15,000 rows. Are there any tuning parameters for the joins? Is there anything in the code that can be optimised? Is there a better way to do this?
Code snippet
val factTable = sparkSession.sql("select col1,col2,col3... from fact_table where last_modified_time between lowerBound and higherBound")
// Get only the rows required from the dimension tables by generating a where clause
//This generates dim1_id=122 OR dim1_id=123 OR dim1_id=124 OR ...
val dim1TableFilter = factTable.map(fact => s"dim1_id = ${fact.dim1_id}").dropDuplicates().reduce(_+" OR "+_)
val dim1Table = sparkSession.sql(s"select col1,col2,col3.... from dim1Table where ${dim1TableFilter}")
val dim2TableFilter = factTable.map(fact => s"dim2_id = ${fact.dim2_id}").dropDuplicates().reduce(_+" OR "+_)
val dim2Table = sparkSession.sql(s"select col1,col2,col3.... from dim2Table where ${dim2TableFilter}")
val dim3TableFilter = factTable.map(fact => s"dim3_id = ${fact.dim3_id}").dropDuplicates().reduce(_+" OR "+_)
val dim3Table = sparkSession.sql(s"select col1,col2,col3.... from dim3Table where ${dim3TableFilter}")
// ... and so on for dim4 through dim11
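Building each OR-chained filter by hand repeats the same logic eleven times, and a long chain of `OR` predicates is generally harder for the database to optimise than an `IN` list. As a sketch of one way to factor this out (the helper name `buildInClause` is my own, not from the original code), the filter construction could become:

```scala
// Hypothetical helper: builds a SQL IN clause from a list of ids.
// Deduplicates the ids so the clause stays as short as possible.
def buildInClause(column: String, ids: Seq[Long]): String =
  s"$column IN (${ids.distinct.mkString(",")})"

// Example:
// buildInClause("dim1_id", Seq(122L, 123L, 124L, 122L))
// yields "dim1_id IN (122,123,124)"
```

With a SparkSession in scope, each per-dimension query then reduces to something like `sparkSession.sql(s"select ... from dim1Table where ${buildInClause("dim1_id", dim1Ids)}")`, where `dim1Ids` would be the id column collected from the fact table.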
// Finally join fact tables with dimension tables
val denormalisedTable = factTable.join(dim1Table,Seq("dim1_id"))
.join(dim2Table,Seq("dim2_id"))
.join(dim3Table,Seq("dim3_id"))
.join(dim4Table,Seq("dim4_id"))
.join(dim5Table,Seq("dim5_id"))
.join(dim6Table,Seq("dim6_id"))
.join(dim7Table,Seq("dim7_id"))
.join(dim8Table,Seq("dim8_id"))
.join(dim9Table,Seq("dim9_id"))
.join(dim10Table,Seq("dim10_id"))
.join(dim11Table,Seq("dim11_id"))
// Push the batch to Kafka
denormalisedTable
.select(to_json(keyColumns).as("key"), to_json(struct(col1,col2,col3...)).as("value"), current_timestamp().as("timestamp"))
.selectExpr("CAST(key as STRING)", "CAST(value as STRING)", "CAST(timestamp as LONG)")
.write
.format("kafka")
.options(PropertiesParser.getKafkaConf())
.option("topic", topicName)
.save()
Answer 0 (score: 1)
One thing you can evaluate is a map-side (broadcast) join. A broadcast join can be very useful for joins between a large table (the fact table) and smaller tables (the dimensions), and several such joins can then be chained to perform a star-schema join. Essentially, this avoids shuffling the large table across the network, as a regular hash join would.
引用:https://jaceklaskowski.gitbooks.io/mastering-spark-sql/spark-sql-joins-broadcast.html
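For tables of this size (10k–15k rows), a minimal sketch of the broadcast approach using Spark's `broadcast` hint might look like the following (table names follow the question's code; this is an illustration, not the asker's actual job):

```scala
import org.apache.spark.sql.functions.broadcast

// Broadcast each small dimension table so the join runs map-side
// on the executors holding the fact table's partitions, avoiding
// a shuffle of the fact table for every join.
val denormalisedTable = factTable
  .join(broadcast(dim1Table), Seq("dim1_id"))
  .join(broadcast(dim2Table), Seq("dim2_id"))
  // ... and so on for the remaining dimensions
  .join(broadcast(dim11Table), Seq("dim11_id"))
```

Note that Spark also broadcasts automatically when a table's estimated size is below `spark.sql.autoBroadcastJoinThreshold` (10 MB by default), so raising that threshold may achieve the same effect without code changes.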