Spark query planner randomly generates different queries

Time: 2019-01-17 06:52:20

Tags: apache-spark apache-spark-sql

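I have a Spark job that reads data from a table in SQL Server, builds WHERE clauses from it, and passes them as the predicates argument of the jdbc function so they are pushed down to the target table in MySQL. I am using Spark 2.3 on Cloudera.

Predicate code: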

.jdbc(s"<dest_db_url>", config.destTable,predicate, destOptions)

The WHERE clauses I pass in the predicate array look like this:

Array(
  // Each element is the WHERE clause for one JDBC partition.
  "(composite_primary_key1='ABCD' AND composite_primary_key2='123') OR " +
    "(composite_primary_key1='EFGH' AND composite_primary_key2='456')",
  "(composite_primary_key1='WXYZ' AND composite_primary_key2='342') OR " +
    "(composite_primary_key1='QWYS' AND composite_primary_key2='664')"
)

The code snippet that generates the predicates above is:

val predicates = sourceData
.map(row => s"(composite_primary_key1='${row.composite_primary_key1}' AND composite_primary_key2='${row.composite_primary_key2}')")
.reduce(_+" OR "+_)
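
The snippet above reduces all rows into a single string, while the array shown earlier holds several such strings. A minimal sketch of one way to get from rows to that array follows. `KeyRow`, `PredicateBuilder`, and `keysPerPredicate` are hypothetical names that do not appear in the original job, and the quote doubling is only minimal escaping, not full sanitisation for MySQL.

```scala
// Hypothetical stand-in for one row of the SQL Server source data.
final case class KeyRow(composite_primary_key1: String, composite_primary_key2: String)

object PredicateBuilder {
  // Double single quotes, the standard SQL way to embed a quote in a string literal.
  // Assumption: key values contain no backslashes, which MySQL may also treat as escapes.
  private def quote(value: String): String = "'" + value.replace("'", "''") + "'"

  // One parenthesised equality test covering both halves of the composite key.
  def keyClause(row: KeyRow): String =
    s"(composite_primary_key1=${quote(row.composite_primary_key1)} AND " +
      s"composite_primary_key2=${quote(row.composite_primary_key2)})"

  // Split the rows into groups of `keysPerPredicate` and OR the clauses inside each group.
  // Each resulting element is meant to serve as the WHERE clause of one JDBC partition.
  def build(rows: Seq[KeyRow], keysPerPredicate: Int): Array[String] =
    rows.grouped(keysPerPredicate).map(_.map(keyClause).mkString(" OR ")).toArray
}
```

Calling `PredicateBuilder.build(rows, 2)` on the four example keys would give a two-element array shaped like the one shown earlier, with two key pairs per element.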

The query pushed down to MySQL looks like this:

SELECT * FROM myTable WHERE 
(composite_primary_key1 IS NOT NULL AND composite_primary_key2 IS NOT NULL) AND 
((composite_primary_key1='ABCD' AND composite_primary_key2='123') OR 
(composite_primary_key1='EFGH' AND composite_primary_key2='456'))

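The condition (composite_primary_key1 IS NOT NULL AND composite_primary_key2 IS NOT NULL) is prepended to the actual predicate. As a result, the query scans far more rows in MySQL than it needs to, which makes it inefficient.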
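
Judging only from the queries shown in this question, the final WHERE clause looks like a set of extra pushed-down conditions ANDed with one element of the predicate array. The sketch below is a simplified model of that composition, not Spark's actual code. `WhereClauseModel`, `composeWhere`, and both parameter names are hypothetical.

```scala
object WhereClauseModel {
  // pushedFilters: extra conditions the planner chose to push down (may be empty).
  // partitionPredicate: one element of the predicate array passed to jdbc().
  def composeWhere(pushedFilters: Seq[String], partitionPredicate: String): String =
    if (pushedFilters.isEmpty) s"WHERE ($partitionPredicate)"
    else s"WHERE (${pushedFilters.mkString(" AND ")}) AND ($partitionPredicate)"
}
```

Under this model, the only difference between a run with the NULL checks and a run without them is whether the planner had any filters to push down. If that holds, comparing the output of `explain(true)` on the resulting DataFrame for both kinds of run would be a natural place to look.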

This behavior appears to be random. Sometimes Spark generates the correct query, like this:

SELECT * FROM myTable WHERE  
((composite_primary_key1='ABCD' AND composite_primary_key2='123') OR 
(composite_primary_key1='EFGH' AND composite_primary_key2='456'))

without the NULL checks. I have not been able to figure out what causes this random behavior.

0 Answers:

No answers yet.