I have a Spark job that fetches data from a table in SQL Server, builds where-clause predicates from it, and passes them to the jdbc function that reads from the target table in MySQL. I'm using Spark 2.3 on Cloudera.
The predicate code:
.jdbc(s"<dest_db_url>", config.destTable, predicate, destOptions)
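For context, the full read presumably looks something like the sketch below. The URL, credentials, and table name here are placeholders, not the actual values from my config:

```scala
import java.util.Properties
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("predicate-read").getOrCreate()

val destOptions = new Properties()
destOptions.setProperty("user", "myuser")          // placeholder
destOptions.setProperty("password", "mypassword")  // placeholder
destOptions.setProperty("driver", "com.mysql.jdbc.Driver")

// Each element of `predicate` becomes the WHERE clause of one
// partition's query against the MySQL table.
val df = spark.read
  .jdbc("jdbc:mysql://<dest_db_host>/<db>", "myTable", predicate, destOptions)
```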
The where clauses I pass in the predicates array look like this:
Array(
"(composite_primary_key1='ABCD' AND composite_primary_key2='123') OR
(composite_primary_key1='EFGH' AND composite_primary_key2='456')",
"(composite_primary_key1='WXYZ' AND composite_primary_key2='342') OR
(composite_primary_key1='QWYS' AND composite_primary_key2='664')"
)
The snippet that generates the predicates above is:
val predicates = sourceData
.map(row => s"(composite_primary_key1='${row.composite_primary_key1}' AND composite_primary_key2='${row.composite_primary_key2}')")
.reduce(_+" OR "+_)
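Since the jdbc overload expects an Array[String] with one WHERE clause per partition, the map/reduce above is effectively batched into groups of key pairs. A minimal sketch of that batching, using hypothetical key pairs in place of the rows collected from SQL Server:

```scala
// Hypothetical stand-in for the (k1, k2) key pairs from SQL Server
val keys = Seq(("ABCD", "123"), ("EFGH", "456"), ("WXYZ", "342"), ("QWYS", "664"))

// Each element of the resulting array becomes the WHERE clause of one
// JDBC partition, so the keys are batched instead of OR-ing everything
// into a single string.
val batchSize = 2 // tune to control the number of partitions
val predicates: Array[String] = keys
  .map { case (k1, k2) =>
    s"(composite_primary_key1='$k1' AND composite_primary_key2='$k2')"
  }
  .grouped(batchSize)
  .map(_.mkString(" OR "))
  .toArray
```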
The query pushed down to MySQL looks like this:
SELECT * FROM myTable WHERE
(composite_primary_key1 IS NOT NULL AND composite_primary_key2 IS NOT NULL) AND
((composite_primary_key1='ABCD' AND composite_primary_key2='123') OR
(composite_primary_key1='EFGH' AND composite_primary_key2='456'))
The (composite_primary_key1 IS NOT NULL AND composite_primary_key2 IS NOT NULL) condition is prepended to the actual predicate. Because of it, the query scans far more rows in MySQL than necessary, making it inefficient.
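One way I can confirm which filters Spark is pushing down, without watching the MySQL query log, is to print the physical plan; the JDBC scan node lists them under PushedFilters (sketch, assuming the DataFrame read via jdbc is named df):

```scala
// The JDBCRelation scan line in the plan shows e.g.
// "PushedFilters: [IsNotNull(composite_primary_key1), ...]"
// whenever Spark adds the NULL checks.
df.explain(true)
```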
This behavior seems completely random. Sometimes Spark generates the correct query, like
SELECT * FROM myTable WHERE
((composite_primary_key1='ABCD' AND composite_primary_key2='123') OR
(composite_primary_key1='EFGH' AND composite_primary_key2='456'))
without the NULL checks. I can't figure out what causes this random behavior.