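I'm trying to do a fairly simple join on two tables, nothing complicated: load both tables, join them, and update some columns. But it keeps throwing an exception.
I noticed that the job gets stuck on the last partition, 199/200, and eventually crashes.
I suspect the data is skewed, so that all of it ends up in the last partition, 199.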
SELECT COUNT(DISTINCT report_audit) FROM ReportDs returns 1.5 million,
whereas
SELECT COUNT(*) FROM ReportDs returns 57 million.
CPU: 40 cores
Memory: 160G
Here is my sample code:
...
def main(args: Array[String]): Unit = {
  val log = LogManager.getRootLogger
  log.setLevel(Level.INFO)

  val conf = new SparkConf().setAppName("ExampleJob")
    //.setMaster("local[*]")
    //.set("spark.sql.shuffle.partitions", "3000")
    //.set("spark.sql.crossJoin.enabled", "true")
    .set("spark.storage.memoryFraction", "0.02")
    .set("spark.shuffle.memoryFraction", "0.8")
    .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .set("spark.default.parallelism", (CPU * 3).toString)

  val sparkSession = SparkSession.builder()
    .config(conf)
    .getOrCreate()

  // JDBC read options: both tables are read in 200 partitions split on a numeric column
  val reportOpts = Map(
    "url" -> s"jdbc:postgresql://$DB_HOST:$DB_PORT/$DATABASE",
    "driver" -> "org.postgresql.Driver",
    "dbtable" -> "REPORT_TBL",
    "user" -> DB_USER,
    "password" -> DB_PASSWORD,
    "partitionColumn" -> RPT_NUM_PARTITION,
    "lowerBound" -> RPT_LOWER_BOUND,
    "upperBound" -> RPT_UPPER_BOUND,
    "numPartitions" -> "200"
  )
  val accountOpts = Map(
    "url" -> s"jdbc:postgresql://$DB_HOST:$DB_PORT/$DATABASE",
    "driver" -> "org.postgresql.Driver",
    "dbtable" -> ACCOUNT_TBL,
    "user" -> DB_USER,
    "password" -> DB_PASSWORD,
    "partitionColumn" -> ACCT_NUM_PARTITION,
    "lowerBound" -> ACCT_LOWER_BOUND,
    "upperBound" -> ACCT_UPPER_BOUND,
    "numPartitions" -> "200"
  )

  val sc = sparkSession.sparkContext
  import sparkSession.implicits._

  val reportDs = sparkSession.read.format("jdbc").options(reportOpts).load.cache().alias("a")
  val accountDs = sparkSession.read.format("jdbc").options(accountOpts).load.cache().alias("c")

  // Join on the audit key, then derive the output columns from report_id
  val reportData = reportDs.join(accountDs, reportDs("report_audit") === accountDs("reference_id"))
    .withColumn("report_name", when($"report_id" === "xxxx-xxx-asd", $"report_id_ref_1")
      .when($"report_id" === "demoasd-asdad-asda", $"report_id_ref_2")
      // concat joins the strings; `+` on columns is numeric addition and yields null here
      .otherwise(concat($"report_id_ref_1", lit(":"), $"report_id_ref_2")))
    .withColumn("report_version", when($"report_id" === "xxxx-xxx-asd", $"report_version_1")
      .when($"report_id" === "demoasd-asdad-asda", $"report_version_2")
      .otherwise($"report_version_3"))
    .withColumn("status", when($"report_id" === "xxxx-xxx-asd", $"report_status")
      .when($"report_id" === "demoasd-asdad-asda", $"report_status_1")
      .otherwise($"report_id"))
    .select("...")

  val prop = new Properties()
  prop.setProperty("user", DB_USER)
  prop.setProperty("password", DB_PASSWORD)
  prop.setProperty("driver", "org.postgresql.Driver")

  // Append the joined result to the target table
  reportData.write
    .mode(SaveMode.Append)
    .jdbc(s"jdbc:postgresql://$DB_HOST:$DB_PORT/$DATABASE", "cust_report_data", prop)

  sparkSession.stop()
}
I think there should be an elegant way to handle this kind of data skew.
Please advise.
Answer 0 (score: 2)
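If the values of partitionColumn, upperBound, and lowerBound are not set correctly, they can cause exactly this behavior. For example, if lowerBound == upperBound, all of the data is loaded into a single partition, regardless of numPartitions.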
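To make this concrete, here is a simplified sketch of how the bounds are turned into one WHERE clause per partition. It is modelled on the behavior of the Spark 2.x JDBC source as commonly described, not copied from Spark, and details vary between versions. The names `JdbcStrideSketch` and `partitionPredicates`, the column name `id`, and the `1 = 1` stand-in for "no filter" are all illustrative assumptions.

```scala
object JdbcStrideSketch {
  // Simplified model (an assumption, not Spark's actual code) of how the JDBC
  // source derives one WHERE clause per partition from the bounds.
  def partitionPredicates(
      column: String,
      lowerBound: Long,
      upperBound: Long,
      numPartitions: Int): Seq[String] = {
    require(lowerBound <= upperBound, "lowerBound must not exceed upperBound")
    if (numPartitions <= 1 || lowerBound == upperBound) {
      // Degenerate bounds: a single unfiltered partition reads the whole table.
      // "1 = 1" is only an illustrative stand-in for "no filter".
      Seq("1 = 1")
    } else {
      val stride = upperBound / numPartitions - lowerBound / numPartitions
      (0 until numPartitions).map { i =>
        val lo = lowerBound + i * stride
        val hi = lo + stride
        if (i == 0) s"$column < $hi OR $column IS NULL" // also catches rows below lowerBound
        else if (i == numPartitions - 1) s"$column >= $lo" // also catches rows above upperBound
        else s"$column >= $lo AND $column < $hi"
      }
    }
  }
}
```

Under this model, `JdbcStrideSketch.partitionPredicates("id", 0L, 1000L, 4)` produces `id < 250 OR id IS NULL`, `id >= 250 AND id < 500`, `id >= 500 AND id < 750`, and `id >= 750`. The last clause has no upper limit, so if upperBound is far below the real maximum of the column, almost every row falls into the last partition, which would match a job that hangs on partition 199/200.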
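Together, these properties determine which records (and how many) are loaded from the SQL database into each DataFrame partition. The bounds only set the partition stride; they do not filter rows, so any row outside the range still lands in the first or last partition.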
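The effect of badly chosen bounds can be estimated without touching the database. The following is a minimal simulation under a simplified stride model (equal-width ranges, with the first and last partitions open-ended), which is an assumption about the behavior rather than Spark's own code. `JdbcSkewSimulation`, `partitionOf`, and `rowsPerPartition` are hypothetical names, and the key values are synthetic.

```scala
object JdbcSkewSimulation {
  // Index of the partition that would receive `value` under the simplified
  // stride model. This sketch only covers strides of at least 1.
  def partitionOf(value: Long, lowerBound: Long, upperBound: Long, numPartitions: Int): Int = {
    require(numPartitions >= 1, "numPartitions must be at least 1")
    if (numPartitions == 1 || lowerBound == upperBound) {
      0 // degenerate bounds: everything goes to one partition
    } else {
      val stride = upperBound / numPartitions - lowerBound / numPartitions
      require(stride > 0, "this sketch only models strides of at least 1")
      val idx = java.lang.Math.floorDiv(value - lowerBound, stride)
      // Values below lowerBound clamp to the first partition,
      // values above upperBound clamp to the last one.
      math.max(0L, math.min(numPartitions - 1L, idx)).toInt
    }
  }

  // Number of rows each partition would receive for the given key values.
  def rowsPerPartition(
      values: Seq[Long],
      lowerBound: Long,
      upperBound: Long,
      numPartitions: Int): Vector[Int] = {
    val counts = Array.fill(numPartitions)(0)
    values.foreach(v => counts(partitionOf(v, lowerBound, upperBound, numPartitions)) += 1)
    counts.toVector
  }
}
```

For 10,000 synthetic keys `0L until 10000L` split into 4 partitions, `rowsPerPartition(keys, 0L, 1000L, 4)` gives `Vector(250, 250, 250, 9250)`, while `rowsPerPartition(keys, 0L, 10000L, 4)` gives `Vector(2500, 2500, 2500, 2500)`. This suggests checking the real minimum and maximum of the partition column before setting lowerBound and upperBound.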