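I have a Spark job that reads from an AWS Aurora MySQL database. Unfortunately, the job keeps failing with an exception because one of the records contains an invalid datetime.
Sample code: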
val jdbcUrl =
  s"jdbc:mysql://$dbHostname:$dbPort/$dbName?zeroDateTimeBehavior=convertToNull&serverTimezone=UTC"
val props = reportConf.connectionProps(db)
val df = spark.read
  .format("jdbc")
  .option("url", jdbcUrl)
  .option("dbtable",
    s"(SELECT *, MOD($partitionColumn,10) AS partition_key FROM $table ORDER BY $partitionColumn) as $table") // DESC LIMIT 50000
  .option("user", user)
  .option("password", password)
  .option("driver", driver)
  .option("numPartitions", numPartitions)
  .option("partitionColumn", "partition_key")
  .option("lowerBound", 0)
  .option("upperBound", 9)
  .option("mode", "DROPMALFORMED")
  .load()
  .drop('partition_key)
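I have already tried setting zeroDateTimeBehavior=convertToNull in my jdbcUrl to convert zero date values to null, but it does not work.
Preferably, I would like to skip the record or replace the value with some default that I can filter on later, rather than manually identifying the bad records in the database table.
Any ideas on how to resolve this?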
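One workaround I am considering is to stop the MySQL driver from decoding the column at all: cast it to text inside the pushdown subquery and validate it on the Spark side. Below is a minimal sketch of the two pieces that would need, written without any Spark dependency. `SafeDateRead`, `parseDate`, `pushdownQuery`, and the `columns` / `dateColumns` parameters are hypothetical names of mine, and it assumes the offending column is a DATE rendered as `yyyy-MM-dd` (the trace goes through `ResultSetImpl.getDate`). I have not run it against the real table.

```scala
import java.sql.Date
import java.time.LocalDate
import java.time.format.{DateTimeFormatter, ResolverStyle}
import scala.util.Try

object SafeDateRead {
  // STRICT resolution rejects impossible dates such as 2019-02-30 or 0000-00-00
  // instead of adjusting them; the "uuuu" year pattern is needed under STRICT.
  private val strictDate: DateTimeFormatter =
    DateTimeFormatter.ofPattern("uuuu-MM-dd").withResolverStyle(ResolverStyle.STRICT)

  /** Parses a DATE that was read as text; returns None for null or invalid input. */
  def parseDate(raw: String): Option[Date] =
    Option(raw).flatMap { s =>
      Try(LocalDate.parse(s.trim, strictDate)).toOption.map(Date.valueOf)
    }

  /** Builds the pushdown subquery, casting the listed date columns to CHAR
    * so the JDBC driver returns them as plain strings.
    * Assumption: `columns` is the full column list of the table.
    */
  def pushdownQuery(table: String,
                    partitionColumn: String,
                    columns: Seq[String],
                    dateColumns: Set[String]): String = {
    val selectList = columns.map { c =>
      if (dateColumns.contains(c)) s"CAST($c AS CHAR) AS $c" else c
    }.mkString(", ")
    s"(SELECT $selectList, MOD($partitionColumn, 10) AS partition_key " +
      s"FROM $table ORDER BY $partitionColumn) AS $table"
  }
}
```

The string from `pushdownQuery` would replace the current "dbtable" value, and `parseDate` could presumably be wrapped in a Spark UDF so that bad values become null and can be filtered later. The explicit column list is needed because `SELECT *` would still return the original date column to the driver.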
Exception:
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:678)
Caused by: java.lang.IllegalArgumentException: DAY_OF_MONTH
at java.util.GregorianCalendar.computeTime(GregorianCalendar.java:2648)
at java.util.Calendar.updateTime(Calendar.java:3393)
at java.util.Calendar.getTimeInMillis(Calendar.java:1782)
at com.mysql.cj.jdbc.io.JdbcDateValueFactory.createFromDate(JdbcDateValueFactory.java:67)
at com.mysql.cj.jdbc.io.JdbcDateValueFactory.createFromDate(JdbcDateValueFactory.java:39)
at com.mysql.cj.core.io.ZeroDateTimeToNullValueFactory.createFromDate(ZeroDateTimeToNullValueFactory.java:41)
at com.mysql.cj.core.io.BaseDecoratingValueFactory.createFromDate(BaseDecoratingValueFactory.java:46)
at com.mysql.cj.core.io.BaseDecoratingValueFactory.createFromDate(BaseDecoratingValueFactory.java:46)
at com.mysql.cj.core.io.MysqlTextValueDecoder.decodeDate(MysqlTextValueDecoder.java:66)
at com.mysql.cj.mysqla.result.AbstractResultsetRow.decodeAndCreateReturnValue(AbstractResultsetRow.java:70)
at com.mysql.cj.mysqla.result.AbstractResultsetRow.getValueFromBytes(AbstractResultsetRow.java:225)
at com.mysql.cj.mysqla.result.TextBufferRow.getValue(TextBufferRow.java:122)
at com.mysql.cj.jdbc.result.ResultSetImpl.getNonStringValueFromRow(ResultSetImpl.java:630)
at com.mysql.cj.jdbc.result.ResultSetImpl.getDateOrTimestampValueFromRow(ResultSetImpl.java:643)
at com.mysql.cj.jdbc.result.ResultSetImpl.getDate(ResultSetImpl.java:788)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$makeGetter$2.apply(JdbcUtils.scala:389)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$makeGetter$2.apply(JdbcUtils.scala:387)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anon$1.getNext(JdbcUtils.scala:356)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anon$1.getNext(JdbcUtils.scala:338)
at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
at org.apache.spark.util.CompletionIterator.hasNex