Python 2.7
PySpark 2.2.1
JDBC format for MySQL -> Spark DataFrame
For writing Spark DataFrame -> AWS Redshift I am using the `Spark-Redshift` driver from Databricks.
Depending on the context and my input parameters, I read data from a MySQL table into Spark, and I need to fetch either all records updated before today, or only the records updated up to and including a requested date. The read looks like this:
df = spark.read.format("jdbc") \
    .option("url", "url") \
    .option("driver", driver) \
    .option("dbtable", query) \
    .load()
where query is built as:
from datetime import date, timedelta

if days > 0:
    get_date = date.today() - timedelta(days)
    query = "(SELECT * FROM {} WHERE CAST({}.updatedAt AS date) >= DATE('{}') " \
            "AND CAST({}.updatedAt AS date) < CURDATE()) AS t".format(table, table, get_date, table)
elif days == 0:
    query = "(SELECT * FROM {} WHERE CAST({}.updatedAt AS date) < CURDATE() " \
            "OR updatedAt IS NULL) AS t".format(table, table)
Once the data is read into a Spark DataFrame, and since the ETL involves no other timestamp-related operations, the timestamp column is dropped. The final step writes the processed records to an AWS Redshift table.
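The write itself looks roughly like this (simplified sketch; result_df is the DataFrame after the ETL, and the Redshift JDBC URL, target table, S3 temp dir and credentials handling are placeholders specific to my setup):

# Simplified sketch of the Redshift write; all names below are placeholders.
result_df.drop("updatedAt") \
    .write \
    .format("com.databricks.spark.redshift") \
    .option("url", redshift_url) \
    .option("dbtable", target_table) \
    .option("tempdir", s3_temp_dir) \
    .option("forward_spark_s3_credentials", "true") \
    .mode("append") \
    .save()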
My problem is that the application sometimes crashes with
Caused by: java.lang.ClassCastException: java.lang.String cannot be cast to java.sql.Timestamp
when writing to Redshift, but I think the cast at read time is the real culprit and Spark's lazy execution just surfaces the exception only at the Redshift write (the target Redshift table has no timestamp or date columns).
Over the past month, across 4 different jobs that run daily, this exception has shown up in the logs roughly 15% of the time and the job then fails; the rest of the time it runs fine, which makes it impossible to reproduce or debug further.
I suspect the String -> Timestamp cast in the SQL query is causing the problem, but I am not sure how to achieve the same result in another way without hitting this exception. Any help is appreciated!
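One direction I have considered, but not verified, is normalising the column on the Spark side right after the JDBC read, before it is dropped, so nothing downstream can ever see it as a string:

from pyspark.sql.functions import col

# Untested idea: force updatedAt to a proper timestamp immediately after the read,
# before the column is dropped later in the ETL.
df = df.withColumn("updatedAt", col("updatedAt").cast("timestamp"))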
More of the stack trace:
py4j.protocol.Py4JJavaError: An error occurred while calling o827.save.
: org.apache.spark.SparkException: Job aborted.
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply$mcV$sp(FileFormatWriter.scala:213)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:166)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:166)
and
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:135)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:116)
at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:92)
at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:92)
at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:609)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:233)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:280)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:214)
at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 601 in stage 93.0 failed 4 times, most recent failure: Lost task 601.3 in stage 93.0 (TID 5282, url, executor 5): org.apache.spark.SparkException: Task failed while writing rows
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:270)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$apply$mcV$sp$1.apply(FileFormatWriter.scala:189)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$apply$mcV$sp$1.apply(FileFormatWriter.scala:188)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:108)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.ClassCastException: java.lang.String cannot be cast to java.sql.Timestamp
at com.databricks.spark.redshift.RedshiftWriter$$anonfun$7$$anonfun$apply$3.apply(RedshiftWriter.scala:234)
at com.databricks.spark.redshift.RedshiftWriter$$anonfun$7$$anonfun$apply$3.apply(RedshiftWriter.scala:233)
at com.databricks.spark.redshift.RedshiftWriter$$anonfun$8$$anonfun$apply$5.apply(RedshiftWriter.scala:252)
at com.databricks.spark.redshift.RedshiftWriter$$anonfun$8$$anonfun$apply$5.apply(RedshiftWriter.scala:248)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.execute(FileFormatWriter.scala:324)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:256)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:254)
at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1371)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:259)