How can I use Delta with the Spark 3.0 Preview?

Asked: 2019-11-30 12:14:57

Tags: apache-spark delta-lake

Spark 3.0 fails to save a DataFrame as a Delta table on HDFS.

  • Scala version 2.12.10
  • Spark 3.0 Preview

The same operation works on 2.4.4, but the partitions are not created.

Sample input:

Vehicle_id|model|brand|year|miles|intake_date_time
v0001H|verna|Hyundai|2011|5000|2018-01-20 06:30:00
v0001F|Eco-sport|Ford|2013|4000|2018-02-10 06:30:00
v0002F|Endeavour|Ford|2011|8000|2018-04-12 06:30:00
v0001L|Gallardo|Lambhorghini|2013|2000|2018-05-16 06:30:00
// Reading
val deltaTableInput1 = spark.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .option("delimiter", "|")
  .option("inferSchema", "true")
  .load("file")
  // The nested substring/concat rearranges a dd-MM-yyyy HH:mm:ss string
  // into yyyy-MM-dd HH:mm:ss so that the CAST to TIMESTAMP succeeds
  .selectExpr(
    "Vehicle_id", "model", "brand", "year", "month", "miles",
    "CAST(concat(substring(intake_date_time,7,4),concat(substring(intake_date_time,3,4),concat(substring(intake_date_time,1,2),substring(intake_date_time,11,9)))) AS TIMESTAMP) as intake_date_time")
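As an aside, the substring shuffle can be expressed more directly with to_timestamp and an explicit pattern. This is only a sketch: it assumes the raw intake_date_time strings really arrive as dd-MM-yyyy HH:mm:ss (which is what the substring positions imply), and rawInput is a hypothetical name for the DataFrame as loaded from the CSV, before any casting:

import org.apache.spark.sql.functions.{col, to_timestamp}

// Hypothetical equivalent of the CAST above: parse the raw string with
// an explicit pattern instead of rearranging it character by character
val withTimestamp = rawInput.withColumn(
  "intake_date_time",
  to_timestamp(col("intake_date_time"), "dd-MM-yyyy HH:mm:ss"))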

// Writing
deltaTableInput1.write
  .mode("overwrite")
  .partitionBy("brand", "model", "year", "month")
  .format("delta")
  .save("path")

Error:

com.google.common.util.concurrent.ExecutionError: java.lang.NoSuchMethodError: org.apache.spark.util.Utils$.classForName(Ljava/lang/String;)Ljava/lang/Class;
  at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2261)
  at com.google.common.cache.LocalCache.get(LocalCache.java:4000)
  at com.google.common.cache.LocalCache$LocalManualCache.get(LocalCache.java:4789)
  at org.apache.spark.sql.delta.DeltaLog$.apply(DeltaLog.scala:714)
  at org.apache.spark.sql.delta.DeltaLog$.forTable(DeltaLog.scala:676)
  at org.apache.spark.sql.delta.sources.DeltaDataSource.createRelation(DeltaDataSource.scala:124)
  at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:46)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:71)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:69)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:87)
  at org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:189)
  at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:227)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
  at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:224)
  at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:185)
  at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:110)
  at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:109)
  at org.apache.spark.sql.DataFrameWriter.$anonfun$runCommand$1(DataFrameWriter.scala:829)
  at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$4(SQLExecution.scala:100)
  at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:160)
  at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:87)
  at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:829)
  at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:309)
  at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:293)
  at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:236)
  ... 47 more
Caused by: java.lang.NoSuchMethodError: org.apache.spark.util.Utils$.classForName(Ljava/lang/String;)Ljava/lang/Class;
  at org.apache.spark.sql.delta.storage.LogStoreProvider.createLogStore(LogStore.scala:122)
  at org.apache.spark.sql.delta.storage.LogStoreProvider.createLogStore$(LogStore.scala:120)
  at org.apache.spark.sql.delta.DeltaLog.createLogStore(DeltaLog.scala:58)
  at org.apache.spark.sql.delta.storage.LogStoreProvider.createLogStore(LogStore.scala:117)
  at org.apache.spark.sql.delta.storage.LogStoreProvider.createLogStore$(LogStore.scala:115)
  at org.apache.spark.sql.delta.DeltaLog.createLogStore(DeltaLog.scala:58)
  at org.apache.spark.sql.delta.DeltaLog.<init>(DeltaLog.scala:79)
  at org.apache.spark.sql.delta.DeltaLog$$anon$3.$anonfun$call$2(DeltaLog.scala:718)
  at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.allowInvokingTransformsInAnalyzer(AnalysisHelper.scala:194)
  at org.apache.spark.sql.delta.DeltaLog$$anon$3.$anonfun$call$1(DeltaLog.scala:718)
  at com.databricks.spark.util.DatabricksLogging.recordOperation(DatabricksLogging.scala:77)
  at com.databricks.spark.util.DatabricksLogging.recordOperation$(DatabricksLogging.scala:67)
  at org.apache.spark.sql.delta.DeltaLog$.recordOperation(DeltaLog.scala:645)
  at org.apache.spark.sql.delta.metering.DeltaLogging.recordDeltaOperation(DeltaLogging.scala:103)
  at org.apache.spark.sql.delta.metering.DeltaLogging.recordDeltaOperation$(DeltaLogging.scala:89)
  at org.apache.spark.sql.delta.DeltaLog$.recordDeltaOperation(DeltaLog.scala:645)
  at org.apache.spark.sql.delta.DeltaLog$$anon$3.call(DeltaLog.scala:717)
  at org.apache.spark.sql.delta.DeltaLog$$anon$3.call(DeltaLog.scala:714)
  at com.google.common.cache.LocalCache$LocalManualCache$1.load(LocalCache.java:4792)
  at com.google.common.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3599)
  at com.google.common.cache.LocalCache$Segment.loadSync(LocalCache.java:2379)
  at com.google.common.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2342)
  at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2257)
  ... 71 more

In the Spark 2.4.4 REPL the same write succeeds, though without creating the partitions.

The error above appears only on Spark 3.0.

1 Answer:

Answer 0 (score: 0)

Found on Slack:

  Spark 3.0 is significantly different from Spark 2.4, so Delta does not work on it yet.

The NoSuchMethodError above is a symptom of exactly this: Spark 3.0 changed the signature of org.apache.spark.util.Utils.classForName, against which the current Delta releases were compiled.
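Until Delta publishes a Spark 3.0 build, the safest route is to stay on a known-compatible pair. A minimal build.sbt sketch, assuming Delta 0.4.0 (the newest release at the time of writing) on Spark 2.4.4 with Scala 2.12:

// build.sbt -- pin Spark and Delta to versions that are compiled
// against each other (Delta 0.4.0 targets Spark 2.4.x)
scalaVersion := "2.12.10"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-sql"  % "2.4.4" % Provided,
  "io.delta"         %% "delta-core" % "0.4.0"
)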

There is a branch for it, though: https://github.com/delta-io/delta/tree/spark-3.0-snapshot
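If you want to experiment with that branch before an official release, one untested approach is to build and publish it locally and then depend on the local artifact. Both the build step and the snapshot version string below are assumptions; check the repository's README:

// Hypothetical build.sbt: after cloning delta-io/delta, checking out the
// spark-3.0-snapshot branch and running `build/sbt publishLocal`
// (assumed build step), an sbt project can resolve the artifact
// from the local Ivy cache
scalaVersion := "2.12.10"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-sql"  % "3.0.0-preview" % Provided,
  "io.delta"         %% "delta-core" % "0.5.0-SNAPSHOT" // assumed snapshot version
)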