When should REFRESH TABLE my_table be run in Spark?

Asked: 2018-03-12 11:46:24

Tags: apache-spark hive apache-spark-sql

Consider the following code:

 import org.apache.spark.sql.hive.orc._
 import org.apache.spark.sql._

 val path = ...
 val dataFrame: DataFrame = ...

 val hiveContext = new org.apache.spark.sql.hive.HiveContext(sparkContext)
 dataFrame.createOrReplaceTempView("my_table")
 val results = hiveContext.sql("select * from my_table")
 results.write.mode(SaveMode.Append).partitionBy("my_column").format("orc").save(path)
 hiveContext.sql("REFRESH TABLE my_table")

This code is executed twice, with the same path but a different dataFrame each time. The first run succeeds, but the second fails with:

Caused by: java.io.FileNotFoundException: File does not exist: hdfs://somepath/somefile.snappy.orc
It is possible the underlying files have been updated. You can explicitly invalidate the cache in Spark by running 'REFRESH TABLE tableName' command in SQL or by recreating the Dataset/DataFrame involved.

I tried clearing the cache and calling hiveContext.dropTempTable("tableName"), and neither had any effect. When should REFRESH TABLE tableName be called (before the write, after it, or some other variant) to fix this kind of error?

1 Answer:

Answer 0 (score: 2)

For those arriving here from Google:

You can run spark.catalog.refreshTable(tableName) or spark.sql(s"REFRESH TABLE $tableName") before the write operation. I had the same problem, and this fixed it.

// Invalidate the cached metadata (including the cached file listing)
// for the table before writing, so the write does not hit stale entries.
spark.catalog.refreshTable(tableName)
df.write.mode(SaveMode.Overwrite).insertInto(tableName)
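
Applied to the code from the question, a minimal sketch might look like this. It assumes the Spark 2.x SparkSession entry point (the question's HiveContext would behave the same way) and keeps the path and DataFrame placeholders from the original:

 import org.apache.spark.sql.{DataFrame, SaveMode, SparkSession}

 // enableHiveSupport mirrors the original HiveContext setup.
 val spark: SparkSession = SparkSession.builder().enableHiveSupport().getOrCreate()

 val path = ...
 val dataFrame: DataFrame = ...

 dataFrame.createOrReplaceTempView("my_table")
 // Refresh before the read and write so Spark re-lists the files under the
 // path instead of reusing the stale cached listing from the previous run.
 spark.catalog.refreshTable("my_table")
 val results = spark.sql("select * from my_table")
 results.write.mode(SaveMode.Append).partitionBy("my_column").format("orc").save(path)

The key point is that the refresh runs before the second execution touches the path, which invalidates the cached file listing that caused the FileNotFoundException.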