Question

I have the following Spark program in Scala:

import org.apache.spark.sql.functions.lit

val dfA = sqlContext.sql("select * from employees where id in ('Emp1', 'Emp2')" )
val dfB = sqlContext.sql("select * from employees where id not in ('Emp1', 'Emp2')" )
val dfN = dfA.withColumn("department", lit("Finance"))
val dfFinal = dfN.unionAll(dfB)
dfFinal.registerTempTable("intermediate_result")

dfA.unpersist
dfB.unpersist
dfN.unpersist
dfFinal.unpersist

val dfTmp = sqlContext.sql("select * from intermediate_result")
dfTmp.write.mode("overwrite").format("parquet").saveAsTable("employees")
dfTmp.unpersist

When I try to save it, I get the following error:

org.apache.spark.sql.AnalysisException: Cannot overwrite table employees that is also being read from.
    at org.apache.spark.sql.execution.datasources.PreWriteCheck.failAnalysis(rules.scala:106)
    at org.apache.spark.sql.execution.datasources.PreWriteCheck$$anonfun$apply$3.apply(rules.scala:182)
    at org.apache.spark.sql.execution.datasources.PreWriteCheck$$anonfun$apply$3.apply(rules.scala:109)
    at org.apache.spark.sql.catalyst.trees.TreeNode.foreach(TreeNode.scala:111)
    at org.apache.spark.sql.execution.datasources.PreWriteCheck.apply(rules.scala:109)
    at org.apache.spark.sql.execution.datasources.PreWriteCheck.apply(rules.scala:105)
    at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$2.apply(CheckAnalysis.scala:218)
    at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$2.apply(CheckAnalysis.scala:218)
    at scala.collection.immutable.List.foreach(List.scala:318)

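My questions are:

Is my approach to changing the department of two employees correct?
Why am I getting this error when I have released the DataFrames?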

Answer 1

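Is my approach to changing the department of two employees correct?

It is not. To repeat something that has been said many times on Stack Overflow: Apache Spark is not a database. It is not designed for fine-grained updates. If your project requires operations like this, use one of the many databases available on Hadoop.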

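Why am I getting this error when I have released the DataFrames?

Because you haven't. All you have done is give a name to the execution plan. Checkpointing would be the closest thing to "releasing", but you really don't want to lose an executor in the middle of a destructive operation.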
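One way to check this is to print the plan behind the final DataFrame. This is a minimal sketch, assuming the Spark 1.x `sqlContext` and the `intermediate_result` temp table from the question. Since nothing has been materialized, the printed plan is expected to still contain a read of `employees`, which is what the pre-write check rejects.

```scala
// Assumes `sqlContext` and the temp table "intermediate_result" from the question.
val dfTmp = sqlContext.sql("select * from intermediate_result")

// Prints the logical and physical plans. registerTempTable only names the plan,
// and unpersist only drops cached data, so the source table should still show up here.
dfTmp.explain(true)
```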

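You could write to a temporary directory, delete the input, and move the temporary files into place, but really, just use a tool that fits the job.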
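The temporary-directory workaround could look roughly like the sketch below. It assumes the data is stored as plain Parquet files at a known path rather than in a metastore-managed table. The paths `/data/employees` and `/data/employees_tmp` are hypothetical, and `dfFinal` stands for the unioned DataFrame from the question, rebuilt on top of those files.

```scala
import org.apache.hadoop.fs.{FileSystem, Path}

// Hypothetical locations; adjust to wherever the data actually lives.
val inputPath = new Path("/data/employees")
val tmpPath   = new Path("/data/employees_tmp")

// 1. Materialize the result somewhere other than the input.
//    `dfFinal` is assumed to be built from sqlContext.read.parquet(inputPath.toString).
dfFinal.write.mode("overwrite").parquet(tmpPath.toString)

// 2. Only after the write has finished, drop the old files and move the new ones in.
val fs = FileSystem.get(sqlContext.sparkContext.hadoopConfiguration)
if (fs.delete(inputPath, true) && fs.rename(tmpPath, inputPath)) {
  println(s"Replaced $inputPath")
} else {
  sys.error(s"Swap failed; the new data may still be in $tmpPath")
}
```

The delete and rename are two separate steps, so a failure between them leaves the input path empty until the rename is retried. That is part of why this is a workaround and not a substitute for a store built for updates.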

Answer 2

Here is an approach you can try.

Instead of using the registerTempTable API, you can write the result into another table with the saveAsTable API:

dfFinal.write.mode("overwrite").saveAsTable("intermediate_result")

Then write it into the employees table:

val dy = sqlContext.table("intermediate_result")
dy.write.mode("overwrite").insertInto("employees")

Finally, drop the intermediate_result table.
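The cleanup step is not shown in the answer. One possible form, assuming `sqlContext` is backed by Hive (the usual setup when `saveAsTable` creates persistent tables in Spark 1.x), is a plain SQL drop:

```scala
// Assumes a Hive-backed `sqlContext`; removes the staging table once
// `employees` has been rewritten from it.
sqlContext.sql("DROP TABLE IF EXISTS intermediate_result")
```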

Answer 3

I would approach it this way,

>>> df = sqlContext.sql("select * from t")
>>> df.show()
+-------------+---------------+
|department_id|department_name|
+-------------+---------------+
|            2|        Fitness|
|            3|       Footwear|
|            4|        Apparel|
|            5|           Golf|
|            6|       Outdoors|
|            7|       Fan Shop|
+-------------+---------------+

To mimic your flow, I created two DataFrames, did a union, and wrote the result back to the same table t (deliberately removing department_id = 4 in this example).

Answer 4

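Let's say this is a Hive table that you are both reading from and overwriting.

Introduce a timestamp into the Hive table location, as follows: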

create table table_name (
  id                int,
  dtDontQuery       string,
  name              string
)
Location 'hdfs://user/table_name/timestamp'

Since overwriting in place is not possible, we write the output files to a new location.

Write the data to the new location using the DataFrame API:

df.write.orc("hdfs://user/xx/tablename/newtimestamp/")

Once the data is written, alter the Hive table location to point to the new location.
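The write and the location switch can be combined as in the sketch below. It assumes a Hive-backed `sqlContext`, a DataFrame `df` holding the new contents, and placeholder names taken from this answer (`table_name`, `hdfs://user/xx/tablename`). Using the current time in milliseconds as the timestamp suffix is an illustrative choice.

```scala
// Placeholder base directory from the answer; the suffix is a fresh timestamp.
val baseDir     = "hdfs://user/xx/tablename"
val newLocation = s"$baseDir/${System.currentTimeMillis()}"

// 1. Write the new data next to the old data, not over it.
df.write.orc(newLocation)

// 2. Repoint the Hive table at the freshly written directory.
sqlContext.sql(s"ALTER TABLE table_name SET LOCATION '$newLocation'")
```

The old directory is left untouched, so it can be deleted later once nothing reads from it.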
