Question

I am reading a Hive table using Spark SQL and assigning it to a Scala val:

val x = sqlContext.sql("select * from some_table")

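I then do some processing on the DataFrame x and end up with a DataFrame y that has exactly the same schema as the table some_table.

Finally, I try to insert-overwrite the DataFrame y into the same Hive table some_table: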

import org.apache.spark.sql.SaveMode
y.write.mode(SaveMode.Overwrite).insertInto("some_table")

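Then I get the error:

org.apache.spark.sql.AnalysisException: Cannot insert overwrite into table that is also being read from.

I tried building an INSERT SQL statement and running it with sqlContext.sql(), but it gave me the same error.

Is there any way to get around this error? I need to insert the records back into the same table.

Hi, I tried what was suggested, but I still get the same error.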

val x = sqlContext.sql("select * from incremental.test2")
val y = x.limit(5)
y.registerTempTable("temp_table")
val dy = sqlContext.table("temp_table")
dy.write.mode("overwrite").insertInto("incremental.test2")

scala> dy.write.mode("overwrite").insertInto("incremental.test2")
             org.apache.spark.sql.AnalysisException: Cannot insert overwrite into table that is also being read from.;

Answer 1

You should first save the DataFrame y to a temporary table:

y.write.mode("overwrite").saveAsTable("temp_table")

Then you can overwrite the rows in the target table:

val dy = sqlContext.table("temp_table")
dy.write.mode("overwrite").insertInto("some_table")

Answer 2

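Actually, you can also use checkpointing to achieve this. Because it breaks the data lineage, Spark cannot detect that you are reading from and overwriting the same table.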
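
The checkpointing idea described in this answer could look roughly like the sketch below. It is not the answerer's original code: it assumes Spark 2.1 or later (where Dataset.checkpoint() is available) and a SparkSession with Hive support. The application name, the checkpoint directory and the limit(5) transformation are hypothetical placeholders.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: Spark 2.1+ with Hive support; app name and checkpoint path are hypothetical.
val spark = SparkSession.builder()
  .appName("overwrite-same-table")
  .enableHiveSupport()
  .getOrCreate()

// checkpoint() requires a checkpoint directory; on a cluster use reliable storage such as HDFS.
spark.sparkContext.setCheckpointDir("/tmp/spark-checkpoints")

val x = spark.table("some_table")
val y = x.limit(5) // stand-in for the real processing

// Eagerly materializes y and truncates its lineage,
// so the resulting plan no longer references some_table.
val detached = y.checkpoint()

detached.write.mode("overwrite").insertInto("some_table")
```

Because checkpoint() is eager by default, the data is written to the checkpoint directory before the overwrite starts, which is what makes it safe to replace the source table afterwards.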


Answer 3

You should first save the DataFrame y as a Parquet file:

y.write.parquet("temp_table")

Then load it back like this:

val parquetFile = sqlContext.read.parquet("temp_table")

Finally, insert the data into the table:

parquetFile.write.insertInto("some_table")

Answer 4

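In Spark 2.2:

This error means that our process is reading from and writing to the same table.
Normally this should work, because the process writes to the directory .hiveStaging ...
The error occurs with the saveAsTable method, because it overwrites the entire table rather than individual partitions.
It does not occur with the insertInto method, because that overwrites partitions rather than the table.
The reason this happens is that the Hive table has the following Spark TBLProperties in its definition. The suggested fix is to remove these properties: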

'spark.sql.partitionProvider'
'spark.sql.sources.provider'
'spark.sql.sources.schema.numPartCols'
'spark.sql.sources.schema.numParts'
'spark.sql.sources.schema.part.0'
'spark.sql.sources.schema.part.1'
'spark.sql.sources.schema.part.2'
'spark.sql.sources.schema.partCol.0'
'spark.sql.sources.schema.partCol.1'

https://querydb.blogspot.com/2019/07/read-from-hive-table-and-write-back-to.html

Answer 5

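Read the data from the Hive table in Spark:

import org.apache.hadoop.io.WritableComparable
import org.apache.hadoop.mapreduce.InputFormat
import org.apache.hive.hcatalog.data.HCatRecord
import org.apache.hive.hcatalog.mapreduce.HCatInputFormat

val hconfig = new org.apache.hadoop.conf.Configuration()
HCatInputFormat.setInput(hconfig, "dbname", "tablename")

val inputFormat = (new HCatInputFormat).asInstanceOf[InputFormat[WritableComparable[_], HCatRecord]].getClass
val data = sc.newAPIHadoopRDD(hconfig, inputFormat, classOf[WritableComparable[_]], classOf[HCatRecord])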

Answer 6

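You will also get the error "Cannot overwrite a path that is also being read from" in the following situation:

You are doing an "insert overwrite" into Hive table "A" from a view "V" (which executes your logic), and that view also references the same table "A". I found this out the hard way, because the view is deeply nested code that also queries "A". Bummer.

It is like cutting off the branch you are sitting on :-(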

Answer 7

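Before doing the following, keep in mind that the Hive table you are overwriting should have been created by Hive DDL, and not by Spark (df.write.saveAsTable("<table_name>")).

If that is not the case, this approach will not work. I tested it in Spark 2.3.0.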

val tableReadDf=spark.sql("select * from <dbName>.<tableName>")
val updatedDf=tableReadDf.<transformation> //any update/delete/addition 
updatedDf.createOrReplaceTempView("myUpdatedTable")
spark.sql("""with tempView as(select * from myUpdatedTable) insert overwrite table 
<dbName>.<tableName> <partition><partition_columns> select * from tempView""")

Read from a Hive table and write back to it using Spark SQL
