I am reading a Hive table using Spark SQL and assigning it to a Scala val:
val x = sqlContext.sql("select * from some_table")
Then I do some processing on the DataFrame x and finally end up with a DataFrame y, which has the exact same schema as the table some_table.
Finally I try to overwrite the same Hive table some_table with the DataFrame y:
y.write.mode(SaveMode.Overwrite).saveAsTable().insertInto("some_table")
Then I get the error:
org.apache.spark.sql.AnalysisException: Cannot insert overwrite into table that is also being read from.
I tried building an insert SQL statement and firing it with sqlContext.sql(), but it gave me the same error.
Is there any way around this error? I need to insert the records back into the same table.
Hi, I tried it as suggested, but I am still getting the same error.
val x = sqlContext.sql("select * from incremental.test2")
val y = x.limit(5)
y.registerTempTable("temp_table")
val dy = sqlContext.table("temp_table")
dy.write.mode("overwrite").insertInto("incremental.test2")
org.apache.spark.sql.AnalysisException: Cannot insert overwrite into table that is also being read from.;
Answer 0 (score: 7)
You should first save the DataFrame y into a temporary table:
y.write.mode("overwrite").saveAsTable("temp_table")
Then you can overwrite rows in the target table:
val dy = sqlContext.table("temp_table")
dy.write.mode("overwrite").insertInto("some_table")
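Put end to end, a minimal sketch of this workaround (using the Spark 1.6-style sqlContext from the question; the table names and the limit(5) step are placeholders, not part of the original answer). Unlike the registerTempTable attempt in the question, saveAsTable materializes the data into a real staging table, so the lineage back to some_table is broken:

val x = sqlContext.sql("select * from some_table")
val y = x.limit(5)                                   // any processing
y.write.mode("overwrite").saveAsTable("temp_table")  // materialize into a staging table
val dy = sqlContext.table("temp_table")              // re-read: no lineage to some_table
dy.write.mode("overwrite").insertInto("some_table")  // overwrite the original table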
Answer 1 (score: 5)
Actually, you can also use checkpointing to achieve this. Since checkpointing breaks the data lineage, Spark is not able to detect that you are reading from and overwriting the same table.
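A minimal sketch of the checkpoint approach, assuming Spark 2.1+ where Dataset.checkpoint is available (the SparkSession name, checkpoint directory, table name, and limit(5) step are placeholders, not from the original answer):

spark.sparkContext.setCheckpointDir("/tmp/spark-checkpoints")  // placeholder path

val x = spark.sql("select * from some_table")
val y = x.limit(5)                  // any processing

val yCheckpointed = y.checkpoint()  // eager checkpoint: materializes y and truncates its lineage

yCheckpointed.write.mode("overwrite").insertInto("some_table")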
Answer 2 (score: 0)
You should first save the DataFrame y as a Parquet file (note that "temp_table" here is used as a file path, not a table name):
y.write.parquet("temp_table")
Then load it back like this:
val parquetFile = sqlContext.read.parquet("temp_table")
and finally insert the data into the table:
parquetFile.write.insertInto("some_table")
Answer 3 (score: 0)
In Spark 2.2, this error can be related to the following Spark-specific table properties (TBLPROPERTIES) in the Hive table definition; see the linked post for details:
spark.sql.partitionProvider
spark.sql.sources.provider
spark.sql.sources.schema.numPartCols
spark.sql.sources.schema.numParts
spark.sql.sources.schema.part.0
spark.sql.sources.schema.part.1
spark.sql.sources.schema.part.2
spark.sql.sources.schema.partCol.0
spark.sql.sources.schema.partCol.1
https://querydb.blogspot.com/2019/07/read-from-hive-table-and-write-back-to.html
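If it helps to verify whether a given table carries these properties, SHOW TBLPROPERTIES is standard Spark SQL (the table name below is a placeholder):

spark.sql("SHOW TBLPROPERTIES some_table").show(false)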
Answer 4 (score: 0)
Read the data from the Hive table in Spark (via HCatalog's HCatInputFormat):

import org.apache.hadoop.io.WritableComparable
import org.apache.hadoop.mapreduce.InputFormat
import org.apache.hive.hcatalog.data.HCatRecord
import org.apache.hive.hcatalog.mapreduce.HCatInputFormat

val hconfig = new org.apache.hadoop.conf.Configuration()
HCatInputFormat.setInput(hconfig, "dbname", "tablename")
val inputFormat = (new HCatInputFormat).asInstanceOf[InputFormat[WritableComparable[_], HCatRecord]].getClass
val data = sc.newAPIHadoopRDD(hconfig, inputFormat, classOf[WritableComparable[_]], classOf[HCatRecord])
Answer 5 (score: 0)
In a case like this you will also get the error: "Cannot overwrite a path that is also being read from".
It is like cutting off the branch you are sitting on :-(
Answer 6 (score: 0)
Before doing the following, keep in mind that the Hive table you are overwriting should have been created by Hive DDL and not by Spark (df.write.saveAsTable("<table_name>")).
If the above is not true, this will not work. I tested this in Spark 2.3.0:
val tableReadDf = spark.sql("select * from <dbName>.<tableName>")
val updatedDf = tableReadDf.<transformation>   // any update/delete/addition
updatedDf.createOrReplaceTempView("myUpdatedTable")
spark.sql("""with tempView as (select * from myUpdatedTable)
  insert overwrite table <dbName>.<tableName> <partition><partition_columns>
  select * from tempView""")