I'm using saveAsTable in pyspark (Spark 2.3.2), as follows:
df.write.format("parquet") \
    .sortBy("id") \
    .bucketBy(50, "some_column") \
    .option("path", "test_table.parquet") \
    .saveAsTable("test_table", mode="overwrite")
When the table already exists (hence mode "overwrite"), this fails with a NoSuchTableException:
org.apache.spark.sql.catalyst.analysis.NoSuchTableException: Table or view 'test_table' not found in database 'test_database';
at org.apache.spark.sql.catalyst.catalog.SessionCatalog.requireTableExists(SessionCatalog.scala:184)
at org.apache.spark.sql.catalyst.catalog.SessionCatalog.listPartitionsByFilter(SessionCatalog.scala:927)
at org.apache.spark.sql.execution.datasources.CatalogFileIndex.filterPartitions(CatalogFileIndex.scala:73)
at org.apache.spark.sql.execution.datasources.CatalogFileIndex.listFiles(CatalogFileIndex.scala:59)
at org.apache.spark.sql.execution.FileSourceScanExec.org$apache$spark$sql$execution$FileSourceScanExec$$selectedPartitions$lzycompute(DataSourceScanExec.scala:189)
at org.apache.spark.sql.execution.FileSourceScanExec.org$apache$spark$sql$execution$FileSourceScanExec$$selectedPartitions(DataSourceScanExec.scala:186)
at org.apache.spark.sql.execution.FileSourceScanExec.inputRDD$lzycompute(DataSourceScanExec.scala:308)
at org.apache.spark.sql.execution.FileSourceScanExec.inputRDD(DataSourceScanExec.scala:295)
at org.apache.spark.sql.execution.FileSourceScanExec.inputRDDs(DataSourceScanExec.scala:315)
at org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:605)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
at org.apache.spark.sql.execution.columnar.InMemoryRelation.buildBuffers(InMemoryRelation.scala:107)
at org.apache.spark.sql.execution.columnar.InMemoryRelation.<init>(InMemoryRelation.scala:102)
at org.apache.spark.sql.execution.columnar.InMemoryRelation$.apply(InMemoryRelation.scala:43)
at org.apache.spark.sql.execution.CacheManager.org$apache$spark$sql$execution$CacheManager$$recacheByCondition(CacheManager.scala:145)
at org.apache.spark.sql.execution.CacheManager$$anonfun$recacheByPath$1.apply$mcV$sp(CacheManager.scala:201)
at org.apache.spark.sql.execution.CacheManager$$anonfun$recacheByPath$1.apply(CacheManager.scala:194)
at org.apache.spark.sql.execution.CacheManager$$anonfun$recacheByPath$1.apply(CacheManager.scala:194)
at org.apache.spark.sql.execution.CacheManager.writeLock(CacheManager.scala:67)
at org.apache.spark.sql.execution.CacheManager.recacheByPath(CacheManager.scala:194)
at org.apache.spark.sql.internal.CatalogImpl.refreshByPath(CatalogImpl.scala:508)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:174)
at org.apache.spark.sql.execution.datasources.DataSource.writeAndRead(DataSource.scala:532)
at org.apache.spark.sql.execution.command.CreateDataSourceTableAsSelectCommand.saveDataIntoTable(createDataSourceTables.scala:216)
at org.apache.spark.sql.execution.command.CreateDataSourceTableAsSelectCommand.run(createDataSourceTables.scala:176)
at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:104)
at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:102)
at org.apache.spark.sql.execution.command.DataWritingCommandExec.doExecute(commands.scala:122)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80)
at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:656)
at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:656)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:77)
at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:656)
at org.apache.spark.sql.DataFrameWriter.createTable(DataFrameWriter.scala:458)
at org.apache.spark.sql.DataFrameWriter.saveAsTable(DataFrameWriter.scala:433)
at org.apache.spark.sql.DataFrameWriter.saveAsTable(DataFrameWriter.scala:393)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)
It looks like the existing table is dropped successfully, but the subsequent attempt to create the new table seems to require that the table still exists (see the second line of the stack trace). Is this a bug, or am I missing something?
Answer 0 (score: 0)
In Spark 2.4 this appears to work fine. The gist: create a sample dataframe and write it to Hive through Spark.
from pyspark.sql import Row

# 'sc' and 'spark' are the SparkContext and SparkSession provided by the pyspark shell.
l = [('Ankit',25),('Jalfaizy',22),('Magesh',20),('Bala',26)]
rdd = sc.parallelize(l)
people = rdd.map(lambda x: Row(name=x[0], age=int(x[1])))
schemaPeople = spark.createDataFrame(people)
schemaPeople.write.format("parquet").saveAsTable("test_table_spark", mode="overwrite")
After the write succeeds, check the Hive table to verify it, then modify the dataframe and run the same saveAsTable again; the existing data is overwritten by the new dataframe:
l = [('Ankit',25),('Jalfaizy',22),('Suresh',20),('Bala',26)]
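For completeness, a minimal sketch of that second overwrite: it simply reuses the Row-mapping code from the snippet above, with the variable names carried over.

# Rebuild the dataframe from the modified list l and overwrite the existing table.
rdd = sc.parallelize(l)
people = rdd.map(lambda x: Row(name=x[0], age=int(x[1])))
schemaPeople = spark.createDataFrame(people)
schemaPeople.write.format("parquet").saveAsTable("test_table_spark", mode="overwrite")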
Please try this in your Spark shell and see whether it works.
Trying the same thing with an external Hive table:
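The external table EXT_Table_Test queried below is assumed to already exist. A hypothetical DDL along these lines would create such a table (column names, storage format, and HDFS location are guesses based on the rest of this answer; the original table may well have been created directly from the Hive shell):

# Hypothetical setup only; requires a Hive-enabled SparkSession.
spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS EXT_Table_Test (age INT, name STRING)
    STORED AS PARQUET
    LOCATION 'hdfs://path/tables/EXT_Table_Test'
""")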
>>> schemaPeople.show()
+---+--------+
|age| name|
+---+--------+
| 25| Ankit|
| 22|Jalfaizy|
| 20| Suresh|
| 26| Bala|
+---+--------+
>>> spark.sql("SELECT * FROM EXT_Table_Test").show()
+---+--------+
|age| name|
+---+--------+
| 25| Ankit|
| 22|Jalfaizy|
| 20| Magesh|
| 26| Bala|
+---+--------+
>>> schemaPeople.write.format("parquet") \
... .option("path", "hdfs://path/tables/EXT_Table_Test") \
... .saveAsTable("test_table", mode="overwrite")
Reading the updated table again fails with the following error:
>>> spark.sql("SELECT * FROM EXT_Table_Test").show()
Caused by: java.io.FileNotFoundException: File does not exist: hdfs:///tables/EXT_Table_Test/000000_0
It is possible the underlying files have been updated. You can explicitly invalidate the cache in Spark by running 'REFRESH TABLE tableName' command in SQL or by recreating the Dataset/DataFrame involved.
After running REFRESH TABLE, reading from Spark succeeds, although even before the refresh I was able to see the updated data from the Hive shell.
>>> spark.sql("REFRESH TABLE EXT_Table_Test")
DataFrame[]
>>> spark.sql("SELECT * FROM EXT_Table_Test").show()
+---+--------+
|age| name|
+---+--------+
| 25| Ankit|
| 22|Jalfaizy|
| 20| Suresh|
| 26| Bala|
+---+--------+
Answer 1 (score: 0)
In Spark 2.4, creating a table with overwrite fails in this way. To fix it, set the flag spark.sql.legacy.allowCreatingManagedTableUsingNonemptyLocation to true.
In pyspark, use the following command:
spark.conf.set("spark.sql.legacy.allowCreatingManagedTableUsingNonemptyLocation","true")
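The same flag can also be set once when the SparkSession is built, for example (a sketch; the app name is arbitrary and the rest should be adapted to your own session setup):

from pyspark.sql import SparkSession

# Enable the legacy behaviour before any table writes happen.
spark = SparkSession.builder \
    .appName("overwrite-existing-table") \
    .config("spark.sql.legacy.allowCreatingManagedTableUsingNonemptyLocation", "true") \
    .enableHiveSupport() \
    .getOrCreate()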