I can read the table right after creating it, but how do I read it again in another Spark session?
Given the code:

```
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .getOrCreate()

df = spark.read.parquet("examples/src/main/resources/users.parquet")
(df
    .write
    .saveAsTable("people_partitioned_bucketed"))

# retrieve rows from the table as expected
spark.sql("select * from people_partitioned_bucketed").show()
spark.stop()

# open a Spark session again
spark = SparkSession \
    .builder \
    .getOrCreate()

# the table does not exist this time
spark.sql("select * from people_partitioned_bucketed").show()
```
Execution result:

```
+------+----------------+--------------+
|  name|favorite_numbers|favorite_color|
+------+----------------+--------------+
|Alyssa|  [3, 9, 15, 20]|          null|
|   Ben|              []|           red|
+------+----------------+--------------+

Traceback (most recent call last):
  File "/home//workspace/spark/examples/src/main/python/sql/datasource.py", line 246, in <module>
    spark.sql("select * from people_partitioned_bucketed").show()
  File "/home//virtualenvs/spark/local/lib/python2.7/site-packages/pyspark/sql/session.py", line 603, in sql
    return DataFrame(self._jsparkSession.sql(sqlQuery), self._wrapped)
  File "/home//virtualenvs/spark/local/lib/python2.7/site-packages/py4j/java_gateway.py", line 1133, in __call__
    answer, self.gateway_client, self.target_id, self.name)
  File "/home//virtualenvs/spark/local/lib/python2.7/site-packages/pyspark/sql/utils.py", line 69, in deco
    raise AnalysisException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.AnalysisException: u'Table or view not found: people_partitioned_bucketed; line 1 pos 14'
```
Answer 0 (score: 1)
For file-based data sources such as text, parquet, and json, you can specify a custom table path via the path option, e.g. df.write.option("path", "/some/path").saveAsTable("t"). When the table is dropped, the custom table path is not removed and the table data is still there. If no custom table path is specified, Spark writes the data to a default table path under the warehouse directory, and when the table is dropped, that default path is removed as well.
In other words, you need to specify a path via the path option when saving the table. If no path is specified, the table is removed when the Spark session is closed.
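As a minimal sketch of that suggestion (the /tmp/tables/... path below is hypothetical), saving with an explicit path would look like this:

```
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.read.parquet("examples/src/main/resources/users.parquet")

# Hypothetical location; with an explicit "path" option the table data
# stays at this path instead of the session's default warehouse directory.
(df
    .write
    .option("path", "/tmp/tables/people_partitioned_bucketed")
    .saveAsTable("people_partitioned_bucketed"))
```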
Answer 1 (score: 1)
I had the same problem as you and couldn't find a solution anywhere. Then I read this and figured out a way.
Change the initialization of both SparkSession objects to:
```
from os.path import abspath
from pyspark.sql import SparkSession

warehouse_location = abspath('spark-warehouse')

spark = SparkSession.builder \
    .config("spark.sql.warehouse.dir", warehouse_location) \
    .enableHiveSupport() \
    .getOrCreate()
```
This initialization explicitly tells Spark where to look for the Hive tables and enables Hive support. You can change the location of the Hive tables (i.e. spark-warehouse) by changing the argument passed to abspath().
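For example (a minimal sketch, assuming the table was saved by an earlier session configured as above), a fresh session pointed at the same warehouse directory should find the table again:

```
from os.path import abspath
from pyspark.sql import SparkSession

warehouse_location = abspath('spark-warehouse')

# Reopen a session with the same warehouse directory and Hive support,
# then read the table that the previous session saved there.
spark = SparkSession.builder \
    .config("spark.sql.warehouse.dir", warehouse_location) \
    .enableHiveSupport() \
    .getOrCreate()

spark.sql("select * from people_partitioned_bucketed").show()
```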
PS: I don't know why Hive support has to be enabled explicitly, since saveAsTable() saves the DataFrame to a Hive table by default, nor why anyone should need to define the spark-warehouse location explicitly, since the default location is the current directory. Nevertheless, the solution above works :) (Is this a bug?)