How can I read a Spark table again in a new Spark session?

Asked: 2018-01-24 06:48:12

Tags: python apache-spark pyspark apache-spark-sql

I can read the table right after creating it, but how can I read it again in another Spark session?

Given the following code:

```python
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .getOrCreate()

df = spark.read.parquet("examples/src/main/resources/users.parquet")
(df
 .write
 .saveAsTable("people_partitioned_bucketed"))

# retrieve rows from table as expected
spark.sql("select * from people_partitioned_bucketed").show()

spark.stop()

# open spark session again
spark = SparkSession \
    .builder \
    .getOrCreate()

# the table does not exist this time
spark.sql("select * from people_partitioned_bucketed").show()

```

Execution result:

```
+------+----------------+--------------+
|  name|favorite_numbers|favorite_color|
+------+----------------+--------------+
|Alyssa|  [3, 9, 15, 20]|          null|
|   Ben|              []|           red|
+------+----------------+--------------+

Traceback (most recent call last):
  File "/home//workspace/spark/examples/src/main/python/sql/datasource.py", line 246, in <module>
    spark.sql("select * from people_partitioned_bucketed").show()
  File "/home//virtualenvs/spark/local/lib/python2.7/site-packages/pyspark/sql/session.py", line 603, in sql
    return DataFrame(self._jsparkSession.sql(sqlQuery), self._wrapped)
  File "/home//virtualenvs/spark/local/lib/python2.7/site-packages/py4j/java_gateway.py", line 1133, in __call__
    answer, self.gateway_client, self.target_id, self.name)
  File "/home//virtualenvs/spark/local/lib/python2.7/site-packages/pyspark/sql/utils.py", line 69, in deco
    raise AnalysisException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.AnalysisException: u'Table or view not found: people_partitioned_bucketed; line 1 pos 14'
```

2 Answers:

Answer 0 (score: 1)

See the documentation:

> For file-based data sources, e.g. text, parquet, json, etc., you can specify a custom table path via the path option, e.g. df.write.option("path", "/some/path").saveAsTable("t"). When the table is dropped, the custom table path will not be removed and the table data is still there. If no custom table path is specified, Spark will write data to a default table path under the warehouse directory. When the table is dropped, the default table path will be removed too.

In other words, you need to specify a custom path via the path option when saving the table. If no path is specified, the table is removed when the Spark session is closed.
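
As a minimal sketch of that suggestion, assuming a writable location such as `/tmp/spark-tables/people_partitioned_bucketed` (the path is only an example):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.read.parquet("examples/src/main/resources/users.parquet")

# Save with an explicit table path; the data at this path is kept
# even after the table is dropped or the session ends.
(df.write
   .option("path", "/tmp/spark-tables/people_partitioned_bucketed")
   .saveAsTable("people_partitioned_bucketed"))

spark.stop()
```

Even if a later session's catalog no longer lists the table, the Parquet files under that path remain on disk and can be read back directly with spark.read.parquet(...).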

Answer 1 (score: 1)

I had the same problem as you and could not find a solution anywhere. Then I read this and figured out a way.

Change the initialization of both SparkSession objects to:

```python
from os.path import abspath
from pyspark.sql import SparkSession

# Location of the Hive table data (the spark-warehouse directory)
warehouse_location = abspath('spark-warehouse')

spark = SparkSession.builder \
            .config("spark.sql.warehouse.dir", warehouse_location) \
            .enableHiveSupport() \
            .getOrCreate()
```

This initialization explicitly tells Spark where to look for the Hive tables and enables Hive support. You can change the location of the Hive tables (i.e. the spark-warehouse directory) by changing the argument passed to abspath().
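
As a rough sketch, assuming a table was already saved by an earlier session started from the same working directory, the second session would then look like this:

```python
from os.path import abspath
from pyspark.sql import SparkSession

warehouse_location = abspath('spark-warehouse')

# Re-open a session pointing at the same warehouse, with Hive support enabled
spark = SparkSession.builder \
    .config("spark.sql.warehouse.dir", warehouse_location) \
    .enableHiveSupport() \
    .getOrCreate()

# The table metadata is now resolved from the persistent metastore,
# so the table saved by the earlier session is visible again
spark.sql("select * from people_partitioned_bucketed").show()

spark.stop()
```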

PS: I don't know why Hive support needs to be enabled explicitly, since saveAsTable() saves the DataFrame to a Hive table by default, nor why anyone would need to define the spark-warehouse location explicitly, since the default location is the current directory. Nevertheless, the solution above works :) (Is this a bug?)