Pyspark: selecting data from a remote Hive Server

Posted: 2017-09-04 11:44:43

Tags: python hadoop hive pyspark

I am trying to read and write data stored in a remote Hive Server from Pyspark. I followed this example:

from os.path import expanduser, join, abspath

from pyspark.sql import SparkSession
from pyspark.sql import Row

# warehouse_location points to the default location for managed databases and tables
warehouse_location = 'hdfs://quickstart.cloudera:8020/user/hive/warehouse'

spark = SparkSession \
    .builder \
    .appName("Python Spark SQL Hive integration example") \
    .config("spark.sql.warehouse.dir", warehouse_location) \
    .enableHiveSupport() \
    .getOrCreate()

The example shows how to create a new table in the warehouse:

# spark is an existing SparkSession
spark.sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING) USING hive")
spark.sql("LOAD DATA LOCAL INPATH 'examples/src/main/resources/kv1.txt' INTO TABLE src")

# Queries are expressed in HiveQL
spark.sql("SELECT * FROM src").show()

However, I need to access an existing table iris in the mytest.db database, so the table location is

table_path = warehouse_location + '/mytest.db/iris'

How do I select from the existing table?

UPDATE

I have the Metastore URL:

http://test.mysite.net:8888/metastore/table/mytest/iris

and the table location URL:

hdfs://quickstart.cloudera:8020/user/hive/warehouse/mytest.db/iris

Using hdfs://quickstart.cloudera:8020/user/hive/warehouse as the warehouse location in the code above and trying:

spark.sql("use mytest")

I get the exception:

    raise AnalysisException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.AnalysisException: "Database 'mytest' not found;"
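An exception like this usually means Spark is not talking to the remote Hive metastore at all: setting spark.sql.warehouse.dir alone only changes where Spark writes data, and Spark then falls back to a local embedded metastore that knows nothing about mytest. A sketch of the missing configuration, assuming the Hive metastore thrift service listens on quickstart.cloudera at the default port 9083 (check hive.metastore.uris in the server's hive-site.xml for the real host and port):

```properties
# spark-defaults.conf -- sketch; thrift://quickstart.cloudera:9083 is an
# assumption (the default metastore port), verify against hive-site.xml.
spark.sql.warehouse.dir           hdfs://quickstart.cloudera:8020/user/hive/warehouse
spark.hadoop.hive.metastore.uris  thrift://quickstart.cloudera:9083
```

Equivalently, the property can be set on the builder with .config("hive.metastore.uris", "thrift://quickstart.cloudera:9083") before enableHiveSupport(), or by copying the cluster's hive-site.xml into Spark's conf/ directory. Once Spark can reach the remote metastore, spark.sql("SHOW DATABASES").show() should list mytest.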

What is the correct URL to select from iris?

1 answer:

Answer 0: (score: 0)

You can call the table directly using

spark.sql("SELECT * FROM mytest.iris")

or specify the database to use with

spark.sql("use mytest")
spark.sql("SELECT * FROM iris")