Spark Hive reports pyspark.sql.utils.AnalysisException: u'Table not found: XXX' when run on yarn-cluster

Date: 2016-12-21 13:16:10

Tags: apache-spark hive ibm-cloud yarn biginsights

I am trying to run a pyspark script that accesses a Hive table on BigInsights on Cloud 4.2 Enterprise.

First I create a Hive table:

[biadmin@bi4c-xxxxx-mastermanager ~]$ hive
hive> CREATE TABLE pokes (foo INT, bar STRING);
OK
Time taken: 2.147 seconds
hive> LOAD DATA LOCAL INPATH '/usr/iop/4.2.0.0/hive/doc/examples/files/kv1.txt' OVERWRITE INTO TABLE pokes;
Loading data to table default.pokes
Table default.pokes stats: [numFiles=1, numRows=0, totalSize=5812, rawDataSize=0]
OK
Time taken: 0.49 seconds
hive> 

Then I create a simple pyspark script:

[biadmin@bi4c-xxxxxx-mastermanager ~]$ cat test_pokes.py
from pyspark import SparkContext

sc = SparkContext()

from pyspark.sql import HiveContext
hc = HiveContext(sc)

pokesRdd = hc.sql('select * from pokes')
print( pokesRdd.collect() )

I then try to execute it:

[biadmin@bi4c-xxxxxx-mastermanager ~]$ spark-submit \
    --master yarn-cluster \
    --deploy-mode cluster \
    --jars /usr/iop/4.2.0.0/hive/lib/datanucleus-api-jdo-3.2.6.jar, \
           /usr/iop/4.2.0.0/hive/lib/datanucleus-core-3.2.10.jar, \
           /usr/iop/4.2.0.0/hive/lib/datanucleus-rdbms-3.2.9.jar \
    test_pokes.py

However, I run into this error:

Traceback (most recent call last):
  File "test_pokes.py", line 8, in <module>
    pokesRdd = hc.sql('select * from pokes')
  File "/disk6/local/usercache/biadmin/appcache/application_1477084339086_0481/container_e09_1477084339086_0481_01_000001/pyspark.zip/pyspark/sql/context.py", line 580, in sql
  File "/disk6/local/usercache/biadmin/appcache/application_1477084339086_0481/container_e09_1477084339086_0481_01_000001/py4j-0.9-src.zip/py4j/java_gateway.py", line 813, in __call__
  File "/disk6/local/usercache/biadmin/appcache/application_1477084339086_0481/container_e09_1477084339086_0481_01_000001/pyspark.zip/pyspark/sql/utils.py", line 51, in deco
pyspark.sql.utils.AnalysisException: u'Table not found: pokes; line 1 pos 14'
End of LogType:stdout

If I run spark-submit standalone, I can see that the table exists fine:

[biadmin@bi4c-xxxxxx-mastermanager ~]$ spark-submit test_pokes.py
…
…
16/12/21 13:09:13 INFO Executor: Finished task 0.0 in stage 0.0 (TID 0). 18962 bytes result sent to driver
16/12/21 13:09:13 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 168 ms on localhost (1/1)
16/12/21 13:09:13 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
16/12/21 13:09:13 INFO DAGScheduler: ResultStage 0 (collect at /home/biadmin/test_pokes.py:9) finished in 0.179 s
16/12/21 13:09:13 INFO DAGScheduler: Job 0 finished: collect at /home/biadmin/test_pokes.py:9, took 0.236558 s
[Row(foo=238, bar=u'val_238'), Row(foo=86, bar=u'val_86'), Row(foo=311, bar=u'val_311')
…
…

See my previous question related to this issue: hive spark yarn-cluster job fails with: "ClassNotFoundException: org.datanucleus.api.jdo.JDOPersistenceManagerFactory"

This question is similar to this other question: Spark can access Hive table from pyspark but not from spark-submit. However, unlike that one, I am using HiveContext.

Update: see here for the final solution: https://stackoverflow.com/a/41272260/1033422

3 Answers:

Answer 0 (score: 4):

This is because the spark-submit job cannot find hive-site.xml, so it cannot connect to the Hive metastore. Add --files /usr/iop/4.2.0.0/hive/conf/hive-site.xml to your spark-submit command.
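Applied to the command from the question, the full invocation would look roughly like the sketch below. It only rearranges the paths already shown in the question; note that the comma-separated --jars list must not contain spaces or line breaks between entries.

spark-submit \
    --master yarn-cluster \
    --deploy-mode cluster \
    --files /usr/iop/4.2.0.0/hive/conf/hive-site.xml \
    --jars /usr/iop/4.2.0.0/hive/lib/datanucleus-api-jdo-3.2.6.jar,/usr/iop/4.2.0.0/hive/lib/datanucleus-core-3.2.10.jar,/usr/iop/4.2.0.0/hive/lib/datanucleus-rdbms-3.2.9.jar \
    test_pokes.py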

Answer 1 (score: 2):

You appear to be affected by this bug: https://issues.apache.org/jira/browse/SPARK-15345



I ran into a similar problem with Spark 1.6.2 and 2.0.0 on HDP-2.5.0.0. My goal was to create a DataFrame from a Hive SQL query under these conditions:

  • the Python API,
  • cluster deploy-mode (the driver running on one of the executor nodes),
  • YARN managing the executor JVMs (rather than a standalone Spark master instance).

Initial tests gave these results:

  1. spark-submit --deploy-mode client --master local ... => WORKING
  2. spark-submit --deploy-mode client --master yarn ... => WORKING
  3. spark-submit --deploy-mode cluster --master yarn ... => NOT WORKING
  4. In case #3, the driver running on one of the executor nodes could not find the database. The error was:

    pyspark.sql.utils.AnalysisException: 'Table or view not found: `database_name`.`table_name`; line 1 pos 14'
    



    The answer listed above by Fokko Driesprong worked for me.
    With the command listed below, the driver running on an executor node was able to access Hive tables in a database other than default:

    $ /usr/hdp/current/spark2-client/bin/spark-submit \
    --deploy-mode cluster --master yarn \
    --files /usr/hdp/current/spark2-client/conf/hive-site.xml \
    /path/to/python/code.py
    



    The Python code I used to test both Spark 1.6.2 and Spark 2.0.0 is below (change SPARK_VERSION to 1 to test with Spark 1.6.2, and update the paths in the spark-submit command accordingly):

    SPARK_VERSION = 2
    APP_NAME = 'spark-sql-python-test_SV,' + str(SPARK_VERSION)
    
    
    
    def spark1():
        from pyspark.sql import HiveContext
        from pyspark import SparkContext, SparkConf
    
        conf = SparkConf().setAppName(APP_NAME)
        sc = SparkContext(conf=conf)
        hc = HiveContext(sc)
    
        query = 'select * from database_name.table_name limit 5'
        df = hc.sql(query)
        printout(df)
    
    
    
    
    def spark2():
        from pyspark.sql import SparkSession
        spark = SparkSession.builder.appName(APP_NAME).enableHiveSupport().getOrCreate()
        query = 'select * from database_name.table_name limit 5'
        df = spark.sql(query)
        printout(df)
    
    
    
    
    def printout(df):
        print('\n########################################################################')
        df.show()
        print(df.count())
    
        df_list = df.collect()
        print(df_list)
        print(df_list[0])
        print(df_list[1])
        print('########################################################################\n')
    
    
    
    
    def main():
        if SPARK_VERSION == 1:
            spark1()
        elif SPARK_VERSION == 2:
            spark2()
    
    
    
    
    if __name__ == '__main__':
        main()
    

Answer 2 (score: 0):

The accepted answer did not work for me
(--files /usr/iop/4.2.0.0/hive/conf/hive-site.xml).

Adding the following code at the top of the code file solved it:

import findspark
findspark.init('/usr/share/spark-2.4')  # for 2.4
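For context, a minimal sketch of how this typically fits together with a Hive-enabled session (the Spark path comes from this answer and the table name from the question; findspark.init must run before anything from pyspark is imported):

import findspark
findspark.init('/usr/share/spark-2.4')  # point findspark at the Spark install before importing pyspark

from pyspark.sql import SparkSession

# Hive support must be enabled for spark.sql() to resolve tables in the metastore
spark = SparkSession.builder \
    .appName('test_pokes') \
    .enableHiveSupport() \
    .getOrCreate()

print(spark.sql('select * from pokes').collect())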