Spark HiveContext won't load a DataFrame

Date: 2015-12-08 02:05:07

Tags: python apache-spark dataframe pyspark

I am trying to use the Window functions on Spark DataFrames. I understand that this requires a HiveContext (and therefore Hive), so I built Spark with the following command:

./make-distribution.sh --name custom-spark --tgz -Phadoop-2.6 -Pyarn -Phive -Phive-thriftserver

However, when I try to create a HiveContext from Python, I get the following error:

  
    

You must build Spark with Hive. Export 'SPARK_HIVE=true' and run build/sbt assembly", Py4JJavaError(u'An error occurred while calling None.org.apache.spark.sql.hive.HiveContext.\n', JavaObject id=o264))

  

When I print the details of the Hive error with sqlContext._get_hive_ctx(), I get:

  
    

Py4JJavaError: An error occurred while calling None.org.apache.spark.sql.hive.HiveContext.
: java.lang.RuntimeException: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient
    at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:522)
    at org.apache.spark.sql.hive.client.ClientWrapper.<init>(ClientWrapper.scala:171)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
    at org.apache.spark.sql.hive.client.IsolatedClientLoader.liftedTree1$1(IsolatedClientLoader.scala:183)
    at org.apache.spark.sql.hive.client.IsolatedClientLoader.<init>(IsolatedClientLoader.scala:179)
    at org.apache.spark.sql.hive.HiveContext.metadataHive$lzycompute(HiveContext.scala:226)
    at org.apache.spark.sql.hive.HiveContext.metadataHive(HiveContext.scala:185)
    at org.apache.spark.sql.hive.HiveContext.setConf(HiveContext.scala:392)
    at org.apache.spark.sql.hive.HiveContext.defaultOverrides(HiveContext.scala:174)
    at org.apache.spark.sql.hive.HiveContext.<init>(HiveContext.scala:177)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:234)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
    at py4j.Gateway.invoke(Gateway.java:214)
    at py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:79)
    at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:68)
    at py4j.GatewayConnection.run(GatewayConnection.java:207)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient
    at org.apache.hadoop.hive.metastore.MetaStoreUtils.newInstance(MetaStoreUtils.java:1523)
    at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.<init>(RetryingMetaStoreClient.java:86)
    at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.getProxy(RetryingMetaStoreClient.java:132)
    at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.getProxy(RetryingMetaStoreClient.java:104)
    at org.apache.hadoop.hive.ql.metadata.Hive.createMetaStoreClient(Hive.java:3005)
    at org.apache.hadoop.hive.ql.metadata.Hive.getMSC(Hive.java:3024)
    at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:503)
    ... 23 more
Caused by: java.lang.reflect.InvocationTargetException
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
    at org.apache.hadoop.hive.metastore.MetaStoreUtils.newInstance(MetaStoreUtils.java:1521)
    ... 29 more
Caused by: javax.jdo.JDOFatalDataStoreException: Failed to start database 'metastore_db' with class loader org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1@7ffe486b, see the next exception for details.
NestedThrowables:
java.sql.SQLException: Failed to start database 'metastore_db' with class loader org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1@7ffe486b, see the next exception for details.
    at org.datanucleus.api.jdo.NucleusJDOHelper.getJDOExceptionForNucleusException(NucleusJDOHelper.java:436)
    at org.datanucleus.api.jdo.JDOPersistenceManagerFactory.freezeConfiguration(JDOPersistenceManagerFactory.java:788)
    at org.datanucleus.api.jdo.JDOPersistenceManagerFactory.createPersistenceManagerFactory(JDOPersistenceManagerFactory.java:333)
    at org.datanucleus.api.jdo.JDOPersistenceManagerFactory.getPersistenceManagerFactory(JDOPersistenceManagerFactory.java:202)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at javax.jdo.JDOHelper$16.run(JDOHelper.java:1965)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.jdo.JDOHelper.invoke(JDOHelper.java:1960)
    at javax.jdo.JDOHelper.invokeGetPersistenceManagerFactoryOnImplementation(JDOHelper.java:1166)
    at javax.jdo.JDOHelper.getPersistenceManagerFactory(JDOHelper.java:808)
    at javax.jdo.JDOHelper.getPersistenceManagerFactory(JDOHelper.java:701)
    at org.apache.hadoop.hive.metastore.ObjectStore.getPMF(ObjectStore.java:365)
    at org.apache.hadoop.hive.metastore.ObjectStore.getPersistenceManager(ObjectStore.java:394)
    at org.apache.hadoop.hive.metastore.ObjectStore.initialize(ObjectStore.java:291)
    at org.apache.hadoop.hive.metastore.ObjectStore.setConf(ObjectStore.java:258)
    at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:73)
    at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:133)
    at org.apache.hadoop.hive.metastore.RawStoreProxy.<init>(RawStoreProxy.java:57)
    at org.apache.hadoop.hive.metastore.RawStoreProxy.getProxy(RawStoreProxy.java:66)
    at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.newRawStore(HiveMetaStore.java:593)
    at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.getMS(HiveMetaStore.java:571)
    at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.createDefaultDB(HiveMetaStore.java:624)
    at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.init(HiveMetaStore.java:461)
    at org.apache.hadoop.hive.metastore.RetryingHMSHandler.<init>(RetryingHMSHandler.java:66)
    at org.apache.hadoop.hive.metastore.RetryingHMSHandler.getProxy(RetryingHMSHandler.java:72)
    at org.apache.hadoop.hive.metastore.HiveMetaStore.newRetryingHMSHandler(HiveMetaStore.java:5762)
    at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.<init>(HiveMetaStoreClient.java:199)
    at org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient.<init>(SessionHiveMetaStoreClient.java:74)
    ... 34 more
Caused by: java.sql.SQLException: Failed to start database 'metastore_db' with class loader org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1@7ffe486b, see the next exception for details.
    at org.apache.derby.impl.jdbc.SQLExceptionFactory40.getSQLException(Unknown Source)
    at org.apache.derby.impl.jdbc.Util.newEmbedSQLException(Unknown Source)
    at org.apache.derby.impl.jdbc.Util.seeNextException(Unknown Source)
    at org.apache.derby.impl.jdbc.EmbedConnection.bootDatabase(Unknown Source)
    at org.apache.derby.impl.jdbc.EmbedConnection.<init>(Unknown Source)
    at org.apache.derby.impl.jdbc.EmbedConnection40.<init>(Unknown Source)
    at org.apache.derby.jdbc.Driver40.getNewEmbedConnection(Unknown Source)
    at org.apache.derby.jdbc.InternalDriver.connect(Unknown Source)
    at org.apache.derby.jdbc.Driver20.connect(Unknown Source)
    at org.apache.derby.jdbc.AutoloadedDriver.connect(Unknown Source)
    at java.sql.DriverManager.getConnection(DriverManager.java:571)
    at java.sql.DriverManager.getConnection(DriverManager.java:187)
    at org.apache.commons.dbcp.DriverManagerConnectionFactory.createConnection(DriverManagerConnectionFactory.java:78)
    at org.apache.commons.dbcp.PoolableConnectionFactory.makeObject(PoolableConnectionFactory.java:582)
    at org.apache.commons.pool.impl.GenericObjectPool.borrowObject(GenericObjectPool.java:1148)
    at org.apache.commons.dbcp.PoolingDataSource.getConnection(PoolingDataSource.java:106)
    at org.datanucleus.store.rdbms.ConnectionFactoryImpl$ManagedConnectionImpl.getConnection(ConnectionFactoryImpl.java:501)
    at org.datanucleus.store.rdbms.RDBMSStoreManager.<init>(RDBMSStoreManager.java:298)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
    at org.datanucleus.plugin.NonManagedPluginRegistry.createExecutableExtension(NonManagedPluginRegistry.java:631)
    at org.datanucleus.plugin.PluginManager.createExecutableExtension(PluginManager.java:301)
    at org.datanucleus.NucleusContext.createStoreManagerForProperties(NucleusContext.java:1187)
    at org.datanucleus.NucleusContext.initialise(NucleusContext.java:356)
    at org.datanucleus.api.jdo.JDOPersistenceManagerFactory.freezeConfiguration(JDOPersistenceManagerFactory.java:775)
    ... 63 more
Caused by: java.sql.SQLException: Failed to start database 'metastore_db' with class loader org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1@7ffe486b, see the next exception for details.
    at org.apache.derby.impl.jdbc.SQLExceptionFactory.getSQLException(Unknown Source)
    at org.apache.derby.impl.jdbc.SQLExceptionFactory40.wrapArgsForTransportAcrossDRDA(Unknown Source)
    ... 90 more
Caused by: java.sql.SQLException: Another instance of Derby may have already booted the database /Applications/spark-1.5.2/metastore_db.
    at org.apache.derby.impl.jdbc.SQLExceptionFactory.getSQLException(Unknown Source)
    at org.apache.derby.impl.jdbc.SQLExceptionFactory40.wrapArgsForTransportAcrossDRDA(Unknown Source)
    at org.apache.derby.impl.jdbc.SQLExceptionFactory40.getSQLException(Unknown Source)
    at org.apache.derby.impl.jdbc.Util.generateCsSQLException(Unknown Source)
    ... 87 more
Caused by: ERROR XSDB6: Another instance of Derby may have already booted the database /Applications/spark-1.5.2/metastore_db.
    at org.apache.derby.iapi.error.StandardException.newException(Unknown Source)
    at org.apache.derby.impl.store.raw.data.BaseDataFileFactory.privGetJBMSLockOnDB(Unknown Source)
    at org.apache.derby.impl.store.raw.data.BaseDataFileFactory.run(Unknown Source)
    at java.security.AccessController.doPrivileged(Native Method)
    at org.apache.derby.impl.store.raw.data.BaseDataFileFactory.getJBMSLockOnDB(Unknown Source)
    at org.apache.derby.impl.store.raw.data.BaseDataFileFactory.boot(Unknown Source)
    at org.apache.derby.impl.services.monitor.BaseMonitor.boot(Unknown Source)
    at org.apache.derby.impl.services.monitor.TopService.bootModule(Unknown Source)
    at org.apache.derby.impl.services.monitor.BaseMonitor.startModule(Unknown Source)
    at org.apache.derby.iapi.services.monitor.Monitor.bootServiceModule(Unknown Source)
    at org.apache.derby.impl.store.raw.RawStore.boot(Unknown Source)
    at org.apache.derby.impl.services.monitor.BaseMonitor.boot(Unknown Source)
    at org.apache.derby.impl.services.monitor.TopService.bootModule(Unknown Source)
    at org.apache.derby.impl.services.monitor.BaseMonitor.startModule(Unknown Source)
    at org.apache.derby.iapi.services.monitor.Monitor.bootServiceModule(Unknown Source)
    at org.apache.derby.impl.store.access.RAMAccessManager.boot(Unknown Source)
    at org.apache.derby.impl.services.monitor.BaseMonitor.boot(Unknown Source)
    at org.apache.derby.impl.services.monitor.TopService.bootModule(Unknown Source)
    at org.apache.derby.impl.services.monitor.BaseMonitor.startModule(Unknown Source)
    at org.apache.derby.iapi.services.monitor.Monitor.bootServiceModule(Unknown Source)
    at org.apache.derby.impl.db.BasicDatabase.bootStore(Unknown Source)
    at org.apache.derby.impl.db.BasicDatabase.boot(Unknown Source)
    at org.apache.derby.impl.services.monitor.BaseMonitor.boot(Unknown Source)
    at org.apache.derby.impl.services.monitor.TopService.bootModule(Unknown Source)
    at org.apache.derby.impl.services.monitor.BaseMonitor.bootService(Unknown Source)
    at org.apache.derby.impl.services.monitor.BaseMonitor.startProviderService(Unknown Source)
    at org.apache.derby.impl.services.monitor.BaseMonitor.findProviderAndStartService(Unknown Source)
    at org.apache.derby.impl.services.monitor.BaseMonitor.startPersistentService(Unknown Source)
    at org.apache.derby.iapi.services.monitor.Monitor.startPersistentService(Unknown Source)

  

I am running Spark 1.5.2 and calling it from IPython. For reference, here is the code that produces the error:

from __future__ import print_function

import os
import sys
import pandas as pd
import time

from pyspark import SparkConf, SparkContext
from pyspark.sql import HiveContext, SQLContext
from pyspark.sql.types import Row, StructField, StructType, StringType, IntegerType, DoubleType
from pyspark import context
from pyspark.sql import functions as F
from pyspark.sql.window import Window

from pyspark.sql.functions import *

sc = SparkContext(appName="Bench")
sqlContext = HiveContext(sc)

DATA=

try:
    df = sqlContext.read.load(DATA+"/converted/dataset.parquet", format="parquet") 
    windowSpec = Window.partitionBy('A').orderBy('B')
    # Use the window specification defined above (the name `window` was never defined).
    df.select(rank().over(windowSpec), min('C').over(windowSpec)).show()
    sc.stop()
except Exception as e:
    print(str(e))
    print(sqlContext._get_hive_ctx())
    sc.stop()

1 answer:

Answer 0 (score: 0)

It looks like your configuration is missing one step (copying hive-site.xml):
http://spark.apache.org/docs/latest/sql-programming-guide.html#hive-tables

  

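Configuration of Hive is done by placing your hive-site.xml file in conf/. Please note when running the query on a YARN cluster (yarn-cluster mode), the datanucleus jars under the lib_managed/jars directory and hive-site.xml under conf/ directory need to be available on the driver and all executors launched by the YARN cluster. The convenient way to do this is adding them through the --jars option and --file option of the spark-submit command.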
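
As a rough illustration of the two steps described above, here is a minimal Python sketch. It only copies a file and assembles an argument list; it does not launch Spark. The function names `install_hive_site` and `build_submit_command` and the script name `bench.py` are hypothetical, and the directory layout (`conf/`, `lib_managed/jars/`, `bin/spark-submit` under the Spark home) is assumed from the documentation quoted above for a source build of Spark 1.x. Note that the spark-submit option is spelled `--files`.

```python
import glob
import os
import shutil


def install_hive_site(hive_site_path, spark_home):
    """Copy an existing hive-site.xml into <spark_home>/conf and return the new path."""
    conf_dir = os.path.join(spark_home, "conf")
    if not os.path.isdir(conf_dir):
        os.makedirs(conf_dir)
    target = os.path.join(conf_dir, "hive-site.xml")
    shutil.copyfile(hive_site_path, target)
    return target


def build_submit_command(spark_home, app_script):
    """Assemble a spark-submit argument list for yarn-cluster mode (Spark 1.x syntax).

    Assumption: the datanucleus jars live in <spark_home>/lib_managed/jars,
    as in the documentation quoted above.
    """
    jars = sorted(glob.glob(
        os.path.join(spark_home, "lib_managed", "jars", "datanucleus-*.jar")))
    hive_site = os.path.join(spark_home, "conf", "hive-site.xml")

    cmd = [os.path.join(spark_home, "bin", "spark-submit"),
           "--master", "yarn-cluster"]
    if jars:
        # --jars takes a single comma-separated list.
        cmd += ["--jars", ",".join(jars)]
    # --files ships hive-site.xml alongside the application.
    cmd += ["--files", hive_site, app_script]
    return cmd
```

For example, `build_submit_command("/Applications/spark-1.5.2", "bench.py")` returns a list that can be joined with spaces for inspection or passed to `subprocess.call`. Building the list first, rather than a shell string, avoids quoting problems if any path contains spaces.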