我正在尝试使用Hortonworks SPARK-ON-HBASE connector从Spark读取来自kerberized HBASE实例的数据。我的群集配置基本上类似于this:我将我的spark作业从客户端计算机提交到远程Spark独立群集,并且该作业正在尝试从单独的 HBASE群集中读取数据。
如果我通过直接在我的客户端上运行带有master = local [*]的Spark来绕过独立群集,只要我第一次从客户端开始,就可以访问远程HBASE群集。但是,当我将master作为远程集群设置为所有其他配置相同时,我在org.apache.hadoop.hbase.security.UserProvider.instantiate(UserProvider.java:43)
收到空指针异常(下面的完整堆栈跟踪)
有没有人完成了类似的架构,或许可以借出一些 洞察力?尽管日志中没有提到有关身份验证问题的任何内容,但我还是预感到,当从非kerberized Spark群集访问HBASE时,我可能会遇到身份验证问题。
完整堆栈跟踪:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0: java.lang.NullPointerException
at org.apache.hadoop.hbase.security.UserProvider.instantiate(UserProvider.java:43)
at org.apache.hadoop.hbase.client.ConnectionFactory.createConnection(ConnectionFactory.java:214)
at org.apache.hadoop.hbase.client.ConnectionFactory.createConnection(ConnectionFactory.java:119)
at org.apache.spark.sql.execution.datasources.hbase.TableResource.init(HBaseResources.scala:125)
at org.apache.spark.sql.execution.datasources.hbase.ReferencedResource$class.liftedTree1$1(HBaseResources.scala:57)
at org.apache.spark.sql.execution.datasources.hbase.ReferencedResource$class.acquire(HBaseResources.scala:54)
at org.apache.spark.sql.execution.datasources.hbase.TableResource.acquire(HBaseResources.scala:120)
at org.apache.spark.sql.execution.datasources.hbase.ReferencedResource$class.releaseOnException(HBaseResources.scala:74)
at org.apache.spark.sql.execution.datasources.hbase.TableResource.releaseOnException(HBaseResources.scala:120)
at org.apache.spark.sql.execution.datasources.hbase.TableResource.getScanner(HBaseResources.scala:144)
at org.apache.spark.sql.execution.datasources.hbase.HBaseTableScanRDD$$anonfun$7.apply(HBaseTableScan.scala:267)
at org.apache.spark.sql.execution.datasources.hbase.HBaseTableScanRDD$$anonfun$7.apply(HBaseTableScan.scala:266)
at scala.collection.parallel.mutable.ParArray$Map.leaf(ParArray.scala:658)
at scala.collection.parallel.Task$$anonfun$tryLeaf$1.apply$mcV$sp(Tasks.scala:54)
at scala.collection.parallel.Task$$anonfun$tryLeaf$1.apply(Tasks.scala:53)
at scala.collection.parallel.Task$$anonfun$tryLeaf$1.apply(Tasks.scala:53)
at scala.collection.parallel.Task$class.tryLeaf(Tasks.scala:56)
at scala.collection.parallel.mutable.ParArray$Map.tryLeaf(ParArray.scala:650)
at scala.collection.parallel.AdaptiveWorkStealingTasks$WrappedTask$class.compute(Tasks.scala:165)
at scala.collection.parallel.AdaptiveWorkStealingForkJoinTasks$WrappedTask.compute(Tasks.scala:514)
at scala.concurrent.forkjoin.RecursiveAction.exec(RecursiveAction.java:160)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1431)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1419)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1418)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1418)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:799)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1640)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1599)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1588)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:620)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1832)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1845)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1858)
at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:212)
at org.apache.spark.sql.execution.Limit.executeCollect(basicOperators.scala:165)
at org.apache.spark.sql.execution.SparkPlan.executeCollectPublic(SparkPlan.scala:174)
at org.apache.spark.sql.DataFrame$$anonfun$org$apache$spark$sql$DataFrame$$execute$1$1.apply(DataFrame.scala:1499)
at org.apache.spark.sql.DataFrame$$anonfun$org$apache$spark$sql$DataFrame$$execute$1$1.apply(DataFrame.scala:1499)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:56)
at org.apache.spark.sql.DataFrame.withNewExecutionId(DataFrame.scala:2086)
at org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$execute$1(DataFrame.scala:1498)
at org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$collect(DataFrame.scala:1505)
at org.apache.spark.sql.DataFrame$$anonfun$head$1.apply(DataFrame.scala:1375)
at org.apache.spark.sql.DataFrame$$anonfun$head$1.apply(DataFrame.scala:1374)
at org.apache.spark.sql.DataFrame.withCallback(DataFrame.scala:2099)
at org.apache.spark.sql.DataFrame.head(DataFrame.scala:1374)
at org.apache.spark.sql.DataFrame.take(DataFrame.scala:1456)
at org.apache.spark.sql.DataFrame.showString(DataFrame.scala:170)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:381)
at py4j.Gateway.invoke(Gateway.java:259)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:209)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.NullPointerException
at org.apache.hadoop.hbase.security.UserProvider.instantiate(UserProvider.java:43)
at org.apache.hadoop.hbase.client.ConnectionFactory.createConnection(ConnectionFactory.java:214)
at org.apache.hadoop.hbase.client.ConnectionFactory.createConnection(ConnectionFactory.java:119)
at org.apache.spark.sql.execution.datasources.hbase.TableResource.init(HBaseResources.scala:125)
at org.apache.spark.sql.execution.datasources.hbase.ReferencedResource$class.liftedTree1$1(HBaseResources.scala:57)
at org.apache.spark.sql.execution.datasources.hbase.ReferencedResource$class.acquire(HBaseResources.scala:54)
at org.apache.spark.sql.execution.datasources.hbase.TableResource.acquire(HBaseResources.scala:120)
at org.apache.spark.sql.execution.datasources.hbase.ReferencedResource$class.releaseOnException(HBaseResources.scala:74)
at org.apache.spark.sql.execution.datasources.hbase.TableResource.releaseOnException(HBaseResources.scala:120)
at org.apache.spark.sql.execution.datasources.hbase.TableResource.getScanner(HBaseResources.scala:144)
at org.apache.spark.sql.execution.datasources.hbase.HBaseTableScanRDD$$anonfun$7.apply(HBaseTableScan.scala:267)
at org.apache.spark.sql.execution.datasources.hbase.HBaseTableScanRDD$$anonfun$7.apply(HBaseTableScan.scala:266)
at scala.collection.parallel.mutable.ParArray$Map.leaf(ParArray.scala:658)
at scala.collection.parallel.Task$$anonfun$tryLeaf$1.apply$mcV$sp(Tasks.scala:54)
at scala.collection.parallel.Task$$anonfun$tryLeaf$1.apply(Tasks.scala:53)
at scala.collection.parallel.Task$$anonfun$tryLeaf$1.apply(Tasks.scala:53)
at scala.collection.parallel.Task$class.tryLeaf(Tasks.scala:56)
at scala.collection.parallel.mutable.ParArray$Map.tryLeaf(ParArray.scala:650)
at scala.collection.parallel.AdaptiveWorkStealingTasks$WrappedTask$class.compute(Tasks.scala:165)
at scala.collection.parallel.AdaptiveWorkStealingForkJoinTasks$WrappedTask.compute(Tasks.scala:514)
at scala.concurrent.forkjoin.RecursiveAction.exec(RecursiveAction.java:160)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
答案 0 :(得分:0)
您遇到配置问题,其中hbase.client.userprovider.class配置不可用。您需要确保hbase libs和conf文件位于spark执行器的路径上。
private static final String USER_PROVIDER_CONF_KEY = "hbase.client.userprovider.class";
/**
* Instantiate the {@link UserProvider} specified in the configuration and set the passed
* configuration via {@link UserProvider#setConf(Configuration)}
* @param conf to read and set on the created {@link UserProvider}
* @return a {@link UserProvider} ready for use.
*/
public static UserProvider instantiate(Configuration conf) {
Class<? extends UserProvider> clazz =
conf.getClass(USER_PROVIDER_CONF_KEY, UserProvider.class, UserProvider.class);
return ReflectionUtils.newInstance(clazz, conf);
}
答案 1 :(得分:0)
我偶然发现了这个症状(但根本原因可能不一样),并找到了一个你可能不想尝试的 非常脏的解决方法 。
$$ Context $$ Cloudera 发行版,HBase 1.2.0-CDH5.7.0
java.lang.NoSuchMethodError: org.apache.hadoop.hbase.client.Scan.setCaching(I)Lorg/apache/hadoop/hbase/client/Scan;
$$解决方法#1 $$
/etc/hbase/conf/
)与Spark-HBase JAR一起添加到spark.driver.extraClassPath
。--jars
和Spark-HBase JAR发送这些JAR到执行程序(如果conf,请不要忘记/etc/hbase/conf/
中的目录spark.executor.extraClassPath
存在于所有YARN节点上;否则找到将XML发送到其容器CLASSPATH中的目录的方法)
org.apache.hadoop.hbase.security.UserProvider.instantiate(Configuration)
,因此org.apache.hadoop.hbase.client.ConnectionFactory.createConnection(Configuration, boolean, ExecutorService, User)
$$解决方法#2 $$
java.lang.NullPointerException
参数为NULL,就会通过调用conf
静默替换它org.apache.hadoop.hbase.HBaseConfiguration.create()
可执行文件修补Spark-HBase插件肯定会更有意义(参见that post中 ray3888 的评论)但是Scala让我呕吐所以我坚持使用plain'old Java。