HBase

Date: 2015-10-14 13:45:03

Tags: hadoop apache-spark

We are trying to run the following example code for accessing an HBase table (Spark 1.3.1, HBase 1.1.1, Hadoop 2.7.0):

import sys

from pyspark import SparkContext

if __name__ == "__main__":

    if len(sys.argv) != 3:
        print >> sys.stderr, """
        Usage: hbase_inputformat <host> <table>
        Run with example jar:
        ./bin/spark-submit --driver-class-path /path/to/example/jar \
        /path/to/examples/hbase_inputformat.py <host> <table>
        Assumes you have some data in HBase already, running on <host>, in <table>
        """
        exit(-1)

    host = sys.argv[1]
    table = sys.argv[2]
    sc = SparkContext(appName="HBaseInputFormat")

    conf = {"hbase.zookeeper.quorum": host, "hbase.mapreduce.inputtable": table}
    keyConv = "org.apache.spark.examples.pythonconverters.ImmutableBytesWritableToStringConverter"
    valueConv = "org.apache.spark.examples.pythonconverters.HBaseResultToStringConverter"

    # Create an RDD over the table using HBase's TableInputFormat
    hbase_rdd = sc.newAPIHadoopRDD(
        "org.apache.hadoop.hbase.mapreduce.TableInputFormat",
        "org.apache.hadoop.hbase.io.ImmutableBytesWritable",
        "org.apache.hadoop.hbase.client.Result",
        keyConverter=keyConv,
        valueConverter=valueConv,
        conf=conf)
    output = hbase_rdd.collect()
    for (k, v) in output:
        print (k, v)

    sc.stop() 

We get the following error:

15/10/14 12:46:24 INFO BlockManagerMaster: Registered BlockManager
Traceback (most recent call last):
  File "/opt/python/son.py", line 30, in <module>
    conf=conf)
  File "/usr/hdp/2.3.0.0-2557/spark/python/pyspark/context.py", line 547, in newAPIHadoopRDD
    jconf, batchSize)
  File "/usr/hdp/2.3.0.0-2557/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 538, in __call__
  File "/usr/hdp/2.3.0.0-2557/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.newAPIHadoopRDD.
: java.lang.ClassNotFoundException: org.apache.hadoop.hbase.io.ImmutableBytesWritable
        at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
        at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
        at java.security.AccessController.doPrivileged(Native Method)
        at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
        at java.lang.Class.forName0(Native Method)
        at java.lang.Class.forName(Class.java:278)
        at org.apache.spark.util.Utils$.classForName(Utils.scala:157)
        at org.apache.spark.api.python.PythonRDD$.newAPIHadoopRDDFromClassNames(PythonRDD.scala:509)
        at org.apache.spark.api.python.PythonRDD$.newAPIHadoopRDD(PythonRDD.scala:494)
        at org.apache.spark.api.python.PythonRDD.newAPIHadoopRDD(PythonRDD.scala)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
        at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
        at py4j.Gateway.invoke(Gateway.java:259)
        at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
        at py4j.commands.CallCommand.execute(CallCommand.java:79)
        at py4j.GatewayConnection.run(GatewayConnection.java:207)
        at java.lang.Thread.run(Thread.java:745)

Any insight would be greatly appreciated.

1 Answer:

Answer 0 (score: 0)

The error occurs because you do not have the HBase libraries on your classpath. You will need the hbase-common and hbase-client jars, which you should pass to pyspark via the --jars argument.
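
For example, a minimal sketch of the submit command (the jar paths and version numbers below are illustrative and must be adjusted to your installation; the spark-examples jar is also needed, since it provides the pythonconverters classes used in the script's usage string above):

    ./bin/spark-submit \
        --driver-class-path /path/to/spark-examples.jar \
        --jars /path/to/hbase-common-1.1.1.jar,/path/to/hbase-client-1.1.1.jar \
        /path/to/examples/hbase_inputformat.py <host> <table>

Note that --jars takes a comma-separated list and ships the jars to the executors as well, whereas --driver-class-path only affects the driver's classpath.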