We are trying to test the following example code for accessing an HBase table (Spark 1.3.1, HBase 1.1.1, Hadoop 2.7.0):
import sys
from pyspark import SparkContext

if __name__ == "__main__":
    if len(sys.argv) != 3:
        print >> sys.stderr, """
        Usage: hbase_inputformat <host> <table>

        Run with example jar:
        ./bin/spark-submit --driver-class-path /path/to/example/jar \
        /path/to/examples/hbase_inputformat.py <host> <table>

        Assumes you have some data in HBase already, running on <host>, in <table>
        """
        exit(-1)

    host = sys.argv[1]
    table = sys.argv[2]
    sc = SparkContext(appName="HBaseInputFormat")

    conf = {"hbase.zookeeper.quorum": host, "hbase.mapreduce.inputtable": table}
    keyConv = "org.apache.spark.examples.pythonconverters.ImmutableBytesWritableToStringConverter"
    valueConv = "org.apache.spark.examples.pythonconverters.HBaseResultToStringConverter"

    hbase_rdd = sc.newAPIHadoopRDD(
        "org.apache.hadoop.hbase.mapreduce.TableInputFormat",
        "org.apache.hadoop.hbase.io.ImmutableBytesWritable",
        "org.apache.hadoop.hbase.client.Result",
        keyConverter=keyConv,
        valueConverter=valueConv,
        conf=conf)
    output = hbase_rdd.collect()
    for (k, v) in output:
        print (k, v)

    sc.stop()
We get the following error:
15/10/14 12:46:24 INFO BlockManagerMaster: Registered BlockManager
Traceback (most recent call last):
  File "/opt/python/son.py", line 30, in <module>
    conf=conf)
  File "/usr/hdp/2.3.0.0-2557/spark/python/pyspark/context.py", line 547, in newAPIHadoopRDD
    jconf, batchSize)
  File "/usr/hdp/2.3.0.0-2557/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 538, in __call__
  File "/usr/hdp/2.3.0.0-2557/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.newAPIHadoopRDD.
: java.lang.ClassNotFoundException: org.apache.hadoop.hbase.io.ImmutableBytesWritable
    at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
    at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
    at java.lang.Class.forName0(Native Method)
    at java.lang.Class.forName(Class.java:278)
    at org.apache.spark.util.Utils$.classForName(Utils.scala:157)
    at org.apache.spark.api.python.PythonRDD$.newAPIHadoopRDDFromClassNames(PythonRDD.scala:509)
    at org.apache.spark.api.python.PythonRDD$.newAPIHadoopRDD(PythonRDD.scala:494)
    at org.apache.spark.api.python.PythonRDD.newAPIHadoopRDD(PythonRDD.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
    at py4j.Gateway.invoke(Gateway.java:259)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:207)
    at java.lang.Thread.run(Thread.java:745)
Any insight would be much appreciated.
Answer 0 (score: 0)
The error occurs because the HBase libraries are not on your classpath. You will need the hbase-common and hbase-client jars, which you should pass to pyspark via the --jars argument, as sketched below.
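For example, a minimal spark-submit sketch. The jar paths here are assumptions: point HBASE_LIB at your actual HBase install (HBase 1.1.1 in your setup), and keep the spark-examples jar on the driver classpath since the script uses the example Python converters, as its own usage string notes:

    # Jar locations are assumptions -- adjust to your HBase install.
    HBASE_LIB=/usr/lib/hbase/lib

    ./bin/spark-submit \
      --jars $HBASE_LIB/hbase-common-1.1.1.jar,$HBASE_LIB/hbase-client-1.1.1.jar \
      --driver-class-path /path/to/spark-examples.jar \
      /opt/python/son.py <host> <table>

Note that --jars takes a comma-separated list, and depending on your cluster you may need to add further HBase dependency jars (e.g. hbase-protocol) to that same list.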