I want to insert data into HBase from PySpark. Below is the code I have implemented; when I try to write the data to HBase I get a NullPointerException.
I launch the PySpark shell with:
pyspark --master yarn --deploy-mode client --jars hbase-spark-2.1.0-cdh6.1.0.jar --driver-class-path hbase-spark-2.1.0-cdh6.1.0.jar
Code:
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext()
sqlc = SQLContext(sc)

# HBase-Spark connector data source
data_source_format = 'org.apache.hadoop.hbase.spark'

# Two-row test DataFrame; col0 is meant to be the row key, col1 goes into column family cf
df = sc.parallelize([('a', '1.0'), ('b', '2.0')]).toDF(schema=['col0', 'col1'])

# Catalog mapping the DataFrame columns to the HBase table "testtable"
catalog = ''.join("""{
    "table":{"namespace":"default", "name":"testtable"},
    "rowkey":"key",
    "columns":{
        "col0":{"cf":"rowkey", "col":"key", "type":"string"},
        "col1":{"cf":"cf", "col":"col1", "type":"string"}
    }
}""".split())

df.write.options(catalog=catalog, newtable=5).format(data_source_format).save()
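For reference, my understanding is that the same catalog string is also what the connector uses for reads, so once the write works I would expect to read the table back roughly like this (untested sketch, since the save call above is what fails):

# Untested sketch: read the same HBase table back with the same catalog,
# just to show how the catalog JSON above is meant to be consumed.
df_read = sqlc.read.options(catalog=catalog).format(data_source_format).load()
df_read.show()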
Error:
df.write.options(catalog=catalog, newtable=5).format(data_source_format).save()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/app/hadoop/parcels/CDH-6.1.0-1.cdh6.1.0.p0.770702/lib/spark/python/pyspark/sql/readwriter.py", line 734, in save
    self._jwrite.save()
  File "/app/hadoop/parcels/CDH-6.1.0-1.cdh6.1.0.p0.770702/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
  File "/app/hadoop/parcels/CDH-6.1.0-1.cdh6.1.0.p0.770702/lib/spark/python/pyspark/sql/utils.py", line 63, in deco
    return f(*a, **kw)
  File "/app/hadoop/parcels/CDH-6.1.0-1.cdh6.1.0.p0.770702/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o107.save.
: java.lang.NullPointerException
    at org.apache.hadoop.hbase.spark.HBaseRelation.<init>(DefaultSource.scala:139)
    at org.apache.hadoop.hbase.spark.DefaultSource.createRelation(DefaultSource.scala:79)
    at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:45)
    at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
    at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
    at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:86)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
    at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
    at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
    at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80)
    at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:668)
    at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:668)
    at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
    at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
    at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
    at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:668)
    at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:276)
    at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:270)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:282)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:238)
    at java.lang.Thread.run(Thread.java:748)
Thanks.