Today I used Spark to scan HBase. My HBase table has one column family named "cf", and "cf" contains 25 columns. I want to scan one of the columns, for example column8, so I set the HBase conf:
val myConf = HBaseConfiguration.create()
myConf.set("hbase.zookeeper.quorum", "compute000,compute001,compute002")
myConf.set("hbase.master", "10.10.10.10:60000")
myConf.set("hbase.zookeeper.property.clientPort", "2181")
myConf.set("hbase.defaults.for.version.skip", "true")
myConf.set(TableInputFormat.INPUT_TABLE, table)
myConf.set(TableInputFormat.SCAN_COLUMNS, "cf:column8")
val hbaseRDD = sc.newAPIHadoopRDD(myConf, classOf[TableInputFormat],
classOf[org.apache.hadoop.hbase.io.ImmutableBytesWritable],
classOf[org.apache.hadoop.hbase.client.Result])
val newHbaseRDD = hbaseRDD.map { case (_, result) =>
  // "cf" is the column family and "column8" the qualifier being scanned
  Array(Bytes.toString(result.getValue("cf".getBytes, "column8".getBytes)).toDouble)
}
newHbaseRDD //Array[Double]
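One thing worth guarding against in the map above: `Result.getValue` returns `null` when a row has no cell in the requested column, and calling `.toDouble` on the resulting `null` string throws a `NullPointerException`. A minimal, self-contained sketch of a null-safe conversion (the helper name `cellToDouble` is my own, not from the original code):

```scala
// Hypothetical helper: convert a possibly-missing HBase cell value to a Double.
// In the real job, `raw` would be result.getValue("cf".getBytes, "column8".getBytes),
// which is null when the row has no cell in that column.
def cellToDouble(raw: Array[Byte]): Option[Double] =
  Option(raw).map(bytes => new String(bytes, "UTF-8").toDouble)

println(cellToDouble("3.14".getBytes("UTF-8")))  // Some(3.14)
println(cellToDouble(null))                       // None
```

Inside the RDD you could then use `flatMap` with this helper to silently drop rows that are missing the column, instead of failing the task.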
It takes 30 minutes, but if I do not set SCAN_COLUMNS, it only takes 4 minutes.
What is wrong? Should I not set the 'SCAN_COLUMNS' parameter?
Can you help me? Thanks a lot.
val scanColumn = "cf:auth_fail_numbers cf:auth_numbers cf:cpu_use_rate"+
" cf:harddisk_use_rate cf:http_req_fail_num cf:http_req_num cf:memory_use_rate" +
" cf:multi_abend_numbers cf:multi_fail_numbers cf:multi_req_numbers" +
" cf:vod_fail_numbers cf:vod_req_numbers"
myConf.set(TableInputFormat.SCAN_COLUMNS, scanColumn)
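For reference, `TableInputFormat.SCAN_COLUMNS` expects `family:qualifier` entries separated by single spaces, which is what the string concatenation above builds. A small sketch of a helper that constructs this value (the function `buildScanColumns` is my own illustration, not part of the HBase API):

```scala
// Hypothetical helper: build the SCAN_COLUMNS value from a family and qualifiers.
// TableInputFormat parses the value as space-separated "family:qualifier" pairs.
def buildScanColumns(family: String, qualifiers: Seq[String]): String =
  qualifiers.map(q => s"$family:$q").mkString(" ")

val cols = buildScanColumns("cf", Seq("auth_fail_numbers", "auth_numbers"))
println(cols)  // cf:auth_fail_numbers cf:auth_numbers
```

Building the string this way avoids the easy mistake of dropping the leading space when concatenating literals across lines.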
When I use this code to scan several columns, the application fails with this error:
ERROR TaskSetManager: Task 22 in stage 0.0 failed 4 times;
aborting job
Exception in thread "main" org.apache.spark.SparkException:
Job aborted due to stage failure: Task 22 in stage 0.0 failed 4 times,
most recent failure: Lost task 22.3 in stage 0.0 (TID 234, compute031): java.lang.NullPointerException
at no1.no1$$anonfun$9.apply(no1.scala:137)
at no1.no1$$anonfun$9.apply(no1.scala:137)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:201)
at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:56)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:70)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:70)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler
$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1273)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1264)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1263)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1263)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:730)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:730)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:730)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1457)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1418)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)