Spark scan HBase: does scanning specific columns reduce efficiency?

Time: 2016-01-06 14:18:06

Tags: scala apache-spark hbase

Today I used Spark to scan HBase. My table has one column family named "cf", and "cf" contains 25 columns. I want to scan just one of those columns, e.g. column8, so I set the HBase conf as follows:

    import org.apache.hadoop.hbase.HBaseConfiguration
    import org.apache.hadoop.hbase.mapreduce.TableInputFormat
    import org.apache.hadoop.hbase.util.Bytes

    // `table` holds the table name and `sc` is the SparkContext
    val myConf = HBaseConfiguration.create()
    myConf.set("hbase.zookeeper.quorum", "compute000,compute001,compute002")
    myConf.set("hbase.master", "10.10.10.10:60000")
    myConf.set("hbase.zookeeper.property.clientPort", "2181")
    myConf.set("hbase.defaults.for.version.skip", "true")
    myConf.set(TableInputFormat.INPUT_TABLE, table)
    // restrict the scan to a single column of the "cf" family
    myConf.set(TableInputFormat.SCAN_COLUMNS, "cf:column8")

    val hbaseRDD = sc.newAPIHadoopRDD(myConf, classOf[TableInputFormat],
      classOf[org.apache.hadoop.hbase.io.ImmutableBytesWritable],
      classOf[org.apache.hadoop.hbase.client.Result])

    val newHbaseRDD = hbaseRDD.map { case (_, result) =>
      Array(Bytes.toString(result.getValue("cf".getBytes, "column8".getBytes)).toDouble)
    }

    newHbaseRDD // RDD[Array[Double]]

It takes 30 minutes, but if I do not set SCAN_COLUMNS it only takes 4 minutes.

What went wrong? Should I not set the 'SCAN_COLUMNS' parameter?

Can you help me? Thanks a lot.
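For comparison, the same column restriction can also be expressed with an explicit Scan object serialized into the configuration. A minimal sketch, assuming TableMapReduceUtil from the HBase mapreduce module is available:

    import org.apache.hadoop.hbase.client.Scan
    import org.apache.hadoop.hbase.mapreduce.{TableInputFormat, TableMapReduceUtil}
    import org.apache.hadoop.hbase.util.Bytes

    val scan = new Scan()
    // equivalent to setting SCAN_COLUMNS to "cf:column8"
    scan.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("column8"))
    // TableInputFormat picks up a serialized Scan from the SCAN key
    myConf.set(TableInputFormat.SCAN, TableMapReduceUtil.convertScanToString(scan))

Either way the column filter travels with the scan to the region servers, so both forms should behave the same.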

Update

    val scanColumn = "cf:auth_fail_numbers cf:auth_numbers cf:cpu_use_rate" +
      " cf:harddisk_use_rate cf:http_req_fail_num cf:http_req_num cf:memory_use_rate" +
      " cf:multi_abend_numbers cf:multi_fail_numbers cf:multi_req_numbers" +
      " cf:vod_fail_numbers cf:vod_req_numbers"
    myConf.set(TableInputFormat.SCAN_COLUMNS, scanColumn)

When I use this code to scan several columns, the application fails with the following error:

    ERROR TaskSetManager: Task 22 in stage 0.0 failed 4 times; aborting job
    Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 22 in stage 0.0 failed 4 times, most recent failure: Lost task 22.3 in stage 0.0 (TID 234, compute031): java.lang.NullPointerException
    at no1.no1$$anonfun$9.apply(no1.scala:137)
    at no1.no1$$anonfun$9.apply(no1.scala:137)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
    at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:201)
    at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:56)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:70)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
    at org.apache.spark.scheduler.Task.run(Task.scala:70)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)

Driver stacktrace:
    at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1273)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1264)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1263)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
    at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1263)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:730)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:730)
    at scala.Option.foreach(Option.scala:236)
    at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:730)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1457)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1418)
    at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
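The NullPointerException is consistent with Bytes.toString being handed a null: result.getValue returns null for any row that lacks one of the requested columns, which only surfaces once the scan is restricted to specific columns. A null-safe version of the map step, as a sketch (the column name and the 0.0 default are illustrative assumptions):

    import org.apache.hadoop.hbase.util.Bytes

    val safeRDD = hbaseRDD.map { case (_, result) =>
      // getValue returns null when the row has no cell for this column,
      // so wrap it in Option instead of calling Bytes.toString directly
      Option(result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("auth_numbers")))
        .map(bytes => Bytes.toString(bytes).toDouble)
        .getOrElse(0.0) // illustrative default for missing cells
    }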

0 Answers

There are no answers yet.