Aggregation with Spark on HBase takes too long (using shc 1.1.1-2.1-s_2.11)

Date: 2018-10-23 08:42:36

Tags: apache-spark apache-spark-sql hbase

I am ingesting streaming data into HBase. I have pre-split the HBase table by Kafka partition, and I use a composite row key that combines the Kafka partition, a timestamp, and a few other columns to make it unique. With this approach I get very good throughput when inserting data, but I also have to aggregate the data every day, and that is very slow.
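
The row-key construction itself is not shown in the question; going by the read catalog further below, a minimal sketch of it could look like the following (streamDf, the column order, and the lack of a delimiter are assumptions, not the author's actual code):

import org.apache.spark.sql.functions.{col, concat}

// Hypothetical sketch: build the composite row key from the Kafka partition,
// the event timestamp and a few identifying columns, in the order the read
// catalog later decomposes it. The real delimiter/encoding is not shown.
val withKey = streamDf.withColumn(
  "key",
  concat(col("partition"), col("timestamp"), col("id"), col("parent_id"), col("resource_id"))
)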

We have observed that the number of Spark tasks triggered for the groupBy/count is equal to the total number of regions my table is distributed across.
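
The aggregation job itself is not included in the question; as a reference point, the daily roll-up is presumably something along these lines (the grouping column is a guess), and its scan stage launches one task per HBase region:

// Hypothetical daily aggregation -- the question only mentions a groupBy/count.
// The HBase scan behind `df` is split into one Spark task per region.
val dailyCounts = df
  .groupBy(col("resource_id"))
  .count()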

Am I doing something wrong here? How can I limit the number of regions for a table in HBase?

HBase create statement

create 'default:test', {NAME => 'data', VERSIONS => 1, TTL => '3888000'},{SPLITS=> ['10000000000000000000000000000000','20000000000000000000000000000000','30000000000000000000000000000000','40000000000000000000000000000000','50000000000000000000000000000000','60000000000000000000000000000000','70000000000000000000000000000000','80000000000000000000000000000000','90000000000000000000000000000000']}

Catalog used for inserting

def catalog = s"""{
  |"table":{"namespace":"default", "name": "test", "tableCoder":"Phoenix"},
  |"rowkey":"key",
  |"columns":{
  |"rowkey":{"cf":"rowkey", "col":"key", "type":"string"},
  |"resource_id":{"cf":"data", "col":"resource_id", "type":"string"},
  |"resource_name":{"cf":"data", "col":"resource_name", "type":"string"},
  |"parent_id":{"cf":"data", "col":"parent_id", "type":"string"},  
  |"parent_name":{"cf":"data", "col":"parent_name", "type":"string"},
  |"id":{"cf":"data", "col":"id", "type":"string"},
  |"name":{"cf":"data", "col":"name", "type":"string"},
  |"timestamp":{"cf":"data", "col":"timestamp", "type":"string"},
  |"readable_timestamp":{"cf":"data", "col":"readable_timestamp", 
    "type":"string"},
  |"value":{"cf":"data", "col":"value", "type":"string"},
  |"partition":{"cf":"data", "col":"partition", "type":"string"}
  |}
|}""".stripMargin

Catalog used for reading

def catalog = s"""{
    |"table":{"namespace":"default", "name": "test", "tableCoder":"Phoenix"},
    |"rowkey":"partition:timestamp:id:parent_id:resource_id",
    |"columns":{   
    |"partition":{"cf":"rowkey", "col":"partition", "type":"string"},
    |"timestamp":{"cf":"rowkey", "col":"timestamp", "type":"string"},
    |"id":{"cf":"rowkey", "col":"id", "type":"string"},
    |"parent_id":{"cf":"rowkey", "col":"parent_id", "type":"string"},
    |"resource_id":{"cf":"rowkey", "col":"resource_id", "type":"string"},
    |"resource_name":{"cf":"data", "col":"resource_name", "type":"string"}, 
    |"parent_name":{"cf":"data", "col":"parent_name", "type":"string"},
    |"name":{"cf":"data", "col":"name", "type":"string"},
    |"value":{"cf":"data", "col":"value", "type":"string"},
    |"readable_timestamp":{"cf":"data", "col":"readable_timestamp", "type":"string"}
    |}
    |}""".stripMargin

For the range scan I am using all the partitions together with the time range:

val endrow = "1540080000000"
val startrow = "1539993600000"

df.filter(($"partition"==="0" && ($"timestamp" >= startrow && $"timestamp" <= endrow)) ||($"partition"==="1" && ($"timestamp" >= startrow && $"timestamp" <= endrow))||($"partition"==="2" && ($"timestamp" >= startrow && $"timestamp" <= endrow))||($"partition"==="3" && ($"timestamp" >= startrow && $"timestamp" <= endrow))||($"partition"==="4" && ($"timestamp" >= startrow && $"timestamp" <= endrow))||($"partition"==="5" && ($"timestamp" >= startrow && $"timestamp" <= endrow))||($"partition"==="6" && ($"timestamp" >= startrow && $"timestamp" <= endrow))||($"partition"==="7" && ($"timestamp" >= startrow && $"timestamp" <= endrow))||($"partition"==="8" && ($"timestamp" >= startrow && $"timestamp" <= endrow))||($"partition"==="9" && ($"timestamp" >= startrow && $"timestamp" <= endrow))).count

The filter above executes 830 tasks, which is equal to the number of regions. This takes far too long. How can I improve it?
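
Since every partition branch uses the same time window, the predicate can at least be written more compactly; the sketch below is logically equivalent to the filter above (whether shc pushes it down to the same row-key ranges would have to be verified):

import org.apache.spark.sql.functions.col

// Logically the same as the ten OR-ed branches above: partitions "0".."9",
// each restricted to the same one-day timestamp window.
val partitions = (0 to 9).map(_.toString)

val count = df
  .filter(col("partition").isin(partitions: _*) &&
          col("timestamp") >= startrow && col("timestamp") <= endrow)
  .count()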

0 Answers:

No answers yet.