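HBase: is there a faster way to get all regions and their corresponding start and end keys?

Asked: 2015-08-24 18:20:18

Tags: java hbase

I am trying to override the HBase method MultiTableInputFormat.getSplits(). My implementation looks like this: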

  @Override
  public List<InputSplit> getSplits(JobContext context) throws IOException {
    List<Scan> scans = getScans();
    List<InputSplit> splits = new ArrayList<>();
    // The table name is taken from the first scan only.
    Scan sampleScan = scans.get(0);
    byte[] tableNameBytes = sampleScan.getAttribute(Scan.SCAN_ATTRIBUTES_TABLE_NAME);
    TableName tableName = TableName.valueOf(tableNameBytes);

    // try-with-resources closes the connection, locator, and admin once the splits are built.
    try (Connection conn = ConnectionFactory.createConnection(context.getConfiguration());
         RegionLocator regionLocator = conn.getRegionLocator(tableName);
         Admin admin = conn.getAdmin()) {
      // Parallel arrays: getFirst() holds the start keys, getSecond() the end keys.
      Pair<byte[][], byte[][]> keys = regionLocator.getStartEndKeys();
      RegionSizeCalculator sizeCalculator = new RegionSizeCalculator(regionLocator, admin);
      int regionCount = keys.getFirst().length;

      for (int i = 0; i < regionCount; i++) {
        calculateSplits(
          keys.getFirst()[i],
          keys.getSecond()[i],
          regionLocator,
          sizeCalculator,
          splits
        );
      }
    }
    return splits;
  }

  private void calculateSplits(
    final byte[] startKey,
    final byte[] endKey,
    RegionLocator regionLocator,
    RegionSizeCalculator sizeCalculator,
    List<InputSplit> splits
  ) throws IOException {
    HRegionLocation hregionLocation = regionLocator.getRegionLocation(startKey, false);
    String regionHostname = hregionLocation.getHostname();
    HRegionInfo regionInfo = hregionLocation.getRegionInfo();

    for (Scan scan : getScans()) {
      byte[] startRow = scan.getStartRow();
      byte[] stopRow = scan.getStopRow();

      // determine if the given start and stop keys fall into the range
      if (
        (startRow.length == 0 || endKey.length == 0 || Bytes.compareTo(startRow, endKey) < 0) &&
        (stopRow.length == 0 || Bytes.compareTo(stopRow, startKey) > 0)
        ) {
        byte[] splitStart = startRow.length == 0 || Bytes.compareTo(startKey, startRow) >= 0 ?
          startKey : startRow;
        byte[] splitStop =
          (stopRow.length == 0 || Bytes.compareTo(endKey, stopRow) <= 0) && endKey.length > 0 ?
            endKey : stopRow;

        long regionSize = sizeCalculator.getRegionSize(regionInfo.getRegionName());
        TableSplit split = new TableSplit(
          regionLocator.getName(), scan, splitStart, splitStop, regionHostname, regionSize
        );
        splits.add(split);
      }
    }
  }

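The basic idea of this code is to get all regions together with their start and end keys. We also have a list of scans, and we check every scan against every region to produce all the splits. But this code is very slow, mainly because we have about 10,000 regions, so looking up the location and computing the information for each region one at a time takes a long time.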
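The overlap test and clipping that calculateSplits() applies to each scan/region pair can be tried in isolation. The following is a minimal, library-free sketch of that logic, not part of the original post: the class name SplitRangeSketch and the helpers compare(), clip(), and b() are hypothetical, and compare() is assumed to order keys the same way as Bytes.compareTo (unsigned, lexicographic). As in the code above, an empty array means "unbounded".

```java
import java.nio.charset.StandardCharsets;

public class SplitRangeSketch {

  // Unsigned lexicographic comparison of two byte arrays
  // (hypothetical stand-in for HBase's Bytes.compareTo).
  public static int compare(byte[] a, byte[] b) {
    int n = Math.min(a.length, b.length);
    for (int i = 0; i < n; i++) {
      int diff = (a[i] & 0xff) - (b[i] & 0xff);
      if (diff != 0) {
        return diff;
      }
    }
    return a.length - b.length;
  }

  // Returns {splitStart, splitStop} for one scan and one region,
  // or null when the scan range and the region do not overlap.
  public static byte[][] clip(byte[] startRow, byte[] stopRow, byte[] startKey, byte[] endKey) {
    boolean overlaps =
        (startRow.length == 0 || endKey.length == 0 || compare(startRow, endKey) < 0)
            && (stopRow.length == 0 || compare(stopRow, startKey) > 0);
    if (!overlaps) {
      return null;
    }
    // The split starts at whichever is later: the region start or the scan start.
    byte[] splitStart =
        (startRow.length == 0 || compare(startKey, startRow) >= 0) ? startKey : startRow;
    // The split stops at whichever is earlier: the region end or the scan stop.
    byte[] splitStop =
        ((stopRow.length == 0 || compare(endKey, stopRow) <= 0) && endKey.length > 0)
            ? endKey : stopRow;
    return new byte[][] {splitStart, splitStop};
  }

  // Convenience helper for building row keys from strings.
  public static byte[] b(String s) {
    return s.getBytes(StandardCharsets.UTF_8);
  }

  public static void main(String[] args) {
    // Scan [c, x) against region [a, m) is clipped to [c, m).
    byte[][] r = clip(b("c"), b("x"), b("a"), b("m"));
    System.out.println(new String(r[0], StandardCharsets.UTF_8)
        + " .. " + new String(r[1], StandardCharsets.UTF_8)); // prints "c .. m"
  }
}
```

Each call is cheap, so with about 10,000 regions the comparisons themselves are unlikely to be the bottleneck; the per-region location lookup is the more plausible cost.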

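I noticed that RegionLocator also has a method called getAllRegionLocations(). I think I could use it to fetch all regions at once and save a lot of time. The problem is that if I use this method, I cannot get the corresponding start and end keys, so I cannot decide the range of each split. Is there a better solution that would make this faster?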

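1 Answer:

Answer 0 (score: 0)

Solved! I found that we can get the start key and end key from regionInfo. So first fetch the list of region locations with getAllRegionLocations(), loop over every HRegionLocation in that list, and the second method becomes: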

  private void calculateSplits(
    HRegionLocation hRegionLocation,
    RegionLocator regionLocator,
    RegionSizeCalculator sizeCalculator,
    List<InputSplit> splits
  ) throws IOException {
    String regionHostname = hRegionLocation.getHostname();
    HRegionInfo regionInfo = hRegionLocation.getRegionInfo();
    // The region boundaries come straight from its HRegionInfo, so no per-region lookup is needed.
    final byte[] startKey = regionInfo.getStartKey();
    final byte[] endKey = regionInfo.getEndKey();
    ...
  }