I am trying to override the HBase method MultiTableInputFormat.getSplits(). This is my implementation:
public List<InputSplit> getSplits(JobContext context) throws IOException {
List<Scan> scans = getScans();
List<InputSplit> splits = new ArrayList<>();
Scan sampleScan = scans.get(0);
byte[] tableNameBytes = sampleScan.getAttribute(Scan.SCAN_ATTRIBUTES_TABLE_NAME);
TableName tableName = TableName.valueOf(tableNameBytes);
Connection conn = ConnectionFactory.createConnection(context.getConfiguration());
// Take the locator from the connection; casting a Table to RegionLocator is redundant
// and only works when the concrete Table class also implements RegionLocator.
RegionLocator regionLocator = conn.getRegionLocator(tableName);
Pair<byte[][], byte[][]> keys = regionLocator.getStartEndKeys();
RegionSizeCalculator sizeCalculator = new RegionSizeCalculator(
regionLocator, conn.getAdmin()
);
int regionCount = keys.getFirst().length;
for (int i = 0; i < regionCount; i++) {
calculateSplits(
keys.getFirst()[i],
keys.getSecond()[i],
regionLocator,
sizeCalculator,
splits
);
}
return splits;
}
private void calculateSplits(
final byte[] startKey,
final byte[] endKey,
RegionLocator regionLocator,
RegionSizeCalculator sizeCalculator,
List<InputSplit> splits
) throws IOException {
HRegionLocation hregionLocation = regionLocator.getRegionLocation(startKey, false);
String regionHostname = hregionLocation.getHostname();
HRegionInfo regionInfo = hregionLocation.getRegionInfo();
for (Scan scan : getScans()) {
byte[] startRow = scan.getStartRow();
byte[] stopRow = scan.getStopRow();
// determine if the given start and stop keys fall into the range
if (
(startRow.length == 0 || endKey.length == 0 || Bytes.compareTo(startRow, endKey) < 0) &&
(stopRow.length == 0 || Bytes.compareTo(stopRow, startKey) > 0)
) {
byte[] splitStart = startRow.length == 0 || Bytes.compareTo(startKey, startRow) >= 0 ?
startKey : startRow;
byte[] splitStop =
(stopRow.length == 0 || Bytes.compareTo(endKey, stopRow) <= 0) && endKey.length > 0 ?
endKey : stopRow;
long regionSize = sizeCalculator.getRegionSize(regionInfo.getRegionName());
TableSplit split = new TableSplit(
regionLocator.getName(), scan, splitStart, splitStop, regionHostname, regionSize
);
splits.add(split);
}
}
}
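The basic idea of this code is to fetch all regions together with their start and end keys. We also have a list of scans. Every scan is checked against every region (scans × regions) to produce all the splits. However, this code is very slow, mainly because we have about 10,000 regions, so looking up and computing the information for each region one at a time takes a long time.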
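The range check inside calculateSplits can be looked at in isolation. The following is a minimal sketch with no HBase dependency: `SplitRangeSketch`, `cmp`, `overlaps` and `clip` are hypothetical names, and `Arrays.compareUnsigned` (Java 9+) is used as a stand-in for `Bytes.compareTo`, on the assumption that both compare byte arrays lexicographically as unsigned values. An empty array means "unbounded", as in the code above.

```java
import java.util.Arrays;
import java.util.Optional;

final class SplitRangeSketch {
    // Stand-in for Bytes.compareTo: unsigned lexicographic comparison.
    static int cmp(byte[] a, byte[] b) {
        return Arrays.compareUnsigned(a, b);
    }

    // True when the scan range [startRow, stopRow) overlaps the region
    // [startKey, endKey). An empty array means the bound is open.
    static boolean overlaps(byte[] startKey, byte[] endKey,
                            byte[] startRow, byte[] stopRow) {
        return (startRow.length == 0 || endKey.length == 0 || cmp(startRow, endKey) < 0)
            && (stopRow.length == 0 || cmp(stopRow, startKey) > 0);
    }

    // Returns {splitStart, splitStop}: the scan range clipped to the region,
    // or an empty Optional when the two ranges do not overlap.
    static Optional<byte[][]> clip(byte[] startKey, byte[] endKey,
                                   byte[] startRow, byte[] stopRow) {
        if (!overlaps(startKey, endKey, startRow, stopRow)) {
            return Optional.empty();
        }
        // The later of the two start bounds wins.
        byte[] splitStart =
            startRow.length == 0 || cmp(startKey, startRow) >= 0 ? startKey : startRow;
        // The earlier of the two stop bounds wins; an empty endKey is open-ended.
        byte[] splitStop =
            (stopRow.length == 0 || cmp(endKey, stopRow) <= 0) && endKey.length > 0
                ? endKey : stopRow;
        return Optional.of(new byte[][] {splitStart, splitStop});
    }
}
```

For example, a region covering "b" to "m" and a scan from "a" to "f" yield a split from "b" to "f", while a scan from "x" to "z" yields no split for that region.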
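I noticed that RegionLocator also has a method called getAllRegionLocations(). I think I could use it to fetch all regions at once and save a lot of time. The problem is that if I use this method, I cannot get the corresponding start and end keys, so I cannot determine the range of each split. Is there a better way to make this faster?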
Answer 0 (score: 0)
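Solved! I found that the start key and end key can be read from the regionInfo. So first fetch the list of all region locations, iterate over every regionLocation in that list, and the second method becomes: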
private void calculateSplits(
HRegionLocation hRegionLocation,
RegionLocator regionLocator,
RegionSizeCalculator sizeCalculator,
List<InputSplit> splits
) throws IOException {
String regionHostname = hRegionLocation.getHostname();
HRegionInfo regionInfo = hRegionLocation.getRegionInfo();
final byte[] startKey = regionInfo.getStartKey();
final byte[] endKey = regionInfo.getEndKey();
...
}
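The caller side is not shown in the answer. As a rough illustration of the shape it describes (one bulk lookup, then a loop that reads the keys from each location), here is a minimal sketch that runs without HBase. `RegionLoc` and `CountingLocator` are hypothetical stand-ins for HRegionLocation/HRegionInfo and RegionLocator; they are not HBase classes, and the lookup counter exists only to make the single round trip visible.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class BulkRegionLookupSketch {
    /** Hypothetical stand-in for HRegionLocation plus its HRegionInfo. */
    static final class RegionLoc {
        final String hostname;
        final byte[] startKey;
        final byte[] endKey;

        RegionLoc(String hostname, String startKey, String endKey) {
            this.hostname = hostname;
            this.startKey = startKey.getBytes(StandardCharsets.UTF_8);
            this.endKey = endKey.getBytes(StandardCharsets.UTF_8);
        }
    }

    /** Hypothetical stand-in for RegionLocator that counts lookup calls. */
    static final class CountingLocator {
        private final List<RegionLoc> regions;
        int lookups = 0;

        CountingLocator(List<RegionLoc> regions) {
            this.regions = regions;
        }

        // Plays the role of getAllRegionLocations(): one call returns every region.
        List<RegionLoc> getAllRegionLocations() {
            lookups++;
            return regions;
        }
    }

    /** Walks all regions after a single bulk lookup; keys come from each location. */
    static List<String> regionRanges(CountingLocator locator) {
        List<String> ranges = new ArrayList<>();
        for (RegionLoc loc : locator.getAllRegionLocations()) {
            String start = new String(loc.startKey, StandardCharsets.UTF_8);
            String end = new String(loc.endKey, StandardCharsets.UTF_8);
            // In the real getSplits(), calculateSplits(loc, ...) would be called here.
            ranges.add(loc.hostname + " [" + start + ", " + end + ")");
        }
        return ranges;
    }

    public static void main(String[] args) {
        CountingLocator locator = new CountingLocator(Arrays.asList(
            new RegionLoc("host1", "", "m"),
            new RegionLoc("host2", "m", "")));
        System.out.println(regionRanges(locator));
        System.out.println("lookups = " + locator.lookups);
    }
}
```

The point of the design is that the per-region `getRegionLocation(startKey, false)` call disappears from the loop: each location already carries its own start and end keys, so no separate `getStartEndKeys()` result has to be matched up with it.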