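I have a billion rows in HBase and want to scan a million rows at a time. What is the best optimization technique to make the scan as fast as possible?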
Answer 0 (score: 1)
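We had a similar problem: we needed to look up millions of rows by key, and we used MapReduce to do it. There is no standard solution for this, so we wrote a custom input format that extends InputFormat<ImmutableBytesWritable, Result>. Here is a short description of how we did it.
First, you need to create the splits so that each key goes to the machine hosting the region that contains it: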
public List<InputSplit> getSplits(JobContext context) throws IOException {
    context.getConfiguration();
    // read the row keys to look up
    byte[][] filterKeys = readFilterKeys(context);
    if (table == null) {
        throw new IOException("No table was provided.");
    }
    Pair<byte[][], byte[][]> keys = table.getStartEndKeys();
    if (keys == null || keys.getFirst() == null || keys.getFirst().length == 0) {
        throw new IOException("Expecting at least one region.");
    }
    List<InputSplit> splits = new ArrayList<InputSplit>(keys.getFirst().length);
    for (int i = 0; i < keys.getFirst().length; i++) {
        // collect the filter keys that belong to the current region;
        // they must lie between the region's start and end keys
        byte[][] regionKeys =
            getRegionKeys(keys.getFirst()[i], keys.getSecond()[i], filterKeys);
        if (regionKeys == null) {
            // no requested key falls into this region, so no split is needed
            continue;
        }
        String regionLocation = table.getRegionLocation(keys.getFirst()[i]).
            getServerAddress().getHostname();
        // create one split per region, located on the region's host
        InputSplit split = new MultiplyValueSplit(table.getTableName(),
            regionKeys, regionLocation);
        splits.add(split);
    }
    return splits;
}
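The helper getRegionKeys is called above but its body is not shown. A minimal sketch of what it might look like follows, assuming it selects the filter keys inside the half-open range [startKey, endKey) and returns null when none match. The class name RegionKeyFilter is hypothetical, and the treatment of an empty start or end key as unbounded is an assumption based on how HBase reports the first and last regions. It uses the JDK's Arrays.compareUnsigned (Java 9+) for unsigned lexicographic comparison rather than any HBase utility.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper class; the original answer does not show this code.
final class RegionKeyFilter {

    /**
     * Returns the filter keys that fall inside [startKey, endKey),
     * or null if no key belongs to the region.
     * Assumption: an empty start or end key means "unbounded".
     */
    static byte[][] getRegionKeys(byte[] startKey, byte[] endKey, byte[][] filterKeys) {
        List<byte[]> matched = new ArrayList<>();
        for (byte[] key : filterKeys) {
            // unsigned lexicographic comparison of the raw row-key bytes
            boolean afterStart =
                startKey.length == 0 || Arrays.compareUnsigned(key, startKey) >= 0;
            boolean beforeEnd =
                endKey.length == 0 || Arrays.compareUnsigned(key, endKey) < 0;
            if (afterStart && beforeEnd) {
                matched.add(key);
            }
        }
        return matched.isEmpty() ? null : matched.toArray(new byte[0][]);
    }
}
```

This version rescans every filter key for each region. With millions of keys and many regions, sorting the keys once and walking them alongside the region boundaries would likely be cheaper.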
The class 'MultiplyValueSplit' holds the keys, the table name, and the region location:
public class MultiplyValueSplit extends InputSplit
        implements Writable, Comparable<MultiplyValueSplit> {
    private byte[] tableName;
    private byte[][] keys;
    private String regionLocation;
}
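Because the split implements Writable, its three fields have to be serialized so the framework can ship the split to the task that processes it. The original answer does not show that code. The following is only a sketch of one possible encoding (length-prefixed byte arrays followed by the host name), written against the JDK's DataOutput and DataInput so that it runs without Hadoop on the classpath. The class name SplitFields is hypothetical. A real split would also need to implement getLength() and getLocations() from InputSplit.

```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

// Hypothetical stand-in for the serialization part of MultiplyValueSplit.
final class SplitFields {
    byte[] tableName = new byte[0];
    byte[][] keys = new byte[0][];
    String regionLocation = "";

    // Same shape as Writable.write(DataOutput): each array is length-prefixed.
    void write(DataOutput out) throws IOException {
        out.writeInt(tableName.length);
        out.write(tableName);
        out.writeInt(keys.length);
        for (byte[] key : keys) {
            out.writeInt(key.length);
            out.write(key);
        }
        out.writeUTF(regionLocation);
    }

    // Same shape as Writable.readFields(DataInput): reads fields in write order.
    void readFields(DataInput in) throws IOException {
        tableName = new byte[in.readInt()];
        in.readFully(tableName);
        keys = new byte[in.readInt()][];
        for (int i = 0; i < keys.length; i++) {
            keys[i] = new byte[in.readInt()];
            in.readFully(keys[i]);
        }
        regionLocation = in.readUTF();
    }
}
```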
In the input format class, the method createRecordReader creates a 'MultiplyValuesReader', which contains the logic for reading values from the table:
@Override
public RecordReader<ImmutableBytesWritable, Result> createRecordReader(
        InputSplit split, TaskAttemptContext context) throws IOException {
    HTable table = this.getHTable();
    if (table == null) {
        throw new IOException("Cannot create a record reader because of a" +
            " previous error. Please look at the previous logs lines from" +
            " the task's full log for more details.");
    }
    // hand the keys of this split and the table to the reader
    MultiplyValueSplit mSplit = (MultiplyValueSplit) split;
    MultiplyValuesReader mvr = new MultiplyValuesReader();
    mvr.setKeys(mSplit.getKeys());
    mvr.setHTable(table);
    mvr.init();
    return mvr;
}
The class 'MultiplyValuesReader' contains the logic for reading the data from the HTable:
public class MultiplyValuesReader
        extends RecordReader<ImmutableBytesWritable, Result> {
    .......
    @Override
    public ImmutableBytesWritable getCurrentKey() {
        return key;
    }
    @Override
    public Result getCurrentValue() throws IOException, InterruptedException {
        return value;
    }
    @Override
    public boolean nextKeyValue() throws IOException, InterruptedException {
        if (this.results == null) {
            return false;
        }
        while (this.results != null) {
            // current batch exhausted: fetch the next one (null when no keys are left)
            if (resultCurrentKey >= results.length) {
                this.results = getNextResults();
                continue;
            }
            if (key == null) key = new ImmutableBytesWritable();
            value = results[resultCurrentKey];
            resultCurrentKey++;
            // skip keys for which the table returned no row
            if (value != null && value.size() > 0) {
                key.set(value.getRow());
                return true;
            }
        }
        return false;
    }
    @Override
    public float getProgress() {
        // fraction of the split's keys that have been requested so far
        return (keys.length > 0 ? ((float) currentKey) / keys.length : 0.0f);
    }
    private Result[] getNextResults() throws IOException {
        // stop once every key has been requested
        if (currentKey >= keys.length) {
            return null;
        }
        // issue the Gets in batches for faster reads
        ArrayList<Get> batch = new ArrayList<Get>(BATCH_SIZE);
        for (int i = currentKey;
                i < Math.min(currentKey + BATCH_SIZE, keys.length); i++) {
            batch.add(new Get(keys[i]));
        }
        currentKey = Math.min(currentKey + BATCH_SIZE, keys.length);
        resultCurrentKey = 0;
        return htable.get(batch);
    }
}
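The batching in getNextResults is easy to get wrong at the boundaries: the reader should return null only once every key has been handed out (currentKey >= keys.length), and the last batch may be shorter than BATCH_SIZE. The sketch below isolates just that cursor logic without any HBase calls so it can be checked on its own. The class name BatchCursor and its methods are hypothetical and not part of the original code.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, HBase-free model of the reader's batch cursor.
final class BatchCursor {
    private final byte[][] keys;
    private final int batchSize;
    private int currentKey = 0;

    BatchCursor(byte[][] keys, int batchSize) {
        this.keys = keys;
        this.batchSize = batchSize;
    }

    /** Returns the next batch of keys, or null once all keys were handed out. */
    List<byte[]> nextBatch() {
        if (currentKey >= keys.length) {
            return null;
        }
        int end = Math.min(currentKey + batchSize, keys.length);
        List<byte[]> batch = new ArrayList<>(end - currentKey);
        for (int i = currentKey; i < end; i++) {
            batch.add(keys[i]);
        }
        currentKey = end;
        return batch;
    }

    /** Fraction of keys requested so far, in the range 0.0 to 1.0. */
    float progress() {
        return keys.length > 0 ? ((float) currentKey) / keys.length : 0.0f;
    }
}
```

In the real reader, each batch would be turned into a list of Get objects and passed to htable.get(batch), as in the code above.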
For more details, look at the source code of the classes TableInputFormat, TableInputFormatBase, TableSplit, and TableRecordReader.