Question

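I am considering the following scenario. I receive a data file every day and load it into an HBase table named in the format file-yyyyMMdd. So over time I end up with many tables, for example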

tempdb-20121220
tempdb-20121221
tempdb-20121222
tempdb-20121223
tempdb-20121224
tempdb-20121225

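What I want to do now is, for a given date range, get the list of tables that fall within that range so that I can build an index over them. I am using hbase-0.90.6.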
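Picking the tables for a date range can be done on the client before any job is configured. The following is a minimal sketch, assuming the table names follow the tempdb-yyyyMMdd pattern from the listing above and that the full list of names has already been fetched from HBase. The class and method names (TableRangeFilter, tablesInRange) are hypothetical, and java.time needs Java 8 or later.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: keeps only the tables whose date suffix lies in [from, to].
final class TableRangeFilter {
    // Assumed prefix, taken from the example table names in the question.
    private static final String PREFIX = "tempdb-";

    static List<String> tablesInRange(List<String> tableNames, LocalDate from, LocalDate to) {
        List<String> matched = new ArrayList<>();
        for (String name : tableNames) {
            if (!name.startsWith(PREFIX)) {
                continue; // not one of the daily tables
            }
            LocalDate day;
            try {
                // BASIC_ISO_DATE parses the yyyyMMdd form, e.g. 20121220.
                day = LocalDate.parse(name.substring(PREFIX.length()),
                        DateTimeFormatter.BASIC_ISO_DATE);
            } catch (DateTimeParseException e) {
                continue; // suffix is not a valid date, skip the table
            }
            if (!day.isBefore(from) && !day.isAfter(to)) {
                matched.add(name);
            }
        }
        return matched;
    }

    public static void main(String[] args) {
        List<String> all = Arrays.asList(
                "tempdb-20121220", "tempdb-20121221", "tempdb-20121222",
                "tempdb-20121223", "tempdb-20121224", "tempdb-20121225");
        // prints [tempdb-20121222, tempdb-20121223, tempdb-20121224]
        System.out.println(tablesInRange(all,
                LocalDate.of(2012, 12, 22), LocalDate.of(2012, 12, 24)));
    }
}
```

The resulting list is what would then be fed, one table at a time or all together, into whatever job setup is used.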

As far as my research goes, TableMapReduceUtil.initTableMapperJob accepts only one tableName.

TableMapReduceUtil.initTableMapperJob(
tableName,        // input HBase table name
scan,             // Scan instance to control CF and attribute selection
HBaseIndexerMapper.class,   // mapper
null,             // mapper output key
null,             // mapper output value
job
);

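I have been able to get the list of tables and run the job in a loop, but the idea is to iterate over all the tables and scan them (or something similar) so that I end up with a merged/combined result for indexing.

Any pointers on how to achieve this would be very helpful.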

Answer 1

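OK, take a look at the HBase 0.94.6 source (it appears to be the closest to your version). There you will find the MultiTableInputFormat class (see its JavaDoc, which includes an example), and it does what you need. Just a few days ago I backported this class into a project on HBase 0.94.2 (actually CDH 4.2.1), and it worked.

This seems to match your needs exactly and works well. The only catch is that a single mapper class processes the data from all the tables. To tell the tables apart you will probably need to take the TableSplit class from 0.94.6, rename it, and port it in a way that doesn't break your environment. Also check the differences in TableMapReduceUtil: you will need to configure the scans by hand so that the input format can understand their configuration.

Also consider simply moving to HBase 0.94.6. That is the simpler route, though it was not one I could take. It took me about 12 working hours to understand the issues, investigate solutions, work out my problems with CDH 4.2.1, and port everything. The good news for me is that Cloudera intends to move to 0.94.6 in CDH 4.3.0.

UPDATE1: CDH 4.3.0 is available and includes HBase 0.94.6 along with all the required infrastructure.

UPDATE2: I moved to a different solution: a custom input format that combines several HBase tables, interleaving their rows by key. It turned out to be very useful, especially with a well-designed key. You can get the whole aggregation in a single mapper. I am considering publishing this code on GitHub.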
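The idea in UPDATE2, reading several tables as one stream ordered by row key, amounts to a k-way merge. The sketch below is not the answerer's code and uses no HBase classes. Each table is stood in for by an in-memory list of row keys that is already sorted, as a scan would return them, and the names MultiTableMerge, Row and mergeByKey are hypothetical. It also assumes ASCII keys, so that String ordering agrees with HBase's byte-wise ordering.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

// Hypothetical illustration of merging rows from several sorted tables by key.
final class MultiTableMerge {

    // A merged row remembers which table it came from.
    static final class Row {
        final String table;
        final String key;
        Row(String table, String key) { this.table = table; this.key = key; }
        @Override public String toString() { return table + "/" + key; }
    }

    // Tracks the current position inside one table's sorted key stream.
    private static final class Cursor {
        final String table;
        final Iterator<String> keys;
        String current;
        Cursor(String table, Iterator<String> keys) { this.table = table; this.keys = keys; }
        boolean advance() {
            if (!keys.hasNext()) return false;
            current = keys.next();
            return true;
        }
    }

    static List<Row> mergeByKey(Map<String, List<String>> sortedKeysByTable) {
        // Smallest key first; ties are broken by table name for a stable order.
        PriorityQueue<Cursor> heap = new PriorityQueue<>(
                Comparator.comparing((Cursor c) -> c.current).thenComparing(c -> c.table));
        for (Map.Entry<String, List<String>> e : sortedKeysByTable.entrySet()) {
            Cursor c = new Cursor(e.getKey(), e.getValue().iterator());
            if (c.advance()) heap.add(c);
        }
        List<Row> merged = new ArrayList<>();
        while (!heap.isEmpty()) {
            Cursor c = heap.poll();
            merged.add(new Row(c.table, c.current));
            if (c.advance()) heap.add(c); // re-insert with its next key
        }
        return merged;
    }

    public static void main(String[] args) {
        Map<String, List<String>> tables = new LinkedHashMap<>();
        tables.put("tempdb-20121220", Arrays.asList("a", "c"));
        tables.put("tempdb-20121221", Arrays.asList("a", "b"));
        // prints [tempdb-20121220/a, tempdb-20121221/a, tempdb-20121221/b, tempdb-20121220/c]
        System.out.println(mergeByKey(tables));
    }
}
```

Rows that share a key end up adjacent regardless of their source table, which is what would let a single mapper aggregate them in one pass.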

Answer 2

A List<Scan> is another way to do it. I agree about MultiTableInputFormat:

import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

 public class TestMultiScan extends Configured implements Tool {

    @Override
    public int run(String[] args) throws Exception {
        // One Scan per input table; the table name travels as a Scan attribute.
        List<Scan> scans = new ArrayList<Scan>();

        Scan scan1 = new Scan();
        scan1.setAttribute("scan.attributes.table.name", Bytes.toBytes("table1ddmmyyyy"));
        // Decode the attribute; printing the raw byte[] would only show its reference.
        System.out.println(Bytes.toString(scan1.getAttribute("scan.attributes.table.name")));
        scans.add(scan1);

        Scan scan2 = new Scan();
        scan2.setAttribute("scan.attributes.table.name", Bytes.toBytes("table2ddmmyyyy"));
        System.out.println(Bytes.toString(scan2.getAttribute("scan.attributes.table.name")));
        scans.add(scan2);

        // Use the configuration handed in by ToolRunner.
        Job job = new Job(getConf());
        job.setJarByClass(TestMultiScan.class);

        // MultiTableMappter and MultiTableReducer are the poster's own classes (not shown here).
        TableMapReduceUtil.initTableMapperJob(
                scans,
                MultiTableMappter.class,
                Text.class,
                IntWritable.class,
                job);
        // "xxxxx" is a placeholder for the output table name.
        TableMapReduceUtil.initTableReducerJob(
                "xxxxx",
                MultiTableReducer.class,
                job);
        // Report job failure through the exit code instead of always returning 0.
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        // ToolRunner parses generic Hadoop options and passes the configuration to run().
        System.exit(ToolRunner.run(HBaseConfiguration.create(), new TestMultiScan(), args));
    }
 }

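This is how we handled our multi-tenancy requirement with HBase namespace tables. For example: DEV1:TABLEX (data ingested by DEV1) and UAT1:TABLEX (data consumed in UAT1). We wanted to compare the two namespace tables before proceeding further.

Internally it uses MultiTableInputFormat, as can be seen in TableMapReduceUtil.java.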
