How to get the row key when processing an HBase table with Spark

Date: 2015-03-02 02:11:38

Tags: java apache-spark hbase

I want to scan an HBase table; my code is below.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableInputFormat;
import org.apache.hadoop.hbase.protobuf.ProtobufUtil;
import org.apache.hadoop.hbase.protobuf.generated.ClientProtos;
import org.apache.hadoop.hbase.util.Base64;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.PairFunction;

import scala.Tuple2;

public void start() throws IOException {
    SparkConf conf = new SparkConf().setAppName("Simple Application");
    JavaSparkContext sc = new JavaSparkContext(conf);

    Configuration hbaseConf = HBaseConfiguration.create();

    // Scan rows "0001" (inclusive) to "0004" (exclusive), restricted to DATA:TIME
    Scan scan = new Scan();
    scan.setStartRow(Bytes.toBytes("0001"));
    scan.setStopRow(Bytes.toBytes("0004"));
    scan.addFamily(Bytes.toBytes("DATA"));
    scan.addColumn(Bytes.toBytes("DATA"), Bytes.toBytes("TIME"));

    // Serialize the Scan so TableInputFormat can read it from the configuration
    ClientProtos.Scan proto = ProtobufUtil.toScan(scan);
    String scanStr = Base64.encodeBytes(proto.toByteArray());

    String tableName = "rdga_by_id";
    hbaseConf.set(TableInputFormat.INPUT_TABLE, tableName);
    hbaseConf.set(TableInputFormat.SCAN, scanStr);

    JavaPairRDD<ImmutableBytesWritable, Result> hBaseRDD = sc.newAPIHadoopRDD(
            hbaseConf, TableInputFormat.class, ImmutableBytesWritable.class, Result.class);

    System.out.println("here: " + hBaseRDD.count());

    PairFunction<Tuple2<ImmutableBytesWritable, Result>, Integer, Integer> pairFunc =
            new PairFunction<Tuple2<ImmutableBytesWritable, Result>, Integer, Integer>() {
        @Override
        public Tuple2<Integer, Integer> call(Tuple2<ImmutableBytesWritable, Result> immutableBytesWritableResultTuple2) throws Exception {
            byte[] time = immutableBytesWritableResultTuple2._2().getValue(Bytes.toBytes("DATA"), Bytes.toBytes("TIME"));
            byte[] id = /* I want to get the row key here */;
            if (time != null && id != null) {
                return new Tuple2<Integer, Integer>(byteArrToInteger(id), byteArrToInteger(time));
            }
            else {
                return null;
            }
        }
    };
}

Now I want to get the row key of each Result. But I can only set the family and column in the Scan. How can I get the row key? Is there a function or method like result.getRowkey() that I can use with the JavaPairRDD? Or how should I set up the Scan so that the row key is kept in the Result?

Thanks in advance!

1 Answer:

Answer 0 (score: 1)

The result already contains your row key: it is the ImmutableBytesWritable half of each pair. You just need to convert it back to a String, for example:

String rowKey = new String(immutableBytesWritableResultTuple2._1.get());
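As a side note, the Result in the second half of the pair exposes the same bytes via Result.getRow(), so the call() body from the question can be completed either way. A minimal sketch, assuming the DATA:TIME column and the byteArrToInteger helper from the question:

@Override
public Tuple2<Integer, Integer> call(Tuple2<ImmutableBytesWritable, Result> tuple) throws Exception {
    byte[] time = tuple._2().getValue(Bytes.toBytes("DATA"), Bytes.toBytes("TIME"));
    byte[] id = tuple._1().copyBytes();    // row key from the key half of the pair
    // equivalent alternative: byte[] id = tuple._2().getRow();
    if (time != null && id != null) {
        return new Tuple2<Integer, Integer>(byteArrToInteger(id), byteArrToInteger(time));
    }
    return null;
}

copyBytes() is preferable to get() here because get() returns the raw backing array, which can be larger than the row key itself; copyBytes() honors the writable's offset and length.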

I am not sure which version of Spark you are using. In spark-core_2.10 version 1.2.0, the newAPIHadoopRDD method does not return a JavaPairRDD; the call produces code like this:

RDD<Tuple2<ImmutableBytesWritable, Result>> hBaseRDD = sc.newAPIHadoopRDD(hbaseConf,TableInputFormat.class, ImmutableBytesWritable.class, Result.class);

However, hBaseRDD then provides a function to convert it to a JavaRDD when necessary:

hBaseRDD.toJavaRDD();

You can then use the .mapToPair method with the function you defined.
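
Putting the pieces together, a minimal sketch of the wiring (pairFunc is the function defined in the question; the null tuples it may return are filtered out before use, with org.apache.spark.api.java.function.Function):

JavaPairRDD<Integer, Integer> idTimePairs = hBaseRDD
        .toJavaRDD()
        .mapToPair(pairFunc)
        .filter(new Function<Tuple2<Integer, Integer>, Boolean>() {
            @Override
            public Boolean call(Tuple2<Integer, Integer> t) {
                return t != null;   // drop rows where TIME or the row key was missing
            }
        });
System.out.println("pairs: " + idTimePairs.count());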