Filtering HBase with Spark to get a sample dataset

Time: 2016-09-05 14:21:05

Tags: scala apache-spark hbase

I have a large HBase dataset, and I want to retrieve the rows matching a specific condition to use as a sample set for debugging.

Here is the Spark RDD and the filter:

import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.hadoop.hbase.util.Bytes

val conf = HBaseConfiguration.create()
conf.set("hbase.zookeeper.quorum", "172.16.1.10,172.16.1.11,172.16.1.12")
conf.setInt("timeout", 120000)
conf.set(TableInputFormat.INPUT_TABLE, "dateset")
val hbaseRDD = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat],
  classOf[ImmutableBytesWritable], classOf[Result])
val filteredRDD = hbaseRDD.filter { case (_, result) =>
  // getValue takes byte arrays for family and qualifier, not Strings
  val hostId = Bytes.toString(result.getValue(Bytes.toBytes("user"), Bytes.toBytes("id")))
  hostId == "12345" // <-- only retrieve the row when user:id is 12345
}
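As an aside: for a large table it can be cheaper to push the predicate down to the region servers instead of filtering client-side in Spark, so only matching rows are ever shipped. A minimal sketch, assuming HBase 1.x client APIs (SingleColumnValueFilter, and the Base64/ProtobufUtil scan serialization that TableInputFormat expects):

import org.apache.hadoop.hbase.client.Scan
import org.apache.hadoop.hbase.filter.{CompareFilter, SingleColumnValueFilter}
import org.apache.hadoop.hbase.protobuf.ProtobufUtil
import org.apache.hadoop.hbase.util.{Base64, Bytes}

// Evaluate user:id == "12345" on the region servers themselves
val scan = new Scan()
scan.setFilter(new SingleColumnValueFilter(
  Bytes.toBytes("user"), Bytes.toBytes("id"),
  CompareFilter.CompareOp.EQUAL, Bytes.toBytes("12345")))
// TableInputFormat reads the scan from the configuration as a
// Base64-encoded protobuf, so newAPIHadoopRDD will honor it
conf.set(TableInputFormat.SCAN,
  Base64.encodeBytes(ProtobufUtil.toScan(scan).toByteArray))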

Now that I have filteredRDD, I just want to save it to another table in the same format:

val conf = HBaseConfiguration.create()
conf.set("hbase.zookeeper.quorum", "172.16.1.10,172.16.1.11,172.16.1.12")
conf.setInt("timeout", 120000)
conf.set(TableOutputFormat.OUTPUT_TABLE, "data_sample")
// here I don't know which api to use

Can anyone give me some pointers? Thanks.

1 Answer:

Answer 0 (score: 0)

You can use the saveAsNewAPIHadoopDataset function on a pair RDD, as shown below:

JavaPairRDD<ImmutableBytesWritable, Put> pairs = filteredRDD.mapToPair(
    new PairFunction<Tuple2<ImmutableBytesWritable, Result>, ImmutableBytesWritable, Put>() {
        @Override
        public Tuple2<ImmutableBytesWritable, Put> call(Tuple2<ImmutableBytesWritable, Result> tuple) throws Exception {
            // prepare the HBase put here
            return new Tuple2<ImmutableBytesWritable, Put>(new ImmutableBytesWritable(), put);
        }
    });

Job newAPIJobConfiguration = Job.getInstance(hbaseConf);
newAPIJobConfiguration.getConfiguration().set(TableOutputFormat.OUTPUT_TABLE, tableName);
newAPIJobConfiguration.setOutputFormatClass(org.apache.hadoop.hbase.mapreduce.TableOutputFormat.class);

pairs.saveAsNewAPIHadoopDataset(newAPIJobConfiguration.getConfiguration());
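Since the question is in Scala, here is the same approach as a Scala sketch, reusing the filteredRDD and conf from the question and assuming an HBase 1.x client, where Put.add(Cell) copies an existing cell verbatim:

import org.apache.hadoop.hbase.client.Put
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableOutputFormat
import org.apache.hadoop.mapreduce.Job

// Copy every cell of each filtered Result into a Put, so the
// sample table keeps exactly the same layout as the source table
val puts = filteredRDD.map { case (_, result) =>
  val put = new Put(result.getRow)
  result.rawCells().foreach(cell => put.add(cell))
  (new ImmutableBytesWritable(result.getRow), put)
}

val job = Job.getInstance(conf) // conf already carries the ZooKeeper quorum
job.getConfiguration.set(TableOutputFormat.OUTPUT_TABLE, "data_sample")
job.setOutputFormatClass(classOf[TableOutputFormat[ImmutableBytesWritable]])

puts.saveAsNewAPIHadoopDataset(job.getConfiguration)

saveAsNewAPIHadoopDataset lives on PairRDDFunctions, so it is available on puts via Spark's implicit conversion for RDDs of pairs.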