Getting an exception when performing an HBase scan

Time: 2018-05-10 10:51:48

Tags: apache-spark hadoop hbase apache-zookeeper

I am trying the hbase spark distributed scan example.

My simple code is shown below; when I run it, I get an exception:

import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.spark.JavaHBaseContext;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;

import scala.Tuple2;

public class DistributedHBaseScanToRddDemo {

    public static void main(String[] args) {
        JavaSparkContext jsc = getJavaSparkContext("hbasetable1");
        Configuration hbaseConf = getHbaseConf(0, "", "");
        JavaHBaseContext javaHbaseContext = new JavaHBaseContext(jsc, hbaseConf);

        // Full-table scan, fetching 100 rows per RPC.
        Scan scan = new Scan();
        scan.setCaching(100);

        JavaRDD<Tuple2<ImmutableBytesWritable, Result>> javaRdd =
                javaHbaseContext.hbaseRDD(TableName.valueOf("hbasetable1"), scan);

        List<String> results = javaRdd.map(new ScanConvertFunction()).collect();
        System.out.println("Result Size: " + results.size());
    }

    public static Configuration getHbaseConf(int pTimeout, String pQuorumIP, String pClientPort)
    {
        // The parameters are currently unused; the connection settings are hard-coded below.
        Configuration hbaseConf = HBaseConfiguration.create();
        hbaseConf.setInt("timeout", 120000);
        hbaseConf.set("hbase.zookeeper.quorum", "10.56.36.14");
        hbaseConf.set("hbase.zookeeper.property.clientPort", "2181");
        return hbaseConf;
    }

    public static JavaSparkContext getJavaSparkContext(String pTableName)
    {
        SparkConf sparkConf = new SparkConf().setAppName("JavaHBaseBulkPut" + pTableName);
        sparkConf.setMaster("local");
        sparkConf.set("spark.testing.memory", "471859200");
        JavaSparkContext jsc = new JavaSparkContext(sparkConf);

        return jsc;
    }

    // Maps each (row key, Result) pair to the row key as a String.
    private static class ScanConvertFunction implements Function<Tuple2<ImmutableBytesWritable, Result>, String> {
        public String call(Tuple2<ImmutableBytesWritable, Result> v1) throws Exception {
            return Bytes.toString(v1._1().copyBytes());
        }
    }
}

I also tried the bulk get and put examples, and they run fine, so I am guessing something is going wrong with the bulk scan example.
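(For reference, the bulk put I tested looked roughly like the sketch below; the column family, qualifier and values are just placeholders, and it reuses the jsc and javaHbaseContext from the code above plus imports for Put and Arrays.)

// Rough sketch of the bulk put that works for me (placeholder family/qualifier/value names).
// Assumes the jsc and javaHbaseContext created in main(), plus
// org.apache.hadoop.hbase.client.Put and java.util.Arrays imports.
JavaRDD<String> rowKeys = jsc.parallelize(Arrays.asList("row1", "row2", "row3"));

javaHbaseContext.bulkPut(rowKeys,
        TableName.valueOf("hbasetable1"),
        new Function<String, Put>() {
            public Put call(String rowKey) throws Exception {
                Put put = new Put(Bytes.toBytes(rowKey));
                put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("col1"), Bytes.toBytes("value1"));
                return put;
            }
        });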

1 Answer:

Answer 0 (score: 1):

This Cloudera hbase-spark connector seems to work:

https://mvnrepository.com/artifact/org.apache.hbase/hbase-spark?repo=cloudera

So, add something like this to your pom.xml:

  <repositories>
    <repository>
      <id>cloudera</id>
      <name>cloudera</name>
      <url>https://repository.cloudera.com/content/repositories/releases/</url>
    </repository>
  </repositories>

And the dependency:

  <dependency>
    <groupId>org.apache.hbase</groupId>
    <artifactId>hbase-spark</artifactId>
    <version>${hbase-spark.version}</version>
  </dependency>

One thing I have noticed is that this functionality does not seem to reuse the HBase connection well, and tries to re-establish it for every partition. See my question and the related discussion here:

HBase-Spark Connector: connection to HBase established for every scan?

For that reason I have actually been avoiding this functionality, but I would be curious to hear about your experience with it.
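If the per-partition connection overhead turns out to matter for your workload, one common workaround (a minimal sketch, not necessarily the route I settled on) is to bypass the connector and run the scan through the standard TableInputFormat via newAPIHadoopRDD; it assumes the hbaseConf and jsc from your code, plus imports for org.apache.hadoop.hbase.mapreduce.TableInputFormat and org.apache.spark.api.java.JavaPairRDD:

// Minimal sketch: distributed scan via TableInputFormat instead of JavaHBaseContext.
// Reuses the hbaseConf and jsc from the question's code; TableInputFormat ships with
// the hbase-server / hbase-mapreduce artifact.
hbaseConf.set(TableInputFormat.INPUT_TABLE, "hbasetable1");

JavaPairRDD<ImmutableBytesWritable, Result> hbaseRdd =
        jsc.newAPIHadoopRDD(hbaseConf,
                TableInputFormat.class,
                ImmutableBytesWritable.class,
                Result.class);

System.out.println("Row count: " + hbaseRdd.count());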