AWS EMR HBase Bulk Load

Date: 2018-09-04 09:04:29

Tags: amazon-web-services hbase amazon-emr bulk-load

I developed a MapReduce program that performs HBase bulk loading using the technique described in the Cloudera article https://blog.cloudera.com/blog/2013/09/how-to-use-hbase-bulk-loading-and-why/.

On our previous on-premises Cloudera Hadoop cluster it worked fine. Now we are migrating to AWS, and I cannot get this program to work on an AWS EMR cluster.

EMR details:

  • Release label: emr-5.16.0
  • Hadoop distribution: Amazon 2.8.4
  • Applications: Spark 2.3.1, HBase 1.4.4
  • Master: m4.4xlarge
  • Nodes: 12 x m4.4xlarge

Here is the code of my driver:

        Job job = Job.getInstance(getConf());
        job.setJobName("My job");
        job.setJarByClass(getClass());

        // Input
        FileInputFormat.setInputPaths(job, input);

        // Mapper
        job.setMapperClass(MyMapper.class);
        job.setInputFormatClass(ExampleInputFormat.class);
        job.setMapOutputKeyClass(ImmutableBytesWritable.class);
        job.setMapOutputValueClass(Put.class);

        // Reducer : Auto configure partitioner and reducer
        Table table = HBaseCnx.getConnection().getTable(TABLE_NAME);
        RegionLocator regionLocator = HBaseCnx.getConnection().getRegionLocator(TABLE_NAME);
        HFileOutputFormat2.configureIncrementalLoad(job, table, regionLocator);

        // Output
        Path out = new Path(output);
        FileOutputFormat.setOutputPath(job, out);

        // Launch the MR job
        logger.debug("Start - Map Reduce job to produce HFiles");
        boolean b = job.waitForCompletion(true);
        if (!b) throw new RuntimeException("FAIL - Produce HFiles for HBase bulk load");
        logger.debug("End - Map Reduce job to produce HFiles");

        // Make the output HFiles usable by HBase (permissions)
        logger.debug("Start - Set the permissions for HBase in the output dir " + out.toString());
        //fs.setPermission(outputPath, new FsPermission(ALL, ALL, ALL)); => not recursive
        FsShell shell = new FsShell(getConf());
        shell.run(new String[]{"-chmod", "-R", "777", out.toString()});
        logger.debug("End - Set the permissions for HBase in the output dir " + out.toString());

        // Run complete bulk load
        logger.debug("Start - HBase Complete Bulk Load");
        LoadIncrementalHFiles loadIncrementalHFiles = new LoadIncrementalHFiles(getConf());
        int loadIncrementalHFilesOutput = loadIncrementalHFiles.run(new String[]{out.toString(), TABLE_NAME.toString()});
        if (loadIncrementalHFilesOutput != 0) {
            throw new RuntimeException("Problem in LoadIncrementalHFiles. Return code is " + loadIncrementalHFiles);
        }
        logger.debug("End - HBase Complete Bulk Load");

My mapper reads Parquet files and, for each record, emits (see the sketch after this list):

  • the key, which is the row key of the Put, as an ImmutableBytesWritable
  • the value, which is the HBase Put itself
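
A minimal sketch of roughly what that mapper looks like. The field names "id" and "value" and the column family "cf" are placeholders for illustration, not the real schema, and it assumes ExampleInputFormat is Parquet's example input format (org.apache.parquet packages), which delivers each record as a Group with a Void key:

    import java.io.IOException;

    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
    import org.apache.hadoop.hbase.util.Bytes;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.parquet.example.data.Group;

    public class MyMapper extends Mapper<Void, Group, ImmutableBytesWritable, Put> {
        private static final byte[] CF = Bytes.toBytes("cf"); // placeholder column family

        @Override
        protected void map(Void key, Group record, Context context)
                throws IOException, InterruptedException {
            // Placeholder fields: "id" becomes the row key, "value" is written to cf:value
            byte[] rowKey = Bytes.toBytes(record.getString("id", 0));
            Put put = new Put(rowKey);
            put.addColumn(CF, Bytes.toBytes("value"), Bytes.toBytes(record.getString("value", 0)));
            context.write(new ImmutableBytesWritable(rowKey), put);
        }
    }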

The problem happens in the reduce step. In the syslog of each reducer, I see errors that seem related to socket connections. Here is one of those syslogs:

2018-09-04 08:21:39,085 INFO [main-SendThread(localhost:2181)] org.apache.zookeeper.ClientCnxn: Opening socket connection to server localhost/127.0.0.1:2181. Will not attempt to authenticate using SASL (unknown error)
2018-09-04 08:21:39,086 WARN [main-SendThread(localhost:2181)] org.apache.zookeeper.ClientCnxn: Session 0x0 for server null, unexpected error, closing socket connection and attempting reconnect
java.net.ConnectException: Connection refused
2018-09-04 08:21:55,705 ERROR [main] org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: ZooKeeper exists failed after 4 attempts
2018-09-04 08:21:55,705 WARN [main] org.apache.hadoop.hbase.zookeeper.ZKUtil: hconnection-0x3ecedf210x0, quorum=localhost:2181, baseZNode=/hbase Unable to set watcher on znode (/hbase/hbaseid)
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase/hbaseid
2018-09-04 08:21:55,706 ERROR [main] org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher: hconnection-0x3ecedf210x0, quorum=localhost:2181, baseZNode=/hbase Received unexpected KeeperException, re-throwing exception
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase/hbaseid
2018-09-04 08:21:55,706 WARN [main] org.apache.hadoop.hbase.client.ZooKeeperRegistry: Can't retrieve clusterId from Zookeeper

After a few Google searches, I found several posts advising to set the quorum IP directly in the Java code. I did that as well, but it did not work. Here is how I currently get the HBase connection:

    Configuration conf = HBaseConfiguration.create();
    conf.addResource(new Path("/etc/hbase/conf/hbase-site.xml"));

    // Attempts to set the quorum IP directly in the Java code; this did not work
    //conf.clear();
    //conf.set("hbase.zookeeper.quorum", "...ip...");
    //conf.set("hbase.zookeeper.property.clientPort", "2181");

    Connection cnx = ConnectionFactory.createConnection(conf);

What I don't understand is that everything else works. I can programmatically create tables and query them (scan or get). I can even insert data with an MR job configured via TableMapReduceUtil.initTableReducerJob("my_table", IdentityTableReducer.class, job);. But of course that is much slower than the HBase complete bulk load technique, which writes HFiles directly, split according to the existing regions.
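
For comparison, that slower direct-write path is wired roughly like this (a sketch; it replaces the HFileOutputFormat2.configureIncrementalLoad(...) and LoadIncrementalHFiles steps in the driver and reuses the same mapper output types):

    import org.apache.hadoop.hbase.mapreduce.IdentityTableReducer;
    import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;

    // Sends each Put emitted by the mapper straight to the table through the region servers
    TableMapReduceUtil.initTableReducerJob(
            "my_table",                  // output table
            IdentityTableReducer.class,  // pass-through reducer
            job);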

Thanks for your help.

1 Answer:

Answer 0 (score: 0):

I have been working through a similar migration. The problem is that the reducer runs in a separate process, so you need to set the quorum on the job's configuration. That makes the value available to the reducer.

job.getConfiguration().set("hbase.zookeeper.quorum", "...ip...");
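
In the driver above, that would go anywhere before the job is submitted, for example right after Job.getInstance(...). A sketch ("...ip..." stands for the ZooKeeper quorum host on the EMR master; the clientPort line is only needed if the port differs from the default 2181):

    Job job = Job.getInstance(getConf());
    // Make the ZooKeeper quorum visible to the map/reduce task processes
    job.getConfiguration().set("hbase.zookeeper.quorum", "...ip...");
    job.getConfiguration().set("hbase.zookeeper.property.clientPort", "2181");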