Is there any API for deleting specific HBase cells using Spark Scala? We can read and write with the Spark-HBase Connector. Any suggestions on deleting cells would be greatly appreciated.
Answer 0 (score: 1)
Here is an implementation for deleting HBase Cell objects with Spark (demonstrated with parallelize; you can adapt it to your RDD of Cells).

The general idea is to delete in chunks: iterate over each RDD partition, split the partition into chunks of 100,000 cells, convert each Cell into an HBase Delete object, and then call table.delete() to perform the deletion in HBase.
// Imports assumed by the snippet (Guava, HBase client, Spark Java API);
// "logger" is assumed to be a field of the enclosing class.
import java.io.IOException;
import java.util.List;

import com.google.common.collect.Iterators;
import com.google.common.collect.Lists;
import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Delete;
import org.apache.hadoop.hbase.client.Table;
import org.apache.spark.api.java.JavaSparkContext;

public void deleteCells(List<Cell> cellsToDelete) {
    JavaSparkContext sc = new JavaSparkContext();

    sc.parallelize(cellsToDelete)
      .foreachPartition(cellsIterator -> {
          int chunkSize = 100000; // Will contact HBase only once per 100,000 records

          org.apache.hadoop.conf.Configuration config = new org.apache.hadoop.conf.Configuration();
          config.set("hbase.zookeeper.quorum", "YOUR-ZOOKEEPER-HOSTNAME");

          // Open one connection per partition, not per cell
          Connection connection;
          Table table;
          try {
              connection = ConnectionFactory.createConnection(config);
              table = connection.getTable(TableName.valueOf("YOUR-HBASE-TABLE"));
          } catch (IOException e) {
              logger.error("Failed to connect to HBase due to inner exception: " + e);
              return;
          }

          // Split the given cells iterator into chunks and issue one batched delete per chunk
          Iterators.partition(cellsIterator, chunkSize)
                   .forEachRemaining(cellsChunk -> {
                       List<Delete> deletions = Lists.newArrayList(cellsChunk
                               .stream()
                               .map(cell -> new Delete(cell.getRowArray(), cell.getRowOffset(), cell.getRowLength())
                                       .addColumn(cell.getFamily(), cell.getQualifier(), System.currentTimeMillis()))
                               .iterator());
                       try {
                           table.delete(deletions);
                       } catch (IOException e) {
                           logger.error("Failed to delete a chunk due to inner exception: " + e);
                       }
                   });

          // Release HBase resources once the partition has been processed
          try {
              table.close();
              connection.close();
          } catch (IOException e) {
              logger.error("Failed to close the HBase connection: " + e);
          }
      });
}
Disclaimer: this exact snippet has not been tested, but I have used the same approach to delete billions of HBase Cells with Spark.
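
Since the question asks for Spark Scala, here is a minimal, untested sketch of the same chunked-delete approach in Scala. It assumes you already have an RDD[Cell] (for example, produced by a Spark-HBase Connector scan); the ZooKeeper quorum and table name are placeholders.

import org.apache.hadoop.hbase.{Cell, CellUtil, HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Delete}
import org.apache.spark.rdd.RDD

def deleteCells(cells: RDD[Cell]): Unit = {
  cells.foreachPartition { partition =>
    // One HBase connection per partition
    val config = HBaseConfiguration.create()
    config.set("hbase.zookeeper.quorum", "YOUR-ZOOKEEPER-HOSTNAME")
    val connection = ConnectionFactory.createConnection(config)
    val table = connection.getTable(TableName.valueOf("YOUR-HBASE-TABLE"))
    try {
      // Batch the deletes: contact HBase once per chunk of 100,000 cells
      partition.grouped(100000).foreach { chunk =>
        val deletions = new java.util.ArrayList[Delete](chunk.size)
        chunk.foreach { cell =>
          deletions.add(
            new Delete(cell.getRowArray, cell.getRowOffset, cell.getRowLength)
              // mirrors the answer's choice of the current time as the delete timestamp
              .addColumn(CellUtil.cloneFamily(cell), CellUtil.cloneQualifier(cell), System.currentTimeMillis()))
        }
        table.delete(deletions)
      }
    } finally {
      table.close()
      connection.close()
    }
  }
}

As in the Java version, creating the connection inside foreachPartition means each executor talks to HBase directly, and batching the Delete objects keeps the number of round trips to HBase low.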