Delete all data from an HBase table based on a time range?

Time: 2016-09-30 09:16:07

Tags: timestamp hbase hbase-shell

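I am trying to delete all data in an HBase table whose timestamp is older than a specified timestamp. This applies to all column families and rows.

Is there a way to do this using the shell and the Java API?

3 Answers:

Answer 0 (score: 4)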

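HBase has no concept of a range delete marker. If you need to delete multiple cells, you have to place a delete marker for each of them, which means every row has to be scanned, either on the client side or on the server side. That leaves you two options:

  1. BulkDeleteProtocol: This uses a coprocessor endpoint, so the entire operation runs on the server side. The link has an example of how to use it. A web search will easily turn up how to enable coprocessor endpoints in HBase.
  2. Scan and delete: This is the cleanest and simplest option. Since you said you need to delete everything older than a particular timestamp across all column families, the scan-and-delete operation can be optimized greatly by using server-side filtering to read only the first key of each row.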

    // Assumes `table` is an open org.apache.hadoop.hbase.client.Table and
    // STOP_TS is the timestamp in question
    Scan scan = new Scan();
    scan.setTimeRange(0, STOP_TS);  // matches cells in [0, STOP_TS); the upper bound is exclusive
    // Crucial optimization: make sure you process multiple rows together
    scan.setCaching(1000);
    // Crucial optimization: retrieve only row keys
    FilterList filters = new FilterList(FilterList.Operator.MUST_PASS_ALL,
        new FirstKeyOnlyFilter(), new KeyOnlyFilter());
    scan.setFilter(filters);
    List<Delete> deletes = new ArrayList<>(1000);
    // try-with-resources closes the scanner even if a delete fails
    try (ResultScanner scanner = table.getScanner(scan)) {
      Result[] rr;
      do {
        // We set caching to 1000 above;
        // make full use of it and get the next 1000 rows in one go
        rr = scanner.next(1000);
        for (Result r : rr) {
          // Removes the row's cells in every column family with timestamp <= STOP_TS
          // (inclusive, unlike the scan); use STOP_TS - 1 to keep cells stamped exactly STOP_TS
          deletes.add(new Delete(r.getRow(), STOP_TS));
        }
        if (!deletes.isEmpty()) {
          table.delete(deletes);
          deletes.clear();
        }
      } while (rr.length > 0);
    }
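
One detail worth checking before running the snippet above is how the two timestamp bounds behave: `Scan.setTimeRange(min, max)` treats `max` as exclusive, while `Delete(row, ts)` removes cells with a timestamp less than or equal to `ts`. The following is a minimal plain-Java model of those two rules, not HBase code; the class and method names (`TimeBoundsSketch`, `scanMatches`, `deleteCovers`, `survivors`) are made up for illustration.

```java
import java.util.List;
import java.util.stream.Collectors;

public class TimeBoundsSketch {
    // Models Scan.setTimeRange(min, max): min is inclusive, max is exclusive.
    static boolean scanMatches(long cellTs, long min, long max) {
        return cellTs >= min && cellTs < max;
    }

    // Models Delete(row, ts): covers cells whose timestamp is <= ts.
    static boolean deleteCovers(long cellTs, long deleteTs) {
        return cellTs <= deleteTs;
    }

    // Returns the cell timestamps that would remain after a delete at deleteTs.
    static List<Long> survivors(List<Long> cellTimestamps, long deleteTs) {
        return cellTimestamps.stream()
                .filter(ts -> !deleteCovers(ts, deleteTs))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        long stopTs = 1000L; // hypothetical cutoff
        List<Long> cells = List.of(998L, 999L, 1000L, 1001L);

        // A cell stamped exactly stopTs is not matched by the scan...
        System.out.println(scanMatches(stopTs, 0L, stopTs)); // false
        // ...but a delete at stopTs would still remove it.
        System.out.println(survivors(cells, stopTs));        // [1001]
        // Deleting at stopTs - 1 keeps it, i.e. "strictly older than stopTs".
        System.out.println(survivors(cells, stopTs - 1));    // [1000, 1001]
    }
}
```

If "older than" is meant strictly, passing `STOP_TS - 1` to the `Delete` constructor keeps both bounds consistent.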
    

Answer 1 (score: 0)

Yes, this can be done easily by setting a time range on the scan and then deleting the rows in the returned result set.

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.Calendar;
    import java.util.List;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Delete;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;
    import org.slf4j.Logger;
    import org.slf4j.LoggerFactory;

    public class BulkDeleteDriver {
        // The original did not define `logger`; slf4j is assumed here
        private static final Logger logger = LoggerFactory.getLogger(BulkDeleteDriver.class);
        // Column family and column are added to lessen the scan I/O
        private static final byte[] COL_FAM = Bytes.toBytes("<column family>");
        private static final byte[] COL = Bytes.toBytes("column");
        private static final byte[] TEST_TABLE = Bytes.toBytes("<TableName>");

        // The original left hbasepath, day, hour, scanCache and the batch size undefined.
        // Here they are read from the command line:
        // <path to hbase-site.xml> <day offset> <hour offset> <scan caching> <batch size>
        public static void main(final String[] args) throws IOException {
            String hbasepath = args[0];
            int day = Integer.parseInt(args[1]);
            int hour = Integer.parseInt(args[2]);
            int scanCache = Integer.parseInt(args[3]);
            int batchSize = Integer.parseInt(args[4]);

            Configuration conf = HBaseConfiguration.create();
            // Path to hbase-site.xml
            conf.addResource(new Path(hbasepath));

            // Get a calendar instance and derive the start and end timestamps
            Calendar calStart = Calendar.getInstance();
            calStart.add(Calendar.DAY_OF_MONTH, day);
            Calendar calEnd = Calendar.getInstance();
            calEnd.add(Calendar.HOUR, hour);
            long startTS = calStart.getTimeInMillis();
            long endTS = calEnd.getTimeInMillis();

            // Most important part of the code: set the time range properly!
            // Here the purpose is to delete everything up to present time - 6 hours
            Scan scan = new Scan();
            scan.setTimeRange(startTS, endTS);
            scan.setCaching(scanCache);
            scan.addColumn(COL_FAM, COL);

            List<Delete> listOfBatchDeletes = new ArrayList<>();
            long recordCount = 0;

            // try-with-resources closes the scanner, table and connection in that order
            try (Connection conn = ConnectionFactory.createConnection(conf);
                 Table table = conn.getTable(TableName.valueOf(TEST_TABLE));
                 ResultScanner resultScanner = table.getScanner(scan)) {
                logger.info("Got the table: " + table.getName());

                // Scan the table and collect the row keys
                for (Result scanResult : resultScanner) {
                    // A Delete without a timestamp removes the whole row
                    listOfBatchDeletes.add(new Delete(scanResult.getRow()));
                    recordCount++;
                    // Fire the deletes in batches
                    if (listOfBatchDeletes.size() >= batchSize) {
                        System.out.println("Firing batch delete now...");
                        table.delete(listOfBatchDeletes);
                        // Don't forget to clear the list
                        listOfBatchDeletes.clear();
                    }
                }
                if (!listOfBatchDeletes.isEmpty()) {
                    System.out.println("Firing final batch of deletes...");
                    table.delete(listOfBatchDeletes);
                }
                System.out.println("Total records deleted: " + recordCount);
            } catch (IOException e) {
                logger.error("Bulk delete failed", e);
                throw e;
            }
        }
    }
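
The batching in the driver above (collect deletes, flush when the batch is full, flush the remainder at the end) can be isolated into a small helper. This is only a sketch in plain Java with no HBase dependency; `BatchFlushSketch` and `forEachBatch` are hypothetical names, and the sink is a stand-in for whatever performs the delete.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

public class BatchFlushSketch {
    // Feeds items to the sink in batches of at most batchSize
    // and returns how many batches were flushed.
    static <T> int forEachBatch(Iterable<T> items, int batchSize, Consumer<List<T>> sink) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<T> batch = new ArrayList<>();
        int flushed = 0;
        for (T item : items) {
            batch.add(item);
            if (batch.size() >= batchSize) {
                // Hand over a copy so the sink may keep or mutate its list
                sink.accept(new ArrayList<>(batch));
                batch.clear();
                flushed++;
            }
        }
        // Final partial batch, skipped when nothing is left
        if (!batch.isEmpty()) {
            sink.accept(new ArrayList<>(batch));
            flushed++;
        }
        return flushed;
    }

    public static void main(String[] args) {
        List<Integer> sizes = new ArrayList<>();
        int batches = forEachBatch(List.of("r1", "r2", "r3", "r4", "r5"), 2,
                b -> sizes.add(b.size()));
        System.out.println(batches + " batches, sizes " + sizes); // 3 batches, sizes [2, 2, 1]
    }
}
```

With HBase, the sink would call `table.delete(batch)`; since that method throws `IOException`, the lambda would need to wrap it in an unchecked exception.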

Answer 2 (score: 0)

If you want to remove the data from the shell and don't want to write a Java client, you can do the following:

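    #!/bin/bash
    # Time range as epoch milliseconds
    start_time=1607731200000
    end_time=1607817600000
    # Use the same table for the scan and the deletes
    table_name='YOUR_TABLE_NAME'

    row_key_file="/tmp/$start_time-$end_time.rowkey"
    touch "$row_key_file"
    now=$(date +'%Y-%m-%d:%H-%M-%S')

    echo "$now: scanning records from date range $start_time to $end_time"
    # Keep the first column (the row key) of each output line. The length > 20 check
    # drops header and summary lines; adjust it to your row key length.
    # sort -u removes the duplicates produced by rows with several cells.
    echo "scan '$table_name', {TIMERANGE => [$start_time, $end_time]}" | hbase shell -n | awk '{if (length($1) > 20) {print $1}}' | sort -u > "$row_key_file"

    rows_scanned=$(wc -l < "$row_key_file")
    echo "Rows scanned: $rows_scanned"
    echo "deleting rows"

    # deleteall without a timestamp removes the whole row
    echo "File.foreach('$row_key_file') { |line| key = line.strip; deleteall '$table_name', key }" | hbase shell -n
    now=$(date +'%Y-%m-%d:%H-%M-%S')
    echo "$now: Data truncation completed"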

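start_time and end_time are the start and end of the time range, given as epoch timestamps in milliseconds. This deletes every row returned by the scan over that time range.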
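
To work out epoch-millisecond values like the ones in the script, a short conversion helps. The sketch below assumes the boundaries are meant as UTC midnights; the class and method names (`EpochMillisSketch`, `startOfDayUtcMillis`) are illustrative only.

```java
import java.time.LocalDate;
import java.time.ZoneOffset;

public class EpochMillisSketch {
    // Converts a calendar date to epoch milliseconds at 00:00:00 UTC.
    static long startOfDayUtcMillis(LocalDate date) {
        return date.atStartOfDay(ZoneOffset.UTC).toInstant().toEpochMilli();
    }

    public static void main(String[] args) {
        long start = startOfDayUtcMillis(LocalDate.of(2020, 12, 12));
        long end = startOfDayUtcMillis(LocalDate.of(2020, 12, 13));
        System.out.println(start); // 1607731200000
        System.out.println(end);   // 1607817600000
    }
}
```

These two values match the script's start_time and end_time, so that example covers exactly one UTC day, 2020-12-12.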