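I am trying to delete all data in an HBase table whose timestamp is older than a specified timestamp. This covers every column family and every row.
Is there a way to do this using the shell and the Java API?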
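Answer 0 (score: 4)
HBase has no concept of a range delete marker. If you need to delete multiple cells, you have to place a delete marker for each of them, which means you must scan every row, either on the client side or on the server side. That leaves you with two options:
Scan and delete: this is the cleanest and simplest option. Since you said you need to delete data older than a particular timestamp across all column families, the scan-and-delete operation can be optimized considerably by using server-side filtering to read only the first key of each row.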
// Assumes an open Table `table` and a long STOP_TS (the timestamp in question)
Scan scan = new Scan();
scan.setTimeRange(0, STOP_TS); // half-open range [0, STOP_TS): only cells older than STOP_TS
// Crucial optimization: make sure you process multiple rows together
scan.setCaching(1000);
// Crucial optimization: retrieve only row keys
FilterList filters = new FilterList(FilterList.Operator.MUST_PASS_ALL,
    new FirstKeyOnlyFilter(), new KeyOnlyFilter());
scan.setFilter(filters);
List<Delete> deletes = new ArrayList<>(1000);
// try-with-resources closes the scanner even if a delete fails
try (ResultScanner scanner = table.getScanner(scan)) {
    Result[] rr;
    do {
        // Caching is set to 1000 above;
        // make full use of it and get the next 1000 rows in one go
        rr = scanner.next(1000);
        if (rr.length > 0) {
            for (Result r : rr) {
                // Delete(row, ts) covers every cell with timestamp <= ts,
                // so STOP_TS - 1 leaves cells stamped exactly STOP_TS untouched
                deletes.add(new Delete(r.getRow(), STOP_TS - 1));
            }
            table.delete(deletes);
            deletes.clear();
        }
    } while (rr.length > 0);
}
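One detail in the snippet above is easy to miss: a scan time range is half-open (the upper bound is excluded), while a row-level Delete that carries a timestamp covers cells at or below that timestamp. The following is a minimal plain-Java model of those two rules, not HBase code; the class name, method names, and the cutoff value 1000 are made up for illustration.

```java
import java.util.List;
import java.util.stream.Collectors;

/** Plain-Java model of the two timestamp rules; it makes no HBase calls. */
final class TimestampSemantics {
    /** Scan time ranges are half-open: minStamp inclusive, maxStamp exclusive. */
    static boolean scanMatches(long cellTs, long minStamp, long maxStamp) {
        return cellTs >= minStamp && cellTs < maxStamp;
    }

    /** A row-level Delete carrying a timestamp covers cells at or below it. */
    static boolean deleteCovers(long cellTs, long deleteTs) {
        return cellTs <= deleteTs;
    }

    /** Timestamps from one row that a Delete stamped deleteTs would remove. */
    static List<Long> removed(List<Long> cellTimestamps, long deleteTs) {
        return cellTimestamps.stream()
                .filter(ts -> deleteCovers(ts, deleteTs))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        long stopTs = 1_000L; // hypothetical cutoff
        List<Long> row = List.of(998L, 999L, 1_000L, 1_001L);
        System.out.println(removed(row, stopTs));     // [998, 999, 1000]
        System.out.println(removed(row, stopTs - 1)); // [998, 999]
    }
}
```

If "older than" is meant strictly, the model suggests stamping the Delete with the cutoff minus one; if cells at exactly the cutoff should go too, the cutoff itself is the right value.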
Answer 1 (score: 0)
Yes, this can be done easily by setting a time range on the scan and then deleting the rows it returns.
import java.io.IOException;
import java.util.ArrayList;
import java.util.Calendar;
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Delete;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class BulkDeleteDriver {
    // Column family and column are specified to reduce scan I/O
    private static final byte[] COL_FAM = Bytes.toBytes("<column family>");
    private static final byte[] COL = Bytes.toBytes("column");
    private static final byte[] TEST_TABLE = Bytes.toBytes("<TableName>");
    // The values below were left undefined in the original; they are placeholders to adjust
    private static final String HBASE_SITE_PATH = "<path to hbase-site.xml>";
    private static final int DAY_OFFSET = -1;  // range start: now + DAY_OFFSET days
    private static final int HOUR_OFFSET = -6; // range end: now + HOUR_OFFSET hours
    private static final int SCAN_CACHE = 1000;
    private static final int BATCH_SIZE = 1000;

    public static void main(final String[] args) throws IOException {
        Configuration conf = HBaseConfiguration.create();
        // Path to hbase-site.xml
        conf.addResource(new Path(HBASE_SITE_PATH));

        // Get a calendar instance and compute proper start and end timestamps
        Calendar calStart = Calendar.getInstance();
        calStart.add(Calendar.DAY_OF_MONTH, DAY_OFFSET);
        Calendar calEnd = Calendar.getInstance();
        calEnd.add(Calendar.HOUR, HOUR_OFFSET);
        long startTs = calStart.getTimeInMillis();
        long endTs = calEnd.getTimeInMillis();

        // Set all scan-related properties
        Scan scan = new Scan();
        // Most important part of the code, set it properly!
        // Here my purpose is to delete everything up to (present time - 6 hours)
        scan.setTimeRange(startTs, endTs);
        scan.setCaching(SCAN_CACHE);
        scan.addColumn(COL_FAM, COL);

        List<Delete> listOfBatchDeletes = new ArrayList<>();
        long recordCount = 0;
        // try-with-resources closes the scanner, table and connection, and a failed
        // connection surfaces as an exception instead of a later NullPointerException
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf(TEST_TABLE));
             ResultScanner resultScanner = table.getScanner(scan)) {
            System.out.println("Got the table: " + table.getName());
            // Scan the table and collect the row keys
            for (Result scanResult : resultScanner) {
                // Note: a Delete built from the row key alone removes the whole row
                listOfBatchDeletes.add(new Delete(scanResult.getRow()));
                recordCount++;
                // Create batches for bulk delete
                if (listOfBatchDeletes.size() >= BATCH_SIZE) {
                    System.out.println("Firing batch delete now...");
                    table.delete(listOfBatchDeletes);
                    // Don't forget to clear the list
                    listOfBatchDeletes.clear();
                }
            }
            System.out.println("Firing final batch of deletes...");
            table.delete(listOfBatchDeletes);
        }
        System.out.println("Total records deleted: " + recordCount);
    }
}
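The batching pattern in this answer (buffer the deletes, fire them when the buffer reaches a chosen size, then fire once more for the remainder) can be separated from the HBase calls. Below is a small sketch of that pattern in plain Java; `BatchFlusher` and its sink callback are hypothetical names introduced here, not part of the HBase client API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** Collects items and hands them to a sink in fixed-size batches. */
final class BatchFlusher<T> {
    private final int batchSize;
    private final Consumer<List<T>> sink;
    private final List<T> buffer = new ArrayList<>();

    BatchFlusher(int batchSize, Consumer<List<T>> sink) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        this.batchSize = batchSize;
        this.sink = sink;
    }

    /** Buffers one item and flushes as soon as the batch is full. */
    void add(T item) {
        buffer.add(item);
        if (buffer.size() >= batchSize) {
            flush();
        }
    }

    /** Sends whatever is buffered; call it once after the loop for the final partial batch. */
    void flush() {
        if (buffer.isEmpty()) {
            return;
        }
        sink.accept(new ArrayList<>(buffer)); // copy, so the sink may modify its argument
        buffer.clear();
    }

    public static void main(String[] args) {
        List<Integer> sizes = new ArrayList<>();
        BatchFlusher<String> flusher = new BatchFlusher<>(3, batch -> sizes.add(batch.size()));
        for (int i = 0; i < 7; i++) {
            flusher.add("row-" + i);
        }
        flusher.flush();
        System.out.println(sizes); // [3, 3, 1]
    }
}
```

In a real job the sink would wrap the call to table.delete, handling its checked IOException inside the callback.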
Answer 2 (score: 0)
If you want to remove the data from the shell and do not want to write a Java client, you can proceed as follows:
#!/bin/bash
# Use one table name for both the scan and the delete
table="YOUR_TABLE_NAME"
start_time=1607731200000
end_time=1607817600000
row_key_file="/tmp/$start_time-$end_time.rowkey"
touch "$row_key_file"
now=$(date +'%Y-%m-%d:%H-%M-%S')
echo "$now: scanning records from date range $start_time to $end_time"
# Keep the first field of each scan output line. The length check (> 20) assumes row keys
# longer than 20 characters and drops header/footer lines; adjust it to your key format.
# sort -u removes the duplicates that appear because scan prints one line per cell.
echo "scan '$table', {TIMERANGE => [$start_time, $end_time]}" | hbase shell -n | awk '{if (length($1) > 20) {print $1}}' | sort -u > "$row_key_file"
rows_scanned=$(wc -l < "$row_key_file")
echo "Rows scanned: $rows_scanned"
echo "deleting rows"
echo "File.foreach('$row_key_file') { |line| key=line.strip; deleteall '$table', key; }" | hbase shell -n
now=$(date +'%Y-%m-%d:%H-%M-%S')
echo "$now: Data truncation completed"
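start_time and end_time are the start and end of the time range, given as epoch timestamps in milliseconds. This deletes every row that has a cell in that time range. Note that deleteall with only a row key removes the whole row, not just the cells inside the range.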
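To work out epoch-millisecond values like the ones in the script, a short Java sketch along these lines may help. It assumes the range boundaries are midnight UTC; the class and method names are illustrative only. The two sample values in the script correspond to 2020-12-12 and 2020-12-13 at 00:00 UTC.

```java
import java.time.Instant;
import java.time.LocalDate;
import java.time.ZoneOffset;

/** Converts calendar dates to epoch milliseconds, assuming UTC day boundaries. */
final class EpochMillis {
    /** Epoch milliseconds at 00:00 UTC on the given date. */
    static long startOfDayUtc(LocalDate date) {
        return date.atStartOfDay(ZoneOffset.UTC).toInstant().toEpochMilli();
    }

    public static void main(String[] args) {
        LocalDate day = LocalDate.of(2020, 12, 12);
        long startTime = startOfDayUtc(day);
        long endTime = startOfDayUtc(day.plusDays(1));
        System.out.println(startTime); // 1607731200000
        System.out.println(endTime);   // 1607817600000
        System.out.println(Instant.ofEpochMilli(startTime)); // 2020-12-12T00:00:00Z
    }
}
```

If your timestamps were written with the cluster's local time in mind, swap ZoneOffset.UTC for the relevant zone before relying on the result.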