Question

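I want to understand how HBase internally handles duplicate records in a file. To experiment with this, I created an EXTERNAL table in Hive with the HBase-specific configuration: table properties, SerDe, and column family. I also had to create the table with its column family in HBase, which I did.

I then ran an INSERT OVERWRITE into this Hive table from a source table that contains duplicate records. By duplicate records I mean rows like these: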

ID | Name        | Surname
 1 | Ritesh      | Rai
 1 | RiteshKumar | Rai

After the INSERT OVERWRITE, I queried the Hive table for ID 1, and the output was the second record:

 1        RiteshKumar         Rai

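I'd like to know how HBase decides which one to keep. Does it simply write the data sequentially, so that the last record overwrites the earlier one and is treated as the latest? Or does it work some other way?

Thanks in advance.

Regards,
Govind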

Answer 1

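You're on the right track!

The HBase data model can be seen as a "multidimensional map" in which every cell value is associated with a timestamp (by default, the insertion time):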

row:column_family:column_qualifier:timestamp:value

Note: the timestamp is associated with each individual value, not with the whole row (which enables several nice features)!
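
To make the "multidimensional map" picture concrete, here is a minimal in-memory sketch in Python. It is only an illustration of the key structure, not HBase code: the `table` dictionary, the `put` helper, and the integer timestamps are all made up for this example.

```python
from collections import defaultdict

# Hypothetical in-memory stand-in for an HBase table:
# row key -> "family:qualifier" -> timestamp -> value
table = defaultdict(lambda: defaultdict(dict))

def put(row, column, value, ts):
    """Store one cell version; the timestamp belongs to this cell only."""
    table[row][column][ts] = value

put("1", "f:name", "Ritesh", ts=100)
put("1", "f:surname", "Rai", ts=101)
# Rewriting only f:name adds a newer version of that one cell.
put("1", "f:name", "RiteshKumar", ts=102)

name_versions = sorted(table["1"]["f:name"])        # [100, 102]
surname_versions = sorted(table["1"]["f:surname"])  # [101]
```

The second write to `f:name` leaves `f:surname` untouched, which is what "the timestamp is per value, not per row" means in practice.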

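On reads you get the latest version by default unless you specify otherwise. By default, 3 versions should be stored (that was the default in older releases; from HBase 0.96 on, a column family keeps 1 version unless configured otherwise). HBase performs a "merged read" that returns the latest cell value for each column of the row.

Try this from your HBase shell (not tested before posting):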

put 'table_name', '1', 'f:name', 'Ritesh'
put 'table_name', '1', 'f:surname', 'Rai'
put 'table_name', '1', 'f:name', 'RiteshKumar'
put 'table_name', '1', 'f:surname', 'Rai'
put 'table_name', '1', 'f:other', 'Some other stuff'

# Data on 'disk' (that might just be the memstore for now) will look like this:
# 1:f:name:1234567890:'Ritesh'
# 1:f:surname:1234567891:'Rai'
# 1:f:name:1234567892:'RiteshKumar'
# 1:f:surname:1234567893:'Rai'
# 1:f:other:1234567894:'Some other stuff'

# Now try this, and you will get 'RiteshKumar', 'Rai', 'Some other stuff':
get 'table_name', '1'

# To get the previous versions of the data, use the following
# (the column family must be configured to keep at least 2 versions):
get 'table_name', '1', {COLUMN => 'f', VERSIONS => 2}
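
If you don't have a cluster at hand, the same session can be approximated with a toy model in Python. This is a sketch under assumptions, not the HBase client API: `VersionedTable`, its `max_versions` parameter, and the counter used in place of the insertion time are all hypothetical names for illustration.

```python
import itertools

class VersionedTable:
    """Toy model of HBase cell versioning (not the real client API)."""

    def __init__(self, max_versions=3):
        self.max_versions = max_versions
        self.cells = {}                   # (row, column) -> [(ts, value), ...], newest first
        self._clock = itertools.count(1)  # stand-in for the insertion time

    def put(self, row, column, value, ts=None):
        ts = next(self._clock) if ts is None else ts
        versions = self.cells.setdefault((row, column), [])
        # A write with an identical timestamp replaces that version.
        versions[:] = [v for v in versions if v[0] != ts]
        versions.append((ts, value))
        # Newest timestamp first; keep at most max_versions per cell.
        versions.sort(key=lambda v: v[0], reverse=True)
        del versions[self.max_versions:]

    def get(self, row, versions=1):
        """Return {column: [values]} with up to `versions` newest values per column."""
        return {
            col: [value for _, value in cell[:versions]]
            for (r, col), cell in self.cells.items()
            if r == row
        }

t = VersionedTable()
t.put("1", "f:name", "Ritesh")
t.put("1", "f:surname", "Rai")
t.put("1", "f:name", "RiteshKumar")
t.put("1", "f:surname", "Rai")
t.put("1", "f:other", "Some other stuff")

latest = t.get("1")               # newest value of every column
history = t.get("1", versions=2)  # up to two versions per column
```

In this model `latest["f:name"]` is `["RiteshKumar"]` and `history["f:name"]` is `["RiteshKumar", "Ritesh"]`: the later write wins on a default read because it carries the larger timestamp, while the earlier value is still stored as an older version.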

Don't forget to have a look at the best practices for schema design.
