
Q1) Why does HBase need a WAL?

It is used for recovery. The MapR docs give a closer look at the HBase architecture.

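When a client issues a Put request, the first step is to write the data to the write-ahead log (WAL):

Edits are appended to the end of the WAL file, which is stored on disk.
The WAL is used to recover data that has not yet been persisted if the server crashes.

Once the data has been written to the WAL, it is placed in the MemStore. The acknowledgement for the Put request is then returned to the client.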
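The write path described above can be sketched as a toy model. This is not HBase code: `ToyRegionServer`, its JSON-lines log format, and the `demo` helper are hypothetical names made up for illustration. It only shows the ordering (log append, then MemStore, then acknowledgement) and how replaying the log rebuilds state after a crash.

```python
import json
import os
import tempfile


class ToyRegionServer:
    """Hypothetical toy model of the write path: WAL append, MemStore update, ack."""

    def __init__(self, wal_path):
        self.wal_path = wal_path
        self.memstore = {}  # in-memory only; lost on a crash

    def put(self, row, value):
        # Step 1: append the edit to the end of the WAL file on disk.
        with open(self.wal_path, "a", encoding="utf-8") as wal:
            wal.write(json.dumps({"row": row, "value": value}) + "\n")
            wal.flush()
            os.fsync(wal.fileno())
        # Step 2: apply the edit to the in-memory MemStore.
        self.memstore[row] = value
        # Step 3: acknowledge the write to the client.
        return True

    @classmethod
    def recover(cls, wal_path):
        """Rebuild the MemStore after a crash by replaying the WAL in order."""
        server = cls(wal_path)
        if os.path.exists(wal_path):
            with open(wal_path, encoding="utf-8") as wal:
                for line in wal:
                    edit = json.loads(line)
                    server.memstore[edit["row"]] = edit["value"]
        return server


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        wal_path = os.path.join(tmp, "toy.wal")
        server = ToyRegionServer(wal_path)
        acked = server.put("row1", "a") and server.put("row2", "b")
        del server  # simulate a crash: the in-memory MemStore is gone
        recovered = ToyRegionServer.recover(wal_path)
        return acked, recovered.memstore
```

Because the acknowledgement is only returned after the log append, every acknowledged edit in this sketch can be rebuilt from the log even though the MemStore itself is volatile.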

Q2) Every time I put or delete data, HBase has to write to the WAL. Why not just operate on its data files directly?

If the WAL is enabled, yes.

If the WAL is disabled, the extra overhead of writing to the WAL is eliminated: the edit goes straight to the MemStore and reaches the data files when the MemStore is flushed.

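Note:

The WAL is generally disabled to improve mutation (row-level mutation) and write performance. If you do this, the underlying caveat is that there is no recovery, which means data loss. Also, if you are using Solr, it works off the WAL, so the Solr documents will not be updated. If you are not using it, you can disable the WAL.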

For further reading, see my answer here

HBase has its own ACID semantics: http://hbase.apache.org/acid-semantics.html

It needs a WAL so that it can replay edits if a RegionServer fails. The WAL plays an important role in providing durability guarantees.

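The WAL is optional. You can disable it for HBase writes, and if you do, you will see some performance improvement. However, in some cluster failure or disaster scenarios you may lose data, so it is a trade-off that depends on your use case.

If a RegionServer crashes, the edits can be recovered from the WAL. Without a WAL, data can be lost if a RegionServer fails before each MemStore has been flushed and written to new StoreFiles. You can find more information here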
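The trade-off can be illustrated with another toy model. Everything here (`ToyStore`, the `skip_wal` flag, `crash_and_recover`, `demo`) is hypothetical and only mimics the behaviour described above; in the real Java client, durability is chosen per mutation, for example with `Durability.SKIP_WAL`. The sketch assumes that a flush makes MemStore contents durable and that a crash wipes whatever is still only in memory.

```python
class ToyStore:
    """Hypothetical toy model: edits live in a MemStore until flushed to a StoreFile."""

    def __init__(self):
        self.wal = []          # stands in for the durable log on disk
        self.store_files = []  # stands in for flushed, durable StoreFiles
        self.memstore = {}     # volatile; lost when the server crashes

    def put(self, row, value, skip_wal=False):
        # With skip_wal=True the edit exists only in memory until the next flush.
        if not skip_wal:
            self.wal.append((row, value))
        self.memstore[row] = value

    def flush(self):
        # Persist the MemStore as a new StoreFile; flushed edits no longer need the log.
        if self.memstore:
            self.store_files.append(dict(self.memstore))
        self.memstore = {}
        self.wal = []

    def crash_and_recover(self):
        # A crash wipes the MemStore; recovery replays whatever the log still holds.
        self.memstore = {}
        for row, value in self.wal:
            self.memstore[row] = value

    def get(self, row):
        if row in self.memstore:
            return self.memstore[row]
        for store_file in reversed(self.store_files):  # newest file first
            if row in store_file:
                return store_file[row]
        return None


def demo():
    store = ToyStore()
    store.put("flushed", 1, skip_wal=True)
    store.flush()                            # durable despite skipping the log
    store.put("logged", 2)                   # in the log, not yet flushed
    store.put("unlogged", 3, skip_wal=True)  # neither logged nor flushed
    store.crash_and_recover()
    return store.get("flushed"), store.get("logged"), store.get("unlogged")
```

In this sketch only the edit that was neither logged nor flushed is gone after the crash, which is the window of data loss the answers above warn about.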
