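I'm trying to write a C++ implementation of a Bloom filter using the MurmurHash3 hash function. My implementation is based on this site: http://blog.michaelschmatz.com/2016/04/11/how-to-write-a-bloom-filter-cpp/
For some reason, in my BloomFilter header file the hash function produces an incomplete type error, and when I call hash inside the add function I get a "hash is ambiguous" error.
What can I do to fix this? I'm fairly new to C++, so I'm not sure whether I'm using the struct's interface/implementation correctly.
I also have a main function that includes this file and runs some tests to analyze the false positive rate, number of bits, filter size, and so on.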
#ifndef BLOOM_FILTER_H
#define BLOOM_FILTER_H
#include "MurmurHash3.h"
#include <vector>
//basic structure of a bloom filter object
struct BloomFilter {
BloomFilter(uint64_t size, uint8_t numHashes);
void add(const uint8_t *data, std::size_t len);
bool possiblyContains(const uint8_t *data, std::size_t len) const;
private:
uint8_t m_numHashes;
std::vector<bool> m_bits;
};
//Bloom filter constructor
BloomFilter::BloomFilter(uint64_t size, uint8_t numHashes)
: m_bits(size),
m_numHashes(numHashes) {}
//Hash array created using the MurmurHash3 code
std::array<uint64_t, 2> hash(const uint8_t *data, std::size_t len)
{
std::array<uint64_t, 2> hashValue;
MurmurHash3_x64_128(data, len, 0, hashValue.data());
return hashValue;
}
//Computes the n-th hash index from the two base hashes
inline uint64_t nthHash(uint8_t n,
uint64_t hashA,
uint64_t hashB,
uint64_t filterSize) {
return (hashA + n * hashB) % filterSize;
}
//Adds an element to the array
void BloomFilter::add(const uint8_t *data, std::size_t len) {
auto hashValues = hash(data, len);
for (int n = 0; n < m_numHashes; n++)
{
m_bits[nthHash(n, hashValues[0], hashValues[1], m_bits.size())] = true;
}
}
//Returns true or false based on a probabilistic assessment of the array using MurmurHash3
bool BloomFilter::possiblyContains(const uint8_t *data, std::size_t len) const {
auto hashValues = hash(data, len);
for (int n = 0; n < m_numHashes; n++)
{
if (!m_bits[nthHash(n, hashValues[0], hashValues[1], m_bits.size())])
{
return false;
}
}
return true;
}
#endif
Answer 0 (score: 4)
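If your MurmurHash3_x64_128 returns two 64-bit numbers as the hash, I'd treat them as 4 distinct uint32_t hashes, as long as your bit string doesn't need more than 4 billion bits. Most likely you don't need more than 2-3 hashes, but that depends on your use case. To figure out how many hashes you need, you can check "How many hash functions does my bloom filter need?".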
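As a rough guide, the usual sizing formulas for a Bloom filter can be written down directly. This is only a sketch: optimalNumBits and optimalNumHashes are hypothetical helper names (not part of the code in the question), and the formulas are the standard textbook approximations m = -n ln(p) / (ln 2)^2 and k = (m / n) ln 2.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>

// Hypothetical helpers, not part of the original BloomFilter code.

// Approximate number of bits m for n expected items and a target
// false positive rate p: m = -n * ln(p) / (ln 2)^2
inline std::size_t optimalNumBits(std::size_t n, double p) {
    assert(n > 0 && p > 0.0 && p < 1.0);
    const double ln2 = std::log(2.0);
    const double m = -static_cast<double>(n) * std::log(p) / (ln2 * ln2);
    return static_cast<std::size_t>(std::ceil(m));
}

// Approximate number of hashes k that minimizes the false positive
// rate for m bits and n items: k = (m / n) * ln 2, clamped to [1, 255]
// so it fits the uint8_t used for numHashes in the question.
inline std::uint8_t optimalNumHashes(std::size_t m, std::size_t n) {
    assert(n > 0);
    const double k =
        static_cast<double>(m) / static_cast<double>(n) * std::log(2.0);
    const double rounded = std::round(k);
    if (rounded < 1.0) return 1;
    if (rounded > 255.0) return 255;
    return static_cast<std::uint8_t>(rounded);
}
```

For example, 1000 expected items at a 1% false positive rate come out at roughly 9,600 bits and 7 hashes, so whether 2-3 hashes are enough really does depend on the bits-per-element budget.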
With MurmurHash3_x64_128 I'd do it this way (if I were to treat it as 4 x uint32_t hashes):
void BloomFilter::add(const uint8_t *data, std::size_t len) {
auto hashValues = hash(data, len);
uint32_t* hx = reinterpret_cast<uint32_t*>(&hashValues[0]);
assert(m_numHashes <= 4);
for (int n = 0; n < m_numHashes; n++)
m_bits[hx[n] % m_bits.size()] = true;
}
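One caveat with the snippet above: reading a uint64_t array through a uint32_t* is not something the C++ aliasing rules permit, even though it tends to work in practice. A sketch of the same idea using shifts instead, assuming the hash is already available as a std::array<uint64_t, 2>; splitHash128 and setBitsFromHash are hypothetical names used only for illustration:

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: splits the two 64-bit halves of a 128-bit hash
// into four 32-bit values using shifts, so no pointer cast is needed
// and the result does not depend on the machine's byte order.
inline std::array<std::uint32_t, 4> splitHash128(
    const std::array<std::uint64_t, 2>& h) {
    return {{
        static_cast<std::uint32_t>(h[0]),        // low 32 bits of h[0]
        static_cast<std::uint32_t>(h[0] >> 32),  // high 32 bits of h[0]
        static_cast<std::uint32_t>(h[1]),        // low 32 bits of h[1]
        static_cast<std::uint32_t>(h[1] >> 32),  // high 32 bits of h[1]
    }};
}

// Hypothetical helper mirroring the body of add(): sets one bit per
// 32-bit sub-hash, for at most four hashes.
inline void setBitsFromHash(std::vector<bool>& bits,
                            const std::array<std::uint64_t, 2>& h,
                            unsigned numHashes) {
    assert(numHashes <= 4 && !bits.empty());
    const auto parts = splitHash128(h);
    for (unsigned n = 0; n < numHashes; ++n) {
        bits[parts[n] % bits.size()] = true;
    }
}
```

On a little-endian machine the four values come out in the same order as with the pointer cast, so the two variants should set the same bits there.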
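Your code has some issues with type conversions, which is why it doesn't compile. #include <array> is missing, and it's better to give your hash function another name (for example myhash), since hash can clash with std::hash, and to make it static. The corrected version also uses size_t for the filter size. Here is a version of the code with these corrections, which should work: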
#ifndef BLOOM_FILTER_H
#define BLOOM_FILTER_H
#include "MurmurHash3.h"
#include <vector>
#include <array>
#include <cstddef>
#include <cstdint>
//basic structure of a bloom filter object
struct BloomFilter {
BloomFilter(size_t size, uint8_t numHashes);
void add(const uint8_t *data, std::size_t len);
bool possiblyContains(const uint8_t *data, std::size_t len) const;
private:
uint8_t m_numHashes;
std::vector<bool> m_bits;
};
//Bloom filter constructor
BloomFilter::BloomFilter(size_t size, uint8_t numHashes)
: m_numHashes(numHashes), // initialized in declaration order to avoid a reorder warning
m_bits(size) {}
//Hash array created using the MurmurHash3 code
static std::array<uint64_t, 2> myhash(const uint8_t *data, std::size_t len)
{
std::array<uint64_t, 2> hashValue;
MurmurHash3_x64_128(data, len, 0, hashValue.data());
return hashValue;
}
//Computes the n-th hash index from the two base hashes (double hashing)
inline size_t nthHash(int n,
uint64_t hashA,
uint64_t hashB,
size_t filterSize) {
return (hashA + n * hashB) % filterSize; // <- not sure if that is OK, perhaps it is.
}
//Adds an element to the array
void BloomFilter::add(const uint8_t *data, std::size_t len) {
auto hashValues = myhash(data, len);
for (int n = 0; n < m_numHashes; n++)
{
m_bits[nthHash(n, hashValues[0], hashValues[1], m_bits.size())] = true;
}
}
//Returns true or false based on a probabilistic assessment of the array using MurmurHash3
bool BloomFilter::possiblyContains(const uint8_t *data, std::size_t len) const {
auto hashValues = myhash(data, len);
for (int n = 0; n < m_numHashes; n++)
{
if (!m_bits[nthHash(n, hashValues[0], hashValues[1], m_bits.size())])
{
return false;
}
}
return true;
}
#endif
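Regarding the (hashA + n * hashB) % filterSize line: this is the common double hashing trick, where two base hashes are combined to simulate any number of hash functions. Unsigned arithmetic wraps around in C++, so the multiplication itself is well defined. A small sketch to make the probe sequence visible; doubleHashIndex and probeIndices are hypothetical names, and doubleHashIndex just restates nthHash:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical restatement of nthHash:
// index_n = (hashA + n * hashB) mod filterSize.
// Unsigned overflow wraps modulo 2^64, which is well defined.
inline std::size_t doubleHashIndex(std::uint64_t n, std::uint64_t hashA,
                                   std::uint64_t hashB,
                                   std::size_t filterSize) {
    assert(filterSize > 0);  // modulo by zero would be undefined
    return static_cast<std::size_t>((hashA + n * hashB) % filterSize);
}

// Hypothetical helper: lists the bit positions a single element
// would touch for a given number of hashes.
inline std::vector<std::size_t> probeIndices(std::uint64_t hashA,
                                             std::uint64_t hashB,
                                             unsigned numHashes,
                                             std::size_t filterSize) {
    std::vector<std::size_t> out;
    out.reserve(numHashes);
    for (unsigned n = 0; n < numHashes; ++n) {
        out.push_back(doubleHashIndex(n, hashA, hashB, filterSize));
    }
    return out;
}
```

With hashA = 10, hashB = 3, four hashes and a filter of 16 bits, the positions are 10, 13, 0 and 3. Note that if hashB happens to be 0, every probe lands on the same bit, so the filter behaves as if it had a single hash for that element.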
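If you're just starting out with C++, begin with a basic example first; maybe try std::hash? Get a working implementation, then extend it with an optional hash function parameter. If you need your BloomFilter to be fast, I'd probably stay away from vector<bool> and use an array of unsigned ints.
A basic impl could look like this, provided you've implemented MurmurHash3: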
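#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
// Assumed to be implemented elsewhere; returns a 32-bit hash of the buffer.
uint32_t MurmurHash3(const char *str, size_t len);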
class BloomFilter
{
public:
BloomFilter(int count_elements = 0, double bits_per_element = 10)
{
mem = NULL;
init(count_elements, bits_per_element);
}
~BloomFilter()
{
delete[] mem;
}
void init(int count_elements, double bits_per_element)
{
assert(!mem);
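sz = (uint32_t)(count_elements*bits_per_element + 0.5); // must be > 0 before add()/test(), which compute h % sz
mem = new uint8_t[sz / 8 + 8](); // value-initialize so every bit starts at 0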
}
void add(const std::string &str)
{
add(str.data(), str.size());
}
void add(const char *str, size_t len)
{
if (len <= 0)
return;
add(MurmurHash3(str, len));
}
bool test(const std::string &str)
{
return test(str.data(), str.size());
}
bool test(const char *str, size_t len)
{
return test_hash(MurmurHash3(str, len));
}
bool test_hash(uint32_t h)
{
h %= sz;
if (0 != (mem[h / 8] & (1u << (h % 8))))
return true;
return false;
}
int mem_size() const
{
return (sz + 7) / 8;
}
private:
void add(uint32_t h)
{
h %= sz;
mem[h / 8] |= (1u << (h % 8));
}
public:
uint32_t sz;
uint8_t *mem;
};
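To illustrate the "start with std::hash" suggestion, here is a minimal sketch that needs no external hash library. Everything in it is an assumption made for illustration: SimpleBloomFilter and mix64 are hypothetical names, the second hash is derived from the first with a splitmix64-style mixer rather than computed independently, and the quality of std::hash is implementation-defined, so treat it as a starting point rather than a tuned implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <string>
#include <vector>

// Hypothetical minimal Bloom filter: std::hash for the base hash and
// bits packed into 64-bit words instead of std::vector<bool>.
class SimpleBloomFilter {
public:
    SimpleBloomFilter(std::size_t numBits, unsigned numHashes)
        : numBits_(numBits),
          numHashes_(numHashes),
          words_((numBits + 63) / 64, 0) {
        assert(numBits > 0 && numHashes > 0);
    }

    void add(const std::string& key) {
        const std::uint64_t a = std::hash<std::string>{}(key);
        const std::uint64_t b = mix64(a) | 1;  // odd, so never zero
        for (unsigned n = 0; n < numHashes_; ++n) {
            const std::uint64_t bit = (a + n * b) % numBits_;
            words_[bit / 64] |= std::uint64_t{1} << (bit % 64);
        }
    }

    // False means "definitely not added"; true means "possibly added".
    bool possiblyContains(const std::string& key) const {
        const std::uint64_t a = std::hash<std::string>{}(key);
        const std::uint64_t b = mix64(a) | 1;
        for (unsigned n = 0; n < numHashes_; ++n) {
            const std::uint64_t bit = (a + n * b) % numBits_;
            if ((words_[bit / 64] & (std::uint64_t{1} << (bit % 64))) == 0) {
                return false;
            }
        }
        return true;
    }

private:
    // splitmix64-style mixer, used here only to derive a second hash
    // from the first one (an assumption of this sketch).
    static std::uint64_t mix64(std::uint64_t x) {
        x += 0x9e3779b97f4a7c15ULL;
        x = (x ^ (x >> 30)) * 0xbf58476d1ce4e5b9ULL;
        x = (x ^ (x >> 27)) * 0x94d049bb133111ebULL;
        return x ^ (x >> 31);
    }

    std::size_t numBits_;
    unsigned numHashes_;
    std::vector<std::uint64_t> words_;
};
```

Usage would look like SimpleBloomFilter f(10000, 7); f.add("apple"); after which f.possiblyContains("apple") is always true, while a key that was never added returns false unless it happens to be a false positive. Once this works, swapping std::hash for MurmurHash3 only changes how a and b are computed.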