Question

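I am comparing objects based on the binary presence or absence of a set of features. These features can be represented as a bit string, for example:

10011

This bit string has the first, fourth, and fifth features.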

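I am trying to compute the similarity of a pair of bit strings as the number of features they have in common. For a given set of bit strings, I know they all have the same length, but I do not know that length at compile time.

For example, these two strings have two features in common, so I would like the similarity function to return 2: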

s(10011,10010) = 2

How can I efficiently represent and compare bit strings in C++?

Answer 1

You can use the STL class std::bitset.

Bitsets can be constructed from bit strings, ANDed together, and asked for the number of 1 bits:

#include <string>
#include <bitset>

int main()
{
  std::bitset<5> option1(std::string("10011")), option2(std::string("10010"));
  std::bitset<5> and_bit = option1 & option2; //bitset will have 1s only on common options
  size_t s = and_bit.count ();                //return the number of 1 in the bitfield
  return 0;
}

Edit

If the number of bits is not known at compile time, you can use boost::dynamic_bitset<>:

boost::dynamic_bitset<> option(bit_string);

The rest of the example does not change, because boost::dynamic_bitset<> shares a common interface with std::bitset.

Answer 2

A faster algorithm:

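#include <iostream>

// Counts the bits set in both a and b.
// The masks below assume unsigned int is 32 bits wide.
int similarity(unsigned int a, unsigned int b)
{
   unsigned int r = a & b;                            // keep only the common bits
   r = ( r & 0x55555555 ) + ((r >> 1) & 0x55555555 ); // sums of adjacent bits
   r = ( r & 0x33333333 ) + ((r >> 2) & 0x33333333 ); // sums per 4-bit group
   r = ( r & 0x0f0f0f0f ) + ((r >> 4) & 0x0f0f0f0f ); // sums per byte
   r = ( r & 0x00ff00ff ) + ((r >> 8) & 0x00ff00ff ); // sums per 16-bit half
   r = ( r & 0x0000ffff ) + ((r >>16) & 0x0000ffff ); // total for the whole word
   return r;
}

int main() {
        unsigned int a = 19; // 10011
        unsigned int b = 18; // 10010
        std::cout << similarity(a, b) << std::endl;
        return 0;
}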

Output:

2

Demo on ideone: http://www.ideone.com/bE4qb
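
The masks in this answer only work if unsigned int is 32 bits wide, so it handles at most 32 features. As a minimal sketch, the same parallel count can be written against std::uint32_t to make that assumption explicit, next to a naive bit-by-bit count that can be used to check it. The names popcount_swar, popcount_naive and similarity32 are illustrative and do not come from the answer.

```cpp
#include <cstdint>

// Parallel bit count: each step adds neighbouring groups of bits,
// so the partial sums cover 2, 4, 8, 16 and finally 32 bits.
inline std::uint32_t popcount_swar(std::uint32_t r)
{
    r = (r & 0x55555555u) + ((r >> 1)  & 0x55555555u);
    r = (r & 0x33333333u) + ((r >> 2)  & 0x33333333u);
    r = (r & 0x0f0f0f0fu) + ((r >> 4)  & 0x0f0f0f0fu);
    r = (r & 0x00ff00ffu) + ((r >> 8)  & 0x00ff00ffu);
    r = (r & 0x0000ffffu) + ((r >> 16) & 0x0000ffffu);
    return r;
}

// Reference implementation: test one bit at a time.
inline std::uint32_t popcount_naive(std::uint32_t r)
{
    std::uint32_t n = 0;
    for (; r != 0; r >>= 1)
        n += r & 1u;
    return n;
}

// Similarity of two feature sets holding at most 32 features each.
inline std::uint32_t similarity32(std::uint32_t a, std::uint32_t b)
{
    return popcount_swar(a & b);
}
```

Comparing popcount_swar against popcount_naive over a range of inputs is a cheap way to confirm the masks before relying on the faster version.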

Answer 3

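Since you do not know the bit length at compile time, you can use boost::dynamic_bitset instead of std::bitset.

You can then use operator& (or &=) to find the common bits, and count them with boost::dynamic_bitset::count().

Performance varies. For maximum speed, depending on your compiler, you may have to implement the loop yourself, for example using @Nawaz's method or one from Bit Twiddling Hacks, or by writing the loop with assembler or compiler intrinsics for SSE, popcount, and so on.

Note that at least llvm, gcc and icc detect many patterns of this kind and optimize them for you, so check the generated code before doing the work by hand.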
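
As one illustration of the hand-written loop mentioned above, here is a sketch that uses only the standard library: the bit string is packed into 64-bit words at run time, the words are ANDed pairwise, and each result is counted with std::bitset<64>::count(), which compilers commonly turn into a popcount instruction (worth confirming in the generated code). PackedBits, pack and count_common are hypothetical names, and the input is assumed to contain only '0' and '1' characters.

```cpp
#include <bitset>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper type: a bit string packed into 64-bit words.
struct PackedBits {
    std::size_t size = 0;              // number of features (bits)
    std::vector<std::uint64_t> words;  // ceil(size / 64) words
};

// Pack a string of '0'/'1' characters.
// Character i is stored in bit (i % 64) of word (i / 64).
inline PackedBits pack(const std::string& s)
{
    PackedBits p;
    p.size = s.size();
    p.words.assign((s.size() + 63) / 64, 0);
    for (std::size_t i = 0; i < s.size(); ++i)
        if (s[i] == '1')
            p.words[i / 64] |= std::uint64_t(1) << (i % 64);
    return p;
}

// Number of features set in both inputs.
// Only the words present in both inputs are compared.
inline std::size_t count_common(const PackedBits& a, const PackedBits& b)
{
    const std::size_t n =
        a.words.size() < b.words.size() ? a.words.size() : b.words.size();
    std::size_t total = 0;
    for (std::size_t i = 0; i < n; ++i)
        total += std::bitset<64>(a.words[i] & b.words[i]).count();
    return total;
}
```

Packing costs one pass over the string, so this layout pays off mainly when each bit string is compared against many others.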

Answer 4

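With std::bitset, if your set of features has fewer bits than a long (I think it is a long), you can get an unsigned long representation of the bits, AND the two values, and use bit-twiddling tricks to count the set bits.

If you would rather keep using strings to represent your bit patterns, you can do something like the following with the zip_iterator from boost.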

#include <iostream>
#include <string>
#include <algorithm>

#include <boost/tuple/tuple.hpp>
#include <boost/iterator/zip_iterator.hpp>

struct check_is_set
{
  bool operator()(const boost::tuple<char const&, char const&>& t) const
  {
    const char& cv1 = boost::get<0>(t);
    const char& cv2 = boost::get<1>(t);
    return cv1 == char('1') && cv1 == cv2;
  }
};

size_t count_same(std::string const& opt1, std::string const& opt2)
{
  std::string::const_iterator beg1 = opt1.begin();
  std::string::const_iterator beg2 = opt2.begin();

  // need the same number of items for end (this really is daft, you get a runtime
  // error if the sizes are different otherwise!! I think it's a bug in the
  // zip_iterator implementation...)
  size_t end_s = std::min(opt1.size(), opt2.size());
  std::string::const_iterator end1 = opt1.begin() + end_s;
  std::string::const_iterator end2 = opt2.begin() + end_s;

  return std::count_if(
  boost::make_zip_iterator(
    boost::make_tuple(beg1, beg2)
    ),
  boost::make_zip_iterator(
    boost::make_tuple(end1, end2)
    ),
    check_is_set()
  );
}

int main(void)
{
  std::string opt1("1010111");
  std::string opt2("001101");

  std::cout << "same: " << count_same(opt1, opt2) << std::endl;

  return 0;
}
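
If pulling in zip_iterator is undesirable, the same pairwise count can be sketched with the standard library alone. The version below uses std::inner_product with a custom "multiply" step that tests whether both characters are '1'. The name count_same_std is illustrative, and, like the code above, it only compares the common prefix when the lengths differ.

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <numeric>
#include <string>

// Counts the positions, within the shorter string's length,
// where both strings contain '1'.
inline std::size_t count_same_std(const std::string& opt1, const std::string& opt2)
{
    const std::size_t n = std::min(opt1.size(), opt2.size());
    return std::inner_product(
        opt1.begin(), opt1.begin() + n,  // first range, cut to the common length
        opt2.begin(),                    // start of the second range
        std::size_t(0),                  // initial count
        std::plus<std::size_t>(),        // accumulate the per-position results
        [](char a, char b) -> std::size_t {
            return (a == '1' && b == '1') ? 1 : 0;  // 1 if the feature is shared
        });
}
```

For the two strings used in main above, "1010111" and "001101", only the third and sixth positions are '1' in both, so this returns 2.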

How to quickly compare variable-length bit strings in C++?
