Question

我使用std::vector<std::vector<T> >的数据类型来存储2D矩阵/数组。我想确定这个矩阵的唯一行。我正在寻找有关如何进行此操作的任何建议或指示。

我尝试了两种方法。

方法1：略微复杂。我为每行保留一个索引，其中0/1表示该行是否为重复值，并通过矩阵工作，将每个唯一行的索引存储在deque中。我想将结果存储在<vector<vector<T> >中，因此从这个索引的deque中，我预先分配然后将矩阵中的行分配给返回值。

方法2：更容易阅读，并且在许多情况下比方法1更快。我保留已找到的唯一行的双端队列，并循环遍历行并将每行与此{中的所有条目进行比较{1}}。

我正在将这两种方法与matlab进行比较，这些C ++例程的速度要低几个数量级。有没有人对如何加快这项操作有任何聪明的想法？我希望在可能有数百万行的矩阵上执行此操作。

我在循环期间将唯一行存储在双端队列中，以避免调整向量大小的成本，然后将deque复制到deque以获得结果。我已经对此操作进行了密切的基准测试，并且它的运行速度并没有接近减速，例如，在占用100,000行的矩阵上，它占运行时间的不到0.5％。

谢谢，

鲍勃

这是代码。如果有人对显示用法的更完整的示例感兴趣，请给我发表评论，然后我就可以把它放在一起。

方法1：

vector<vector<T> >

以下是方法2的代码：

  template <typename T>
      void uniqueRows( const std::vector<std::vector<T> > &A,
                       std::vector<std::vector<T> > &ret) {
    // Go through a vector<vector<T> > and find the unique rows
    // have a value ind for each row that is 1/0 indicating if a value
    // has been previously searched.

    // cur : current item being compared to every item
    // num : number of values searched for.  Once all the values in the
    //  matrix have been searched, terminate.

    size_t N = A.size();
    size_t num=1,cur=0,it=1;
    std::vector<unsigned char> ind(N,0);
    std::deque<size_t> ulist;  // create a deque to store the unique inds

    ind[cur] = 1;
    ulist.push_back(0); // ret.push_back(A[0]);

    while(num < N ) {

      if(it >= N ) {
        ++cur;   // find next non-duplicate value, push back
        while(ind[cur])
          ++cur;

        ulist.push_back(cur); //ret.push_back(A[cur]);
        ++num;
        it = cur+1; // start search for duplicates at the next row

        if(it >= N && num == N)
          break;
      }

      if(!ind[it] && A[cur]==A[it]) {        
        ind[it] = 1; // mark as duplicate
        ++num;
      }
      ++it;
    } // ~while num

    // loop over the deque and .push_back the unique vectors    
    std::deque<size_t>::iterator iter;
    const std::deque<size_t>::iterator end = ulist.end();
    ret.reserve(ulist.size());

    for(iter= ulist.begin(); iter != end; ++iter) {
      ret.push_back(A[*iter]);
    }
  }

Answer 1

编辑：我忘记了std :: vector已经定义了operator<和operator==所以你甚至不需要使用它：

template <typename t>
std::vector<std::vector<t> > GetUniqueRows(std::vector<std::vector<t> > input)
{
    std::sort(input.begin(), input.end());
    input.erase(std::unique(input.begin(), input.end()), input.end());
    return input;
}

~~使用std::unique与自定义函子一起使用，该函数在两个向量上调用std::equal。~~

std::unique要求首先对输入进行排序。在两个向量输入上使用调用std::lexicographical_compare的自定义函子。如果您需要恢复无序输出，则需要以某种方式存储现有订单。这将实现排序操作的M * n log n复杂度（其中M是内向量的长度，n是内向量的数量），而std::unique调用将花费m*n时间。

为了进行比较，您现有的方法都是m * n ^ 2次。

编辑：示例：

template <typename t> struct VectorEqual : std::binary_function<const std::vector<t>&, const std::vector<t>&, bool> { bool operator()(const std::vector<t>& lhs, const std::vector<t>& rhs) { if (lhs.size() != rhs.size()) return false; return std::equal(lhs.first(), lhs.second(), rhs.first()); } }; template <typename t> struct VectorLess : std::binary_function<const std::vector<t>&, const std::vector<t>&, bool> { bool operator()(const std::vector<t>& lhs, const std::vector<t>& rhs) { return std::lexicographical_compare(lhs.first(), lhs.second(), rhs.first(), rhs.second()); } }; template <typename t> std::vector<std::vector<t> > GetUniqueRows(std::vector<std::vector<t> > input) { std::sort(input.begin(), input.end(), VectorLess<t>()); input.erase(std::unique(input.begin(), input.end(), VectorEqual<t>()), input.end()); return input; }

Answer 2

您还应该考虑使用散列，保留行排序 并且可以更快（摊销O(m*n) < / strong>如果允许更改原文，O(2*m*n)如果需要副本），则sort / unique - 对于大型矩阵尤其明显（在小矩阵上，您可能会感觉更好）与Billy的解决方案一起，因为他不需要额外的内存分配来跟踪哈希值。）

无论如何，利用Boost.Unordered，这是你可以做的：

#include <vector> #include <boost/foreach.hpp> #include <boost/ref.hpp> #include <boost/typeof/typeof.hpp> #include <boost/unordered_set.hpp> namespace boost { template< typename T > size_t hash_value(const boost::reference_wrapper< T >& v) { return boost::hash_value(v.get()); } template< typename T > bool operator==(const boost::reference_wrapper< T >& lhs, const boost::reference_wrapper< T >& rhs) { return lhs.get() == rhs.get(); } } // destructive, but fast if the original copy is no longer required template <typename T> void uniqueRows_inplace(std::vector<std::vector<T> >& A) { boost::unordered_set< boost::reference_wrapper< std::vector< T > const > > unique(A.size()); for (BOOST_AUTO(it, A.begin()); it != A.end(); ) { if (unique.insert(boost::cref(*it)).second) { ++it; } else { A.erase(it); } } } // returning a copy (extra copying cost) template <typename T> void uniqueRows_copy(const std::vector<std::vector<T> > &A, std::vector< std::vector< T > > &ret) { ret.reserve(A.size()); boost::unordered_set< boost::reference_wrapper< std::vector< T > const > > unique; BOOST_FOREACH(const std::vector< T >& row, A) { if (unique.insert(boost::cref(row)).second) { ret.push_back(row); } } }

确定2D数组的唯一行（vector <vector <t>＆gt;）</vector <t>

2 个答案: