Question

Is there a more efficient way to compare data byte by byte than the comparison operator of the C++ list container?

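I have to compare [large? 10 kByte < size < 500 Byte] amounts of byte data to verify the integrity of external storage devices.

So I read the files byte by byte and store the values in a list of unsigned chars. The list's resources are managed by a shared_ptr, so I can pass it around in the program without worrying about the size of the list.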

#include <list>
#include <boost/shared_ptr.hpp>
#include <boost/filesystem.hpp>

typedef boost::shared_ptr< std::list< unsigned char > > contentPtr;
namespace fs = boost::filesystem;

contentPtr GetContent( fs::path filePath ){
    contentPtr actualContent( new std::list< unsigned char > );
    // Read the file with a stream, put read values into actual content
    return actualContent;
}

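This is done twice, because there are always two copies of each file. The contents of the two files have to be compared, and an exception is thrown if a mismatch is found.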

void CompareContent() {   // throws NotMatchingException on mismatch
    // this part is very fast, below 50ms
    contentPtr contentA = GetContent("/fileA");
    contentPtr contentB = GetContent("/fileB");
    // the next part takes about 2secs with a file size of ~64kByte
    if( *contentA != *contentB )
        throw NotMatchingException();
}

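My problem is:
As the file size grows, comparing the lists becomes very slow. With file sizes of about 100 kByte, comparing the contents takes up to two seconds, and the time goes up and down with the file size.

Is there a more efficient way to do this comparison? Is the container I use the problem?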

Answer 1

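Don't use std::list, use std::vector.

std::list is a linked list; its elements are not guaranteed to be stored contiguously.

std::vector, on the other hand, seems much better suited to the task at hand (storing contiguous bytes and comparing large chunks of data).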
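As a rough sketch of what that could look like (ReadBytes and SameContent are hypothetical helper names, not part of the question's code), the file is read straight into a contiguous vector and the two vectors are compared with operator==:

```cpp
#include <fstream>
#include <iterator>
#include <string>
#include <vector>

// Hypothetical helper: reads the whole file into a contiguous byte buffer.
// A file that cannot be opened yields an empty vector in this sketch.
std::vector<unsigned char> ReadBytes(const std::string& path)
{
    std::ifstream in(path.c_str(), std::ios::binary);
    // istreambuf_iterator reads raw characters and does not skip whitespace.
    return std::vector<unsigned char>(std::istreambuf_iterator<char>(in),
                                      std::istreambuf_iterator<char>());
}

// Hypothetical helper: true when both files hold identical bytes.
bool SameContent(const std::string& pathA, const std::string& pathB)
{
    // vector::operator== checks the sizes first, then compares element-wise.
    return ReadBytes(pathA) == ReadBytes(pathB);
}
```

Whether this is fast enough for your files is something only a measurement on your system can tell.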

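If you have to compare several files many times and don't care where the differences are, you can also compute a hash of each file and compare the hashes. That would be faster still.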
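A minimal sketch of the hashing idea, assuming a 64-bit FNV-1a hash as a stand-in for whatever hash you prefer (HashFile is a hypothetical name, and FNV-1a is my choice for brevity, not something the question requires):

```cpp
#include <cstdint>
#include <fstream>
#include <iterator>
#include <string>

// Hypothetical helper: 64-bit FNV-1a hash over the raw bytes of a file.
// FNV-1a is not cryptographic; swap in a stronger hash if that matters.
std::uint64_t HashFile(const std::string& path)
{
    std::ifstream in(path.c_str(), std::ios::binary);
    std::uint64_t hash = 0xcbf29ce484222325ULL;      // FNV-1a offset basis
    for (std::istreambuf_iterator<char> it(in), end; it != end; ++it) {
        hash ^= static_cast<unsigned char>(*it);     // mix in one byte
        hash *= 0x100000001b3ULL;                    // FNV-1a prime
    }
    return hash;
}
```

Different hashes prove the files differ; equal hashes only make equality very likely. Since hashing reads the whole file anyway, it pays off only when each stored hash is reused for several comparisons.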

Answer 2

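My first piece of advice is to profile your code.

The reason I say that is that no matter how slow your comparison code is, I suspect your file I/O time dwarfs it. You don't want to waste time optimizing a part of the code that accounts for only 1% of the run time as it is.

It may even be that something else you haven't noticed is what actually causes the slowness. You won't know until you profile.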
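A real profiler gives the most detail, but as a first rough check a stopwatch around each phase is often enough. The sketch below assumes C++11; ElapsedMs is a hypothetical helper name:

```cpp
#include <chrono>

// Hypothetical helper: runs `work` once and returns the elapsed wall-clock
// time in milliseconds, measured with a monotonic clock.
template <typename Work>
double ElapsedMs(Work work)
{
    const auto start = std::chrono::steady_clock::now();
    work();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}
```

Wrapping the read step and the compare step separately, for example ElapsedMs([&] { contentA = GetContent("/fileA"); }), shows which of the two dominates before you spend time optimizing either.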

Answer 3

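If nothing else has to be done with the contents of those files (and it looks like you let shared_ptr delete them at the end of CompareContent()'s scope), why not compare the files with iterators, without creating any containers at all?

Here is a piece of my code that compares two files byte by byte: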

// compare files (local_f and host_f are assumed to be already-open input streams)
if (std::equal(std::istreambuf_iterator<char>(local_f),
               std::istreambuf_iterator<char>(),
               std::istreambuf_iterator<char>(host_f)))
{
    // we're good: move table to OutPath, remove other files
}

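Edit: if you do need to keep the contents around, I think std::deque might be slightly more efficient than std::vector, for the reasons explained in GOTW #54. Or not; profiling will tell. And even then, only one of the two identical files needs to be loaded into memory: I would read one into a deque and compare it against the other file's istreambuf_iterator.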
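The deque variant could be sketched roughly like this. MatchesKeptCopy is a hypothetical name, and the four-iterator std::equal overload assumes C++14 or later:

```cpp
#include <algorithm>
#include <deque>
#include <fstream>
#include <iterator>
#include <string>

// Hypothetical helper: loads the first file into `kept` (so the caller can
// keep using it) and streams the second file past it without storing it.
bool MatchesKeptCopy(const std::string& keptPath, const std::string& otherPath,
                     std::deque<char>& kept)
{
    std::ifstream keptFile(keptPath.c_str(), std::ios::binary);
    kept.assign(std::istreambuf_iterator<char>(keptFile),
                std::istreambuf_iterator<char>());

    std::ifstream otherFile(otherPath.c_str(), std::ios::binary);
    // The four-iterator std::equal also returns false when the two ranges
    // differ in length, so a shorter or longer second file is a mismatch.
    return std::equal(kept.begin(), kept.end(),
                      std::istreambuf_iterator<char>(otherFile),
                      std::istreambuf_iterator<char>());
}
```

The comparison stops at the first differing byte, so a corrupted copy is usually detected without reading the second file to the end.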

Answer 4

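As you wrote, you are comparing the contents of two files. In that case you can use Boost's mapped files. You really don't need to read the whole file: you can read it on the fly (in the optimized way Boost does it) and stop as soon as you find the first unequal byte...

Just like in the very elegant solution Cubbi posted here: http://www.cplusplus.com/forum/general/94032/ Note that further down he also added some benchmarks, which clearly show that this is the fastest way. I will rewrite his answer slightly: add a check for zero file size (which would otherwise throw an exception) and wrap the test in a function so that it benefits from early returns: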

#include <iostream>
#include <algorithm>
#include <string>
#include <boost/iostreams/device/mapped_file.hpp>
#include <boost/filesystem.hpp>

namespace io = boost::iostreams;
namespace fs = boost::filesystem;

bool files_equal(const std::string& path1, const std::string& path2)
{
    fs::path f1(path1);
    fs::path f2(path2);

    if (fs::file_size(f1) != fs::file_size(f2))
        return false;

    // zero-sized files cannot be opened with mapped_file_source
    // hence we consider all zero-sized files equal
    if (fs::file_size(f1) == 0)
        return true;

    io::mapped_file_source mf1(f1.string());
    io::mapped_file_source mf2(f2.string());   // map the second file, not the first one again
    return std::equal(mf1.data(), mf1.data() + mf1.size(), mf2.data());
}

int main()
{
    if (files_equal("test.1", "test.2"))
        std::cout << "The files are equal.\n";
    else
        std::cout << "The files are not equal.\n";
}

Answer 5

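std::list is extremely inefficient for char elements: every element carries overhead to make O(1) insertion and removal possible, which is really not what your task needs.

If you must use the STL, then std::vector or the suggested iterator approach would be better than std::list, but why not read the data into a char* wrapped in a smart pointer of your choice and use memcmp?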
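A minimal sketch of that idea, assuming std::unique_ptr<char[]> as the smart pointer (ByteBuffer, LoadFile and SameBytes are hypothetical names):

```cpp
#include <cstddef>
#include <cstring>
#include <fstream>
#include <memory>
#include <string>

// Hypothetical type: a heap buffer owned by a smart pointer, plus its length.
struct ByteBuffer {
    std::unique_ptr<char[]> data;
    std::size_t size = 0;
};

// Hypothetical helper: loads a whole file with a single read() call.
ByteBuffer LoadFile(const std::string& path)
{
    ByteBuffer buffer;
    // Open positioned at the end so tellg() reports the file size.
    std::ifstream in(path.c_str(), std::ios::binary | std::ios::ate);
    if (!in)
        return buffer;                          // unreadable file: empty buffer
    const std::streamoff length = in.tellg();
    if (length <= 0)
        return buffer;                          // empty file: nothing to read
    buffer.data.reset(new char[static_cast<std::size_t>(length)]);
    in.seekg(0, std::ios::beg);
    in.read(buffer.data.get(), length);
    buffer.size = static_cast<std::size_t>(in.gcount());
    return buffer;
}

bool SameBytes(const ByteBuffer& a, const ByteBuffer& b)
{
    // memcmp may only run over buffers of equal size, so check that first.
    return a.size == b.size &&
           (a.size == 0 || std::memcmp(a.data.get(), b.data.get(), a.size) == 0);
}
```

Note that this sketch treats a file that cannot be opened like an empty one; real code for an integrity check would probably want to report that case separately.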

Answer 6

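It's crazy to use anything other than memcmp for the comparison. (Unless you want it even faster, in which case you might want to write it in assembly language.)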

Answer 7

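For the sake of objectivity in the memcmp-vs-equal debate, I offer the following benchmark program so that you can see for yourself which, if any, is faster on your system. It also tests operator==. On my system (Borland C++ 5.5.1 for Win32):

std::equal: 1375 clock ticks
operator==: 1297 clock ticks
memcmp: 1297 clock ticks

What happens on your system?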

#include <algorithm>
#include <cstdlib>    // exit
#include <cstring>    // memcmp
#include <ctime>      // clock, clock_t
#include <iostream>
#include <vector>

using namespace std;

static char* buff ;
static vector<char> v0, v1 ;

static int const BufferSize = 100000 ;

static clock_t StartTimer() ;
static clock_t EndTimer (clock_t t) ;

int main (int argc, char** argv)
  {
  // Allocate a zero-initialized buffer so both vectors hold defined data
  buff = new char[BufferSize]() ;

  // Create two vectors
  vector<char> v0 (buff, buff + BufferSize) ;
  vector<char> v1 (buff, buff + BufferSize) ;

  clock_t t ;

  // Compare them 10000 times using std::equal
  t = StartTimer() ;
  for (int i = 0 ; i < 10000 ; i++)
    if (!equal (v0.begin(), v0.end(), v1.begin()))
      cout << "Error in std::equal\n", exit (1) ;
  t = EndTimer (t) ;
  cout << "std::equal: " << t << " clock ticks\n" ;

  // Compare them 10000 times using operator==
  t = StartTimer() ;
  for (int i = 0 ; i < 10000 ; i++)
    if (!(v0 == v1))
      cout << "Error in operator==\n", exit (1) ;
  t = EndTimer (t) ;
  cout << "operator==: " << t << " clock ticks\n" ;

  // Compare them 10000 times using memcmp
  t = StartTimer() ;
  for (int i = 0 ; i < 10000 ; i++)
    if (memcmp (&v0[0], &v1[0], v0.size()))
      cout << "Error in memcmp\n", exit (1) ;
  t = EndTimer (t) ;
  cout << "memcmp: " << t << " clock ticks\n" ;

  return 0 ;
  }

static clock_t StartTimer()
  {
  // Start on a clock tick, to enhance reproducibility
  clock_t t = clock() ;
  while (clock() == t)
    ;
  return clock() ;
  }

static clock_t EndTimer (clock_t t)
  {
  return clock() - t ;
  }
