Question

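I want to build a function that performs a file analysis: it should return an array holding the frequency of every byte value from 0x00 to 0xff.

So I wrote this prototype: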

// function prototype  and other stuff

unsigned int counts[256] = {0}; // byte lookup table 
FILE * pFile;                   // file handle
long fsize;             // to store file size
unsigned char* buff;            // buffer
unsigned char* pbuf;            // later, mark buffer start
unsigned char* ebuf;            // later, mark buffer end

if ( ( pFile = fopen ( FNAME , "rb" ) ) == NULL )
{
    printf("Error");
    return -1;
}
else
{
    //get file size
    fseek (pFile , 0 , SEEK_END);
    fsize = ftell (pFile);
    rewind (pFile);

    // allocate space ( file size + 1 )
    // I want file contents as string for populating it
    // with pointers
    buff = (unsigned char*)malloc( sizeof(char) * fsize + 1 );

    // read whole file into memory
    fread(buff,1,fsize,pFile);

    // close file
    fclose(pFile);

    // mark end of buffer as string
    buff[fsize] = '\0';

    // set the pointers to beginning and end
    pbuf = &buff[0];
    ebuf = &buff[fsize];


    // Here is the bottleneck:
    // iterate over the entire file byte by byte,
    // counting bytes
    while ( pbuf != ebuf)
    {
        printf("%c\n",*pbuf);
        // update byte count
        counts[(*pbuf)]++;
        ++pbuf;
    }


    // free allocated memory
    free(buff);
    buff = NULL;

}
// printing stuff

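But this approach is slow. I'm looking for a better algorithm, because I've seen HxD do the same thing faster.

I thought that reading several bytes at a time might be a solution, but I don't know how to do it.

I need a hand or some advice.

Thanks.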

Answer 1

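Assuming your file isn't so large that reading the whole thing into memory makes the system start paging, your algorithm is as good as it gets for general-purpose data: O(n).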
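If paging does become a concern, one option is to avoid holding the whole file in memory and count in fixed-size chunks instead. The following is only a minimal sketch: the function name `count_bytes_chunked` and the 64 KiB chunk size are assumptions made for illustration, not something taken from the question.

```c
#include <stdio.h>

/* Hypothetical helper (name and chunk size are illustrative assumptions).
 * Adds the frequency of every byte readable from fp to counts[256].
 * Returns 0 on success, -1 if a read error occurred. */
int count_bytes_chunked(FILE *fp, unsigned int counts[256])
{
    unsigned char chunk[64 * 1024];   /* fixed-size read buffer */
    size_t n;

    /* fread returns the number of bytes actually read; 0 means EOF or error */
    while ((n = fread(chunk, 1, sizeof chunk, fp)) > 0) {
        for (size_t i = 0; i < n; i++)
            counts[chunk[i]]++;
    }

    return ferror(fp) ? -1 : 0;
}
```

The caller would open the file in "rb" mode as in the question, zero-initialize `counts`, and call the function once. Memory use stays constant regardless of file size.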

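You need to remove the printf (as already noted); beyond that, if the performance still isn't good enough, the only way to improve it is to look at the generated assembler - the compiler may not be optimizing away all the de-references (gcc should, though).

If you happen to know something about your data set, there may be some improvements available - if it is a bitmap-type image that is likely to contain blocks of identical bytes, it might be worth doing a bit of run-length encoding. There may also be data sets where it is actually worth sorting the data first (although that degrades the general case to O(n log n), so it's unlikely).

The RLE would look something like this (usual disclaimer: untested, off the top of my head, and probably sub-optimal):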

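while ( pbuf != ebuf)
{
    // start a new run at the current byte
    unsigned char cbuf = *pbuf++;
    unsigned int cur_count = 1;

    // extend the run while the following bytes are identical
    while( pbuf != ebuf && cbuf == *pbuf )
    {
        cur_count++;
        pbuf++;
    }
    // add the whole run to the table in one update
    counts[cbuf] += cur_count;
}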

Answer 2

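You can often trade an increase in program size for an increase in speed, and I think that would work well in your case. I would consider replacing your unsigned char * pointers with unsigned short * pointers and effectively processing two bytes at a time. That way you halve the number of array index increments, halve the number of offset calculations into your accumulator, halve the number of accumulations, and halve the number of tests to see whether your loop is done.

Like I said, this comes at the cost of increased program size, since your accumulator array now needs 65536 elements instead of 256, but that is a small price to pay. I'll admit there is a readability trade-off too.

At the end, you have to run an index through all 65536 elements of the new, larger accumulator, mask it with 0xff to get the first byte, and shift it right by 8 bits to get the second byte. You then have two indices that correspond to your original accumulator, and from there you add each 16-bit element's count to both of those entries in your original 256-element accumulator.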

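P.S. Note that while you can process almost all of the file 2 bytes at a time, if the file size is an odd number of bytes you will have to handle the last byte on its own.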
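The two-bytes-at-a-time idea could be sketched roughly as follows. This is an illustration under stated assumptions, not the answerer's code: the function name `count_bytes_pairs` is made up, and instead of casting to `unsigned short *` the sketch loads each pair with `memcpy` into a `uint16_t`, which sidesteps alignment and aliasing concerns. The byte order of the pair does not matter, because both halves are folded back into the same 256-entry table.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper (the name is an illustrative assumption).
 * Counts bytes of buf[0..len) two at a time using a 65536-entry table,
 * then folds that table into counts[256].
 * Returns 0 on success, -1 if the temporary table cannot be allocated. */
int count_bytes_pairs(const unsigned char *buf, size_t len,
                      unsigned int counts[256])
{
    unsigned int *pair_counts = calloc(65536, sizeof *pair_counts);
    if (pair_counts == NULL)
        return -1;

    /* main loop: one table update per two input bytes */
    size_t i = 0;
    for (; i + 1 < len; i += 2) {
        uint16_t pair;
        memcpy(&pair, buf + i, sizeof pair);
        pair_counts[pair]++;
    }

    /* fold: each 16-bit value contributes its count to both of its bytes */
    for (unsigned int v = 0; v < 65536; v++) {
        counts[v & 0xff] += pair_counts[v];
        counts[v >> 8]   += pair_counts[v];
    }

    /* odd file size: the last byte is handled on its own */
    if (i < len)
        counts[buf[i]]++;

    free(pair_counts);
    return 0;
}
```

Whether this is actually faster depends on the machine: the 65536-entry table is much larger than the 256-entry one, so it is worth measuring against the plain loop before adopting it.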

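P.P.S. Note that if you want your spare 3 CPU cores to do something more useful than twiddling their thumbs, this problem is easy to parallelize across 4 threads. ;-)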
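One way the 4-thread split might look, assuming POSIX threads are available (compile with `-pthread`). The names `count_bytes_parallel`, `count_slice` and `struct slice` are hypothetical. Each thread counts its own slice of the buffer into a private table, and the tables are summed after joining, so no locking is needed.

```c
#include <pthread.h>
#include <stddef.h>

#define NUM_THREADS 4

/* One contiguous part of the buffer plus its private histogram. */
struct slice {
    const unsigned char *start;
    size_t len;
    unsigned int counts[256];
};

/* Thread entry point: counts the bytes of a single slice. */
static void *count_slice(void *arg)
{
    struct slice *s = arg;
    for (size_t i = 0; i < s->len; i++)
        s->counts[s->start[i]]++;
    return NULL;
}

/* Hypothetical helper (names are illustrative assumptions).
 * Adds the byte frequencies of buf[0..len) to counts[256].
 * Returns 0 on success; returns -1 if a thread could not be created,
 * in which case counts only reflects the slices that did run. */
int count_bytes_parallel(const unsigned char *buf, size_t len,
                         unsigned int counts[256])
{
    struct slice slices[NUM_THREADS] = {0};
    pthread_t threads[NUM_THREADS];
    size_t per = len / NUM_THREADS;
    int started = 0, status = 0;

    for (int t = 0; t < NUM_THREADS; t++) {
        slices[t].start = buf + (size_t)t * per;
        /* the last slice also takes the remainder */
        slices[t].len = (t == NUM_THREADS - 1) ? len - (size_t)t * per : per;
        if (pthread_create(&threads[t], NULL, count_slice, &slices[t]) != 0) {
            status = -1;
            break;
        }
        started++;
    }

    /* wait for each worker, then merge its private table */
    for (int t = 0; t < started; t++) {
        pthread_join(threads[t], NULL);
        for (int b = 0; b < 256; b++)
            counts[b] += slices[t].counts[b];
    }
    return status;
}
```

For small files the cost of creating threads will likely outweigh any gain, so this is only worth trying on large buffers.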

How to improve this character counting algorithm
