Implementing a recursive hashing algorithm

Date: 2011-12-07 03:08:46

Tags: c# algorithm filecompare

Suppose file A has the bytes:

2
5
8
0
33
90
1
3
200
201
23
12
55

I have a simple hash algorithm where I store the sum of the last three consecutive bytes, so:

2   
5   
8   - = 8+5+2 = 15
0   
33  
90  - = 90+33+0 = 123
1   
3   
200 - = 204
201 
23  
12  - = 236

So I can represent file A as 15, 123, 204, 236.

Now suppose I copy that file to a new computer B and make some small modifications, so that file B's bytes are:

255
2
5
8
0
33
90
1
3
200
201
23
12
55
255
255

Note that the difference is an extra byte at the beginning of the file plus 2 extra bytes at the end, but the rest of the bytes are very similar.

So I can perform the same algorithm to determine whether parts of the files are the same. Remember that file A is represented by the hash codes 15, 123, 204, 236. Let's see whether file B gives me some of those hash codes!

So on file B I will take the sum of every 3 consecutive bytes:

int[] sums; // array where we will hold the cumulative sums of the bytes


255 sums[0]  = 255
2   sums[1]  = 2   + sums[0]  = 257
5   sums[2]  = 5   + sums[1]  = 262
8   sums[3]  = 8   + sums[2]  = 270   hash = sums[3]  - sums[0]  = 15   --> MATCHES FILE A!
0   sums[4]  = 0   + sums[3]  = 270   hash = sums[4]  - sums[1]  = 13
33  sums[5]  = 33  + sums[4]  = 303   hash = sums[5]  - sums[2]  = 41
90  sums[6]  = 90  + sums[5]  = 393   hash = sums[6]  - sums[3]  = 123  --> MATCHES FILE A!
1   sums[7]  = 1   + sums[6]  = 394   hash = sums[7]  - sums[4]  = 124
3   sums[8]  = 3   + sums[7]  = 397   hash = sums[8]  - sums[5]  = 94
200 sums[9]  = 200 + sums[8]  = 597   hash = sums[9]  - sums[6]  = 204  --> MATCHES FILE A!
201 sums[10] = 201 + sums[9]  = 798   hash = sums[10] - sums[7]  = 404
23  sums[11] = 23  + sums[10] = 821   hash = sums[11] - sums[8]  = 424
12  sums[12] = 12  + sums[11] = 833   hash = sums[12] - sums[9]  = 236  --> MATCHES FILE A!
55  sums[13] = 55  + sums[12] = 888   hash = sums[13] - sums[10] = 90
255 sums[14] = 255 + sums[13] = 1143  hash = sums[14] - sums[11] = 322
255 sums[15] = 255 + sums[14] = 1398  hash = sums[15] - sums[12] = 565

So from looking at that table I know that file B contains the bytes from file A plus some additional bytes, because the hash codes match.

The reason I show this algorithm is that it is of order n. In other words, I am able to compute the hash of the last 3 consecutive bytes without having to iterate over them again!
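To make this concrete, here is a minimal C# sketch of the rolling-sum idea (the method name RollingSums and the windowSize parameter are my own illustrative choices, not part of any library):

static int[] RollingSums(byte[] data, int windowSize)
{
    // O(n): keep one running total, adding the byte that enters the
    // window and subtracting the byte that leaves it.
    var hashes = new int[data.Length - windowSize + 1];
    int total = 0;

    for (int i = 0; i < data.Length; i++)
    {
        total += data[i];                  // byte entering the window
        if (i >= windowSize)
            total -= data[i - windowSize]; // byte leaving the window
        if (i >= windowSize - 1)
            hashes[i - windowSize + 1] = total;
    }
    return hashes;
}

For file A above, RollingSums(fileA, 3) yields 15, 123, 204 and 236 at indexes 0, 3, 6 and 9, matching the hand computation.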

If I had a more complex algorithm, such as taking the md5 of the last 3 bytes, it would no longer be of order n: as I iterate through file B I would need an inner for loop to compute the hash of the last three bytes at every position, making it roughly n times the window size.

So my question is:

How can I improve this algorithm while keeping it of order n? That is, computing each hash only once. If I use an existing hashing algorithm such as md5, I would have to put an inner loop inside the algorithm, which would significantly increase its order.

Note that the same thing could be done with multiplication instead of addition, but the running counter grows very quickly. Maybe I could combine multiplication, addition and subtraction...

EDIT

Also, if I google:

recursive hashing functions in n-grams

a lot of information comes up, and I find those algorithms very hard to understand...

I have to implement this algorithm for a project, which is why I am reinventing the wheel... I know there are a lot of algorithms out there.

Another solution I was thinking of is to perform the same algorithm plus another, stronger one. So on file A I would run the same algorithm every 3 bytes plus the md5 of every 3-byte chunk. On the second file, I would only run the second algorithm when the first one matches...
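For that two-level idea, a hedged sketch could use the rolling sum as the cheap filter and the framework's MD5 only to confirm candidate matches (ChunkMatches is a hypothetical helper name of mine):

using System.Security.Cryptography;

// Called only for positions where the cheap rolling hashes already agree,
// so the expensive hash never runs inside the main scan loop.
static bool ChunkMatches(byte[] fileA, int offsetA, byte[] fileB, int offsetB, int chunkSize)
{
    using (var md5 = MD5.Create())
    {
        byte[] hashA = md5.ComputeHash(fileA, offsetA, chunkSize);
        byte[] hashB = md5.ComputeHash(fileB, offsetB, chunkSize);

        for (int i = 0; i < hashA.Length; i++)
            if (hashA[i] != hashB[i])
                return false;
        return true;
    }
}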

3 Answers:

Answer 0 (score: 2)

EDIT:

The more I think about what you mean by "recursive", the more I doubt that the solution I presented earlier is what you should implement to do anything useful.

You probably want to implement a hash tree algorithm, which is a recursive operation.

To do this you hash the list, split the list in two, and recurse into those two sub-lists. Terminate when the list is of size 1 or of the minimum desired hash size, because each level of recursion doubles the size of the total hash output.

Pseudo-code:

create-hash-tree(input list, minimum size: default = 1):
  initialize the output list
  hash-sublist(input list, output list, minimum size)
  return output list

hash-sublist(input list, output list, minimum size):
  add sum-based-hash(list) result to output list // easily swap hash styles here
  if size(input list) > minimum size:
    split the list into two halves
    hash-sublist(first half of list, output list, minimum size)
    hash-sublist(second half of list, output list, minimum size)

sum-based-hash(list):
  initialize the running total to 0

  for each item in the list:
    add the current item to the running total

  return the running total

I think the run time of the whole algorithm is O(hash(m)), where m = n * (log(n) + 1) and hash(m) is usually linear time.

The storage space is something like O(hash * s), where s = 2n - 1 and the hash is usually a constant size.

Note that for C# I would make the output list a List<HashType>, but I would make the input list an IEnumerable<ItemType> to save storage space, and use Linq to quickly "split" the list without allocating two new sub-lists.
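A rough C# sketch of that pseudo-code follows, using the simple sum-based hash; recursing on index ranges (rather than Take/Skip) is my own choice to honor the "no new sub-lists" note, and none of these names come from a library:

using System.Collections.Generic;

static class HashTree
{
    // Hash the whole range, then recurse into its two halves until the
    // ranges reach the minimum size; each level doubles the output size.
    public static List<long> CreateHashTree(IList<byte> input, int minimumSize = 1)
    {
        var output = new List<long>();
        HashSublist(input, 0, input.Count, output, minimumSize);
        return output;
    }

    static void HashSublist(IList<byte> input, int start, int count,
                            List<long> output, int minimumSize)
    {
        output.Add(SumBasedHash(input, start, count)); // easily swap hash styles here

        if (count > minimumSize)
        {
            int half = count / 2;
            HashSublist(input, start, half, output, minimumSize);
            HashSublist(input, start + half, count - half, output, minimumSize);
        }
    }

    static long SumBasedHash(IList<byte> input, int start, int count)
    {
        long total = 0;
        for (int i = start; i < start + count; i++)
            total += input[i]; // running total as the (weak) hash
        return total;
    }
}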

Original:

I think you can get this down to O(n + m) execution time, where n is the size of the list and m is the size of the running tally, with m < n (otherwise all the sums are equal).

Using a deque

The memory consumption will be the stack size plus size m of temporary storage.

To do this, use a deque and a running total. Push newly encountered values onto the deque while adding them to the running total, and when the deque exceeds size m, pop off the front and subtract it from the running total.

Here is some pseudo-code:

initialize the running total to 0

for each item in the list:
  add the current item to the running total
  push the current value onto the back of the deque
  if deque.length > m:
    pop off the front of the deque
    subtract the popped value from the running total
  assign the running total to the current sum slot in the list

reset the index to the beginning of the list

while the deque isn't empty:
  add the item in the list at the current index to the running total
  pop off the front of the deque
  subtract the popped value from the running total
  assign the running total to the current sum slot in the list
  increment the index

This isn't recursive; it's iterative.
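As a rough C# translation of that pseudo-code (a Queue<byte> stands in for the deque, since values are only pushed at the back and popped from the front; SlidingSums is my own name):

using System.Collections.Generic;

static long[] SlidingSums(byte[] data, int m)
{
    var sums = new long[data.Length];
    var window = new Queue<byte>();
    long total = 0;

    // first loop: slide the window of the last m values across the list
    for (int i = 0; i < data.Length; i++)
    {
        total += data[i];
        window.Enqueue(data[i]);
        if (window.Count > m)
            total -= window.Dequeue(); // drop the value that left the window
        sums[i] = total;
    }

    // second loop: wrap around to the front, overwriting the first
    // slots until the queue drains
    int index = 0;
    while (window.Count > 0)
    {
        total += data[index];
        total -= window.Dequeue();
        sums[index++] = total;
    }
    return sums;
}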

A run of this algorithm looks like this (for m = 3):

value   sum slot   overwritten sum slot
2       2          69
5       7          62
8       15         15
0       13
33      41
90      123
1       124
3       94
200     204
201     404
23      424
12      236
55      90

Using indexes

You can remove the deque and the reassignment of any slots by taking the sum of the last m values up front, and using an offset of your index instead of popping from a deque, e.g. array[i - m].

This won't decrease your execution time, since you still need two loops, one to build up the running tally and one to populate all the values. But it reduces your memory usage to stack space only (effectively O(1)).

Here is some pseudo-code:

initialize the running total to 0

for the last m items in the list:
  add those items to the running total

for each item in the list:
  add the current item to the running total
  subtract the value of the item m slots earlier from the running total
  assign the running total to the current sum slot in the list

The "m slots earlier" part is the tricky bit. You can split it into two loops:

  • one that indexes from the end of the list, minus m, plus i
  • one that indexes from i minus m

Or you can use modulo arithmetic to "wrap" the value when i - m < 0 (note that in C#, % can return a negative remainder, hence the extra + n):

int valueToSubtract = array[((i - m) % n + n) % n];
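And here is a short C# sketch of the whole index-based variant under the same assumptions (IndexedSums is my own name):

static long[] IndexedSums(byte[] data, int m)
{
    int n = data.Length;
    var sums = new long[n];
    long total = 0;

    // prime the running total with the last m values
    for (int i = n - m; i < n; i++)
        total += data[i];

    for (int i = 0; i < n; i++)
    {
        total += data[i];
        total -= data[((i - m) % n + n) % n]; // value m slots earlier, wrapped
        sums[i] = total;
    }
    return sums;
}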

Answer 1 (score: 1)

http://en.wikipedia.org/wiki/Rabin%E2%80%93Karp_algorithm uses an updatable hash function, which it calls a http://en.wikipedia.org/wiki/Rolling_hash. This is much easier to compute than MD5/SHA, and it is not necessarily inferior.

You can prove something about it: it is a polynomial of degree d in the chosen constant a. Suppose someone supplies two pieces of text and you choose a at random. What is the probability of a collision? Well, if the hashes are equal, subtracting them gives you a polynomial that has a as a root. Since a nonzero polynomial of degree d has at most d roots and a was chosen at random, the probability is at most d/modulus, which is very small for a large modulus.
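To illustrate, here is a minimal C# sketch of such an updatable polynomial hash; the base A and the prime modulus P are illustrative constants of mine (the argument above wants a chosen at random modulo a large prime):

// h(window) = (c0*A^(m-1) + c1*A^(m-2) + ... + c(m-1)) mod P;
// sliding the window one byte costs O(1): drop the leading term, shift, add.
const long P = 1000000007; // large prime modulus (illustrative)
const long A = 256;        // the constant a, ideally chosen at random

static long[] PolynomialRollingHashes(byte[] data, int m)
{
    var hashes = new long[data.Length - m + 1];

    long aPow = 1; // A^(m-1) mod P
    for (int i = 0; i < m - 1; i++)
        aPow = aPow * A % P;

    long h = 0;
    for (int i = 0; i < data.Length; i++)
    {
        if (i >= m) // remove the byte leaving the window
            h = (h - data[i - m] * aPow % P + P) % P;
        h = (h * A + data[i]) % P; // shift and add the incoming byte
        if (i >= m - 1)
            hashes[i - m + 1] = h;
    }
    return hashes;
}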

MD5/SHA are of course secure, but see http://cr.yp.to/mac/poly1305-20050329.pdf for a secure version of this idea.

Answer 2 (score: 0)

This is what I have so far. I'm just missing the steps that shouldn't take any time, such as comparing the hash arrays and opening the files for reading.

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

namespace RecursiveHashing
{
    static class Utilities
    {

        // Used for circular arrays. If my circular array is of size 5 and its
        // current position is 2, then shifting 3 units to the left should put me
        // at index 4 of the array.
        public static int Shift(this int number, int shift, int divisor)
        {
            var tempa = (number + shift) % divisor;
            if (tempa < 0)
                tempa = divisor + tempa;
            return tempa;
        }

    }
    class Program
    {
        const int CHUNCK_SIZE = 4; // split the files into chunks of 4 bytes

        /* 
         * formula that I will use to compute the hash
         * 
         *      formula = sum(chunk) * (a[c]+1) * (a[c-1]+1) * (a[c-2]+1) * (-1)^a[c]
         *      
         *          where:
         *              sum(chunk)  = sum of the current chunk
         *              a[c]        = current byte
         *              a[c-1]      = previous byte
         *              a[c-2]      = byte before the previous byte
         *              (-1)^a[c]   = either -1 or +1
         *              
         *      this formula is efficient because I can get the sum of any chunk by keeping track of the overall sum,
         *      thus this algorithm should be of order n
         */

        static void Main(string[] args)
        {
            Part1(); // Missing implementation to open file for reading
            Part2();
        }



        // first part: compute hashes on the first file
        static void Part1()
        {
            // pretend the file read produced these bytes
            byte[] FileB = new byte[]{2,3,5,8,2,0,1,0,0,0,1,2,4,5,6,7,8,2,3,4,5,6,7,8,11,};

            // create an array in which to store the hashes
            // column 0 will use a fast hash algorithm; column 1 will use a more secure hashing algorithm
            Int64[,] hashes = new Int64[(FileB.Length / CHUNCK_SIZE) + 10, 2];


            // used to track which index of the file we are at
            int counter = 0;
            byte[] current = new byte[CHUNCK_SIZE + 1]; // circular array needed to remember the last few bytes
            UInt64[] sum = new UInt64[CHUNCK_SIZE + 1]; // circular array needed to remember the last sums
            int index = 0; // position within the circular array

            int numberOfHashes = 0; // number of hashes created so far


            while (counter < FileB.Length)
            {
                int i = 0;
                for (; i < CHUNCK_SIZE; i++)
                {
                    if (counter == 0)
                    {
                        sum[index] = FileB[counter];
                    }
                    else
                    {
                        sum[index] = FileB[counter] + sum[index.Shift(-1, CHUNCK_SIZE + 1)];
                    }
                    current[index] = FileB[counter];
                    counter++;

                    if (counter % CHUNCK_SIZE == 0 || counter == FileB.Length)
                    {
                        // get the sum of the last chunk
                        var a = (sum[index] - sum[index.Shift(1, CHUNCK_SIZE + 1)]);
                        Int64 tempHash = (Int64)a;

                        // compute my hash function
                        tempHash = tempHash * ((Int64)current[index] + 1)
                                          * ((Int64)current[index.Shift(-1, CHUNCK_SIZE + 1)] + 1)
                                          * ((Int64)current[index.Shift(-2, CHUNCK_SIZE + 1)] + 1)
                                          * (Int64)(Math.Pow(-1, current[index]));


                        // add the hashes to the array: column 0 holds the fast hash,
                        // column 1 will later hold a stronger hash
                        hashes[numberOfHashes, 0] = tempHash;
                        hashes[numberOfHashes, 1] = -1; // placeholder for the stronger hash
                        numberOfHashes++;

                        // MISSING IMPLEMENTATION TO STORE A SECOND STRONGER HASH FUNCTION

                        if (counter == FileB.Length)
                            break;
                    }

                    index++;
                    index = index % (CHUNCK_SIZE + 1); // if index is out of bounds in circular array place it at position 0
                }
            }
        }


        static void Part2()
        {
            // simulate file read of a similar file
            byte[] FileB = new byte[]{1,3,5,8,2,0,1,0,0,0,1,2,4,5,6,7,8,2,3,4,5,6,7,8,11};            

            // array in which to collect all matching hashes
            Int64[,] hashes = new Int64[(FileB.Length / CHUNCK_SIZE) + 10, 2];


            int counter = 0;
            byte[] current = new byte[CHUNCK_SIZE + 1]; // circular array
            UInt64[] sum = new UInt64[CHUNCK_SIZE + 1]; // circular array
            int index = 0; // position within the circular array



            while (counter < FileB.Length)
            {
                int i = 0;
                for (; i < CHUNCK_SIZE; i++)
                {
                    if (counter == 0)
                    {
                        sum[index] = FileB[counter];
                    }
                    else
                    {
                        sum[index] = FileB[counter] + sum[index.Shift(-1, CHUNCK_SIZE + 1)];
                    }
                    current[index] = FileB[counter];
                    counter++;

                    // here we compute the hash at every position; the implementation
                    // to check whether the hash is contained in the other file is missing
                    if (counter >= CHUNCK_SIZE)
                    {
                        var a = (sum[index] - sum[index.Shift(1, CHUNCK_SIZE + 1)]);

                        Int64 tempHash = (Int64)a;

                        tempHash = tempHash * ((Int64)current[index] + 1)
                                          * ((Int64)current[index.Shift(-1, CHUNCK_SIZE + 1)] + 1)
                                          * ((Int64)current[index.Shift(-2, CHUNCK_SIZE + 1)] + 1)
                                          * (Int64)(Math.Pow(-1, current[index]));

                        if (counter == FileB.Length)
                            break;
                    }

                    index++;
                    index = index % (CHUNCK_SIZE + 1);
                }
            }
        }
    }
}

The same files represented as tables, using the same algorithm.

First file:

bytes   sum Ac  A[c-1] A[c-2] -1^Ac  hash = chunkSum * (Ac+1) * (A[c-1]+1) * (A[c-2]+1) * (-1)^Ac
2       2                   
3       5                   
5       10  5   3   2   -1  
8       18  8   5   3   1   3888
2       20  2   8   5   1   
0       20  0   2   8   1   
1       21  1   0   2   -1  
0       21  0   1   0   1   6
0       21  0   0   1   1   
0       21  0   0   0   1   
1       22  1   0   0   -1  
2       24  2   1   0   1   18
4       28  4   2   1   1   
5       33  5   4   2   -1  
6       39  6   5   4   1   
7       46  7   6   5   -1  -7392
8       54  8   7   6   1   
2       56  2   8   7   1   
3       59  3   2   8   -1  
4       63  4   3   2   1   1020
5       68  5   4   3   -1  
6       74  6   5   4   1   
7       81  7   6   5   -1  
8       89  8   7   6   1   13104
11      100 11  8   7   -1  -27648

File B:

bytes   sum Ac  A[c-1] A[c-2] -1^Ac  rolling hash = chunkSum * (Ac+1) * (A[c-1]+1) * (A[c-2]+1) * (-1)^Ac
1       1                   
3       4                   
5       9   5   3   1   -1  
8       17  8   5   3   1   3672
2       19  2   8   5   1   2916
0       19  0   2   8   1   405
1       20  1   0   2   -1  -66
0       20  0   1   0   1   6
0       20  0   0   1   1   2
0       20  0   0   0   1   1
1       21  1   0   0   -1  -2
2       23  2   1   0   1   18
4       27  4   2   1   1   210
5       32  5   4   2   -1  -1080
6       38  6   5   4   1   3570
7       45  7   6   5   -1  -7392
8       53  8   7   6   1   13104
2       55  2   8   7   1   4968
3       58  3   2   8   -1  -2160
4       62  4   3   2   1   1020
5       67  5   4   3   -1  -1680
6       73  6   5   4   1   3780
7       80  7   6   5   -1  -7392
8       88  8   7   6   1   13104
11      99  11  8   7   -1  -27648