How can I quickly hash very large substrings without collisions?

Asked: 2016-11-02 05:21:59

Tags: c++ algorithm c++11 hash

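I have an application, part of which finds all palindromic substrings of an input string. The input string can be up to 100,000 characters long, so the substrings can be very large. For example, one input to the application produces more than 300,000 palindromic substrings that are each longer than 10,000 characters. The application later counts which palindromes are equal to each other, identifying each distinct palindrome by a hash that is computed with the standard hash inside the function that finds the palindromes. The hashes are stored in a vector and then counted for uniqueness later in the application. The problem with these input and output conditions is that hashing the very large substrings takes too long, and on top of that the hashes collide. So I am wondering whether there is an algorithm (hash) that can quickly and uniquely hash a very large substring (preferably working from the substring's index range for speed, but with the accuracy of uniqueness). The hashing is done at the end of the function get_palins. The code is below.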

#include <iostream>
#include <string>
#include <cstdlib>
#include <time.h>
#include <vector>
#include <algorithm>
#include <unordered_map>
#include <map>
#include <cstdio>
#include <cmath>
#include <ctgmath>

using namespace std;

#define MAX 100000
#define mod 1000000007

vector<long long> palins[MAX+5];

//  Finds all palindromes for the string
void  get_palins(string &s)
{
     int N = s.length();
     int i, j, k,   // iterators
     rp,        // length of 'palindrome radius'
     R[2][N+1]; // table for storing results (2 rows: one for odd-, one for even-length palindromes)

     s = "@" + s + "#"; // insert 'guards' to iterate easily over s

     for(j = 0; j <= 1; j++)
     {
         R[j][0] = rp = 0; i = 1;

         while(i <= N)
         {
             while(s[i - rp - 1] == s[i + j + rp]) {  rp++;  }
             R[j][i] = rp;
             k = 1;
             while((R[j][i - k] != rp - k) && (k < rp))
             {
                 R[j][i + k] = min(R[j][i - k],rp - k);
                 k++;
             }
             rp = max(rp - k,0);
             i += k;
         }
     }

     s = s.substr(1,N); // remove 'guards'

     for(i = 1; i <= N; i++)
     {
        for(j = 0; j <= 1; j++)
            for(rp = R[j][i]; rp > 0; rp--)
            {
                int begin = i - rp - 1;
                int end_count = 2 * rp + j;
                int end = begin + end_count - 1;
                if (!(begin == 0  && end == N -1 ))
                {
                   string ss = s.substr(begin, end_count);
                   long long hsh = hash<string>{}(ss);
                   palins[begin].push_back(hsh);

                }
          }
     }
}
unordered_map<long long, int> palin_counts;
unordered_map<char, int> end_matches;

// Solve when at least 1 character in string is different
void solve_all_not_same(string &s)
{
    int n = s.length();
    long long count = 0;

    get_palins(s);

    long long palin_count = 0;

    // Gets all palindromes into unordered map
    for (int i = 0; i <= n; i++)
    {
        for (auto& it : palins[i])
        {
            if (palin_counts.find(it)  == palin_counts.end())
            {
                palin_counts.insert({it,1});
            }
            else
            {
                palin_counts[it]++;
            }
        }
    }

    // From total palindromes, get proper border count
    // minus end characters of substrings
    for ( auto it = palin_counts.begin(); it != palin_counts.end(); ++it )
    {
        long long top = it->second - 1;  // long long so that top * (top + 1) cannot overflow int

        palin_count += (top * (top + 1)) / 2;
        palin_count %= mod;
    }

    // Store string character counts in unordered map
    for (int i = 0; i < n; i++)
    {
        char c = s[i];

        //long long hsh = hash<char>{}(c);

        if (end_matches[c] == 0)
            end_matches[c] = 1;
        else
            end_matches[c]++;

    }

    // From substring end character matches, get proper border count
    // for end characters of substrings
    for ( auto it = end_matches.begin(); it != end_matches.end(); it++ )
    {
        long long f = it->second - 1;  // long long: a character can occur up to 100,000 times
        count += (f * (f + 1)) / 2;
    }

    cout << (count + palin_count) % mod << endl;

    for (int i = 0; i < MAX+5; i++)
        palins[i].clear();
}

int main()
{

    string s; 
    cin >> s;

    solve_all_not_same(s);

    return 0;
}

1 answer:

Answer 0 (score: 2)

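Faced with problem X (finding all palindromic substrings), you are asking how to quickly solve Y (hashing substrings): The XY Problem.
For palindrome detection, consider suffix arrays (one for the reverse of the input, or one for the reverse appended to the input). For fast hashing of overlapping strings, look at rolling hashes.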
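
To make the rolling-hash suggestion concrete, here is a minimal sketch, assuming a polynomial hash taken under two moduli. The names `SubstringHasher` and `range_hash`, and the particular bases and moduli, are my own illustrative choices and do not come from the question. After an O(n) precomputation, the hash of any substring is derived from its index range in O(1), so no `substr` copy is needed. Note that no fixed-width hash can be truly collision-free; combining two moduli only makes collisions very unlikely, and fixed bases can still be defeated by adversarial input.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Illustrative constants (assumptions, not from the question):
// two primes as moduli and two bases larger than any shifted byte value.
constexpr std::uint64_t kMod1 = 1000000007ULL;
constexpr std::uint64_t kMod2 = 998244353ULL;
constexpr std::uint64_t kBase1 = 257ULL;
constexpr std::uint64_t kBase2 = 263ULL;

// Hypothetical helper: prefix hashes of s under two moduli, so the hash of
// s[begin, begin + len) can be computed in O(1) from the index range alone.
struct SubstringHasher {
    std::vector<std::uint64_t> pre1, pre2;  // pre[i] = hash of s[0, i)
    std::vector<std::uint64_t> pow1, pow2;  // pow[i] = base^i mod m

    explicit SubstringHasher(const std::string& s)
        : pre1(s.size() + 1, 0), pre2(s.size() + 1, 0),
          pow1(s.size() + 1, 1), pow2(s.size() + 1, 1) {
        for (std::size_t i = 0; i < s.size(); ++i) {
            // Shift by one so that no character maps to zero.
            const std::uint64_t c = static_cast<unsigned char>(s[i]) + 1ULL;
            pre1[i + 1] = (pre1[i] * kBase1 + c) % kMod1;
            pre2[i + 1] = (pre2[i] * kBase2 + c) % kMod2;
            pow1[i + 1] = pow1[i] * kBase1 % kMod1;
            pow2[i + 1] = pow2[i] * kBase2 % kMod2;
        }
    }

    // Combined 64-bit hash of the substring starting at begin with length len.
    // Requires begin + len <= length of the original string.
    std::uint64_t range_hash(std::size_t begin, std::size_t len) const {
        const std::uint64_t h1 =
            (pre1[begin + len] + kMod1 - pre1[begin] * pow1[len] % kMod1) % kMod1;
        const std::uint64_t h2 =
            (pre2[begin + len] + kMod2 - pre2[begin] * pow2[len] % kMod2) % kMod2;
        return (h1 << 32) | h2;  // both halves are below 2^30, so they do not overlap
    }
};
```

In `get_palins`, a `SubstringHasher` could be built once from `s` after the guards are removed, and the `substr` plus `hash<string>` pair replaced by something like `hasher.range_hash(begin, end_count)`, cast to the `long long` that `palins` stores. Equal substrings always get equal hashes; unequal substrings get different hashes only with high probability, so if exact uniqueness is required the candidates still have to be compared directly.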