Question

Imagine the sequence is Pi:

... 141592653589793238462643383279502884197

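Pi is stored in a text file.

I want to find a similar subsequence in Pi, for example one with 80% similarity.

For example, I want to find 33384 in Pi, so the following should match (4 of the 5 digits agree):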

14159265358979 32384 62643383279502884197 ....

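The number of digits is in the millions.

I need an efficient algorithm to search for these similarities.

Should I use a database instead of a file?

Any ideas are appreciated.

Edit:

I found some algorithms. I need to check them, and then I will report the results.

BTW, the algorithm is Knuth–Morris–Pratt.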

Answer 1

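You can walk through the pi sequence and extract every subsequence of M characters (M is the search length). Then compare each subsequence with the search string.

To compare, simply XOR the search string with the subsequence and count the bytes that are not 0 after the XOR. That count is the number of differences. Comparing the difference count with the search string length gives the percentage of difference.

If the difference is acceptable, you have found a similar substring.

Update:

You will get N-M+1 substrings, and each comparison has complexity O(M), so the whole search is O(N*M). N is the pi string length and M is the substring length.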
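The XOR comparison described above can be sketched in C++ roughly as follows. This is only an illustration under assumptions: the function names `xorMismatches` and `findSimilar` are made up for this example, the digits are assumed to be already loaded into a `std::string`, and "similarity" is taken to mean the fraction of positions that agree.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: number of positions at which the window
// text[pos, pos + pat.size()) differs from pat.
std::size_t xorMismatches(const std::string& text, std::size_t pos,
                          const std::string& pat) {
    std::size_t diff = 0;
    for (std::size_t i = 0; i < pat.size(); ++i) {
        // The XOR of two bytes is zero exactly when the bytes are equal.
        if ((text[pos + i] ^ pat[i]) != 0) ++diff;
    }
    return diff;
}

// Hypothetical helper: start positions of all windows whose similarity
// to pat is at least minSimilarity (for example 0.8).
std::vector<std::size_t> findSimilar(const std::string& text,
                                     const std::string& pat,
                                     double minSimilarity) {
    std::vector<std::size_t> hits;
    const std::size_t m = pat.size();
    if (m == 0 || text.size() < m) return hits;
    for (std::size_t pos = 0; pos + m <= text.size(); ++pos) {
        const std::size_t diff = xorMismatches(text, pos, pat);
        const double similarity = 1.0 - static_cast<double>(diff) / m;
        // The small epsilon guards against floating-point rounding.
        if (similarity >= minSimilarity - 1e-9) hits.push_back(pos);
    }
    return hits;
}
```

Calling `findSimilar(pi, "33384", 0.8)` on the 39 digits from the question should report the window `32384`, which differs from the search string in one position.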

Answer 2

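I think a C++ program for this can be as simple as the one below. This is my own algorithm, the way I see the problem. I tried to optimise it a little: the inner for loop stops as soon as it is certain that there are already too many mismatches, so the rest of the search string does not need to be tested: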

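#include <cmath>
#include <cstring>
#include <iostream>
#include <string>

int main()
{
   const char *input = "141592653589793238462643383279502884197";
   const char *test  = "33384";
   double pc   = .8;

   int len      = static_cast<int>(std::strlen(test));
   int inputLen = static_cast<int>(std::strlen(input));
   // minimum number of matching digits; the epsilon guards against rounding error
   int good     = static_cast<int>(std::ceil(len * pc - 1e-9));

   // pos + len <= inputLen so that the last window is checked as well
   for(int pos = 0; pos + len <= inputLen; ++pos)
   {
       int matches = 0;
       for(int pos1 = 0; pos1 < len; ++pos1)
       {
           if(test[pos1] == input[pos + pos1]) matches++;

           // small optimisation: the digits still unchecked cannot reach `good`
           if(len - pos1 - 1 < good - matches)
                break; // exiting earlier. no reason to stay in the loop
       }

       if(matches >= good)
       {
            // print the position and only the matching window
            std::cout << pos << " " << std::string(input + pos, len) << std::endl;
       }
   }
   return 0;
}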

Please test it with your real data and report how fast it is.
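One way to take such a measurement is sketched below. It is not part of the answer's program: `countSimilar`, `randomDigits` and `timeSearchMs` are names invented for this sketch, and the pseudo-random digits are only a stand-in for the real pi file, so the timings it gives are indicative at best.

```cpp
#include <chrono>
#include <cstddef>
#include <random>
#include <string>

// Count the windows of `text` that agree with `pat` in at least `good`
// positions, using the same early exit as the loop above.
std::size_t countSimilar(const std::string& text, const std::string& pat,
                         std::size_t good) {
    const std::size_t m = pat.size();
    if (m == 0 || text.size() < m) return 0;
    std::size_t hits = 0;
    for (std::size_t pos = 0; pos + m <= text.size(); ++pos) {
        std::size_t matches = 0;
        for (std::size_t i = 0; i < m; ++i) {
            if (text[pos + i] == pat[i]) ++matches;
            // Stop when the unchecked digits cannot reach `good` any more.
            if (m - i - 1 + matches < good) break;
        }
        if (matches >= good) ++hits;
    }
    return hits;
}

// Stand-in test data (assumption): n pseudo-random decimal digits.
std::string randomDigits(std::size_t n, unsigned seed) {
    std::mt19937 rng(seed);
    std::uniform_int_distribution<int> digit(0, 9);
    std::string s(n, '0');
    for (char& c : s) c = static_cast<char>('0' + digit(rng));
    return s;
}

// Run one search, store the hit count in hitsOut, and return the
// elapsed wall-clock time in milliseconds.
double timeSearchMs(const std::string& text, const std::string& pat,
                    std::size_t good, std::size_t& hitsOut) {
    const auto start = std::chrono::steady_clock::now();
    hitsOut = countSimilar(text, pat, good);
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}
```

For a rough figure, something like `timeSearchMs(randomDigits(5000000, 1u), "33384", 4, hits)` could be run a few times. Real measurements should use the actual file contents and an optimised build.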

Answer 3

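I will take the liberty of giving my own definition of 80% similarity: for a given sequence, when a position in the sequence is chosen at random, there is an 80% chance that the digit at that position matches the digit at the same position in the target sequence.

Given that definition and its rather fuzzy nature, I propose a relatively efficient heuristic algorithm that will usually return correct results for reasonably long subsequences:

1. Iterate through each digit of pi (up to the given length). For each one, sample a few digits (how many depends on the length of the subsequence) at random offsets from the current position, with offsets ranging from 0 to the subsequence length - 1. If you cannot conclude with reasonable statistical significance that the match rate in this region is below 80%, mark the position to come back to later.
2. Retest each marked region with a larger sample size, until the sample size approaches the length of the subsequence.
3. Verify the remaining regions to see which ones really match at least 80%.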
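The steps above could be sketched in C++ along these lines. This is a simplified illustration, not the answer's exact procedure: it does a single sampling pass followed directly by exact verification (the progressive retesting of step 2 is left out), the rejection rule "fewer than half of the expected matches in the sample" is an arbitrary assumption rather than a real significance test, and the names `verifies` and `sampledSearch` are invented here. Because the filter is probabilistic, a true match can occasionally be discarded.

```cpp
#include <cstddef>
#include <random>
#include <string>
#include <vector>

// Exact check (step 3): does the window at pos agree with pat in at
// least `threshold` of its positions?
bool verifies(const std::string& text, const std::string& pat,
              std::size_t pos, double threshold) {
    std::size_t matches = 0;
    for (std::size_t i = 0; i < pat.size(); ++i)
        if (text[pos + i] == pat[i]) ++matches;
    return static_cast<double>(matches) >=
           threshold * static_cast<double>(pat.size()) - 1e-9;
}

// Sampling filter (step 1) followed by exact verification of survivors.
std::vector<std::size_t> sampledSearch(const std::string& text,
                                       const std::string& pat,
                                       double threshold,
                                       std::size_t samples,
                                       unsigned seed) {
    std::vector<std::size_t> hits;
    if (pat.empty() || text.size() < pat.size()) return hits;
    std::mt19937 rng(seed);
    std::uniform_int_distribution<std::size_t> offset(0, pat.size() - 1);
    for (std::size_t pos = 0; pos + pat.size() <= text.size(); ++pos) {
        // Compare a few randomly chosen offsets inside the window.
        std::size_t sampleMatches = 0;
        for (std::size_t s = 0; s < samples; ++s) {
            const std::size_t o = offset(rng);
            if (text[pos + o] == pat[o]) ++sampleMatches;
        }
        // Assumed rejection rule: drop the window when the sample shows
        // fewer than half of the matches a true hit would be expected to give.
        if (static_cast<double>(sampleMatches) <
            0.5 * threshold * static_cast<double>(samples))
            continue;
        if (verifies(text, pat, pos, threshold)) hits.push_back(pos);
    }
    return hits;
}
```

Every position returned has passed the exact check, so the result contains no false positives. The sampling only pays off when the subsequence is much longer than the sample size, which fits the answer's remark about reasonably long subsequences.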

Answer 4

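Here is a code sample I created in C# that takes pi as a string. You can convert this to a stream or something else, depending on how you obtain pi. It returns a list of the partially matching strings.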

    [TestMethod]
    public void TestMethod1()
    {
        string pi = "141592653589793238462643383279502884197";
        string searchStr = "33384";
        double matchPercentage = 0.8;
        var matchedStrings = GetMatchedStrings(searchStr, matchPercentage, pi);
        Assert.AreEqual(1, matchedStrings.Count);

        searchStr = "141";
        matchPercentage = 0.6666;
        matchedStrings = GetMatchedStrings(searchStr, matchPercentage, pi);
        Assert.AreEqual(2, matchedStrings.Count);
    }

    private static List<string> GetMatchedStrings(string searchStr, double matchPercentage, string pi)
    {
        int currentPosition = 0;
        int searchLength = searchStr.Length;
        int minMatchCount = Convert.ToInt32(searchLength*matchPercentage);
        List<string> matchedStrings = new List<string>();

        while (currentPosition + searchLength <= pi.Length)
        {
            int matchedCount = 0;
            int checkedCount = 0;
            while (searchLength - checkedCount + matchedCount >= minMatchCount)
            {
                if (searchStr[checkedCount] == pi[currentPosition + checkedCount])
                    matchedCount++;
                if (matchedCount >= minMatchCount)
                {
                    matchedStrings.Add(pi.Substring(currentPosition, searchLength));
                    break;
                }
                checkedCount ++;
            }
            currentPosition ++;
        }
        return matchedStrings;
    }

Answer 5

declare @string varchar(500) = '141592653589793238462643383279502884197'

DECLARE @start int

set @start = patindex('%32384%', @string)

print @start

--get start position

Note: if the value is not found, refer to the PATINDEX documentation (it returns 0 in that case).

Find similar subsequences in a very large sequence