Duplicate text finding

Time: 2009-05-13 18:54:17

Tags: c# c++ text compression duplicates

My main problem is finding a suitable way to automatically turn this, for example:

d+c+d+f+d+c+d+f+d+c+d+f+d+c+d+f+

into this:

[d+c+d+f+]4

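That is, find repeats that sit directly next to each other and collapse them into a shorter "loop". So far I have not found a suitable solution, and I look forward to your responses.

P.S. To avoid confusion, the sample above is not the only thing that needs "looping"; it differs from file to file. Oh, and this is for a C++ or C# program; either is fine, although I am open to other suggestions too. Also, the main idea is that all of the work is done by the program itself, with no user input other than the file itself.

Here is the full file, for reference; my apologies for stretching the page:

#0 @ 16 v225 y10 w250 t76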

L16 $ ED $ EF $ A9 p20,20 > ecegb> d< bgbgecgec<克 > d +&LT b取代; d + F + A +&以及c +< A + F + A + F + d + LT b取代; F + d + LT; BF + &以及c&LT a取代; cegbgegec&LT a取代; EC< AE > d + C + d + F + d + C + d + F + d + C + d + F + d + C + d + F + R1 ^ 1

/ L8 r1r1r1r1 F +< A +> F + G + CG + R4 A + C + A + G + CG + R4F + LT; A +> F + G + CG + R4 A + C + A + G + CG + R4F + LT; A +> F + G + CG + R4 A + C + A + G + CG + R4 F +< A +> F + G + CG + R4 A + C + A + G + r4g + 16f16c + 一个+ 2 ^ G + F + G + 4 F + FF + 4FD + F4 d + C + d + 4C +℃下A + 2 ^ 4 > C4D + < G + 2 ^ 4R4 ^ 一个+以及c + d + 4G + 4A + 4 R1 ^ 2 ^ 4 ^一个+ 2 ^ G + F + G + 4 F + FF + 4FD + F4 d + C + d + 4C +℃下A + 2 ^ 4 > C4D + < G + 2 ^ 4R4 ^ 一个+以及c + d + 4G + 4A + 4 R1 ^ 2 ^ 4 ^ r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1

#4 @ 22 v250 y10

L8 O3 RG + RG + RG + RG + RG + RG + RG + RG + RG + RG + RG + RG + RG + RG + RG + RG + RG + RG + RG + RG + RG + RG + RG + RG + / r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1

#2 @ 4 v155 y10

L8 $ ED $ F8 $ 8F O4 r1r1r1 d + 4f4f + 4G + 4 一个+ 4R1 ^ 4 ^ 2 / d + 4 ^ FR2 F + 4 ^ fr2d + 4 ^ FR2 F + 4 ^ fr2d + 4 ^ FR2 F + 4 ^ fr2d + 4 ^ FR2 F + 4 ^ FR2 > d + 4 ^ FR2 F + 4 ^ fr2d + 4 ^ FR2 F + 4 ^ FR2 < F + 4 ^ G + R2 F + 4 ^ fr2f + 4 ^ G + R2 F + 4 ^ fr2f + 4 ^ G + R2 F + 4 ^ fr2f + 4 ^ G + R2 F + 4 ^ fr2f + 4 ^ G + R2 F + 4 ^ fr2f + 4 ^ G + R2 F + 4 ^ fr2f + 4 ^ G + R2 F + 4 ^ fr2f + 4 ^ G + R2 F + 4 ^ FR2 > 一个+ 4 ^ G + R2 F + 1A + 4 ^ G + R2 F + 1 F + 4 ^ FR2 d + 1 F + 4 ^ FR2 d + 2 ^ d + 4 ^ r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1

#3 @ 10 v210 y10

R1 ^ 1 O3 c8r8d8r8 c8r8c8r8c8r8c8r8c8r8c8r8c8r8c8r8c8r8c8r8c8r8 C8 @ 10d16d16 @ 21 C8 @ 10d16d16 @ 21 C8 @ 10d16d16 @ 21 / C4 @ 10d8 @ 21c8< B8> C8 @ 10d8 @ 21c8c4 @ 10d8 @ 21c8< B8> C8 @ 10d8 @ 21c8c4 @ 10d8 @ 21c8< B8> C8 @ 10d8 @ 21c8c4 @ 10d8 @ 21c8< B8> C8 @ 10d8 @ 21c8c4 @ 10d8 @ 21c8< B8> C8 @ 10d8 @ 21c8c4 @ 10d8 @ 21c8< B8> C8 @ 10d8 @ 21c8c4 @ 10d8 @ 21c8< B8> C8 @ 10d8 @ 21c8c4 @ 10d8 @ 21c8< B8> C8 @ 10d8 @ 21c8c4 @ 10d8 @ 21c8< B8> C8 @ 10d8 @ 21c8c4 @ 10d8 @ 21c8< B8> C8 @ 10d8 @ 21c8c4 @ 10d8 @ 21c8< B8> C8 @ 10d8 @ 21c8 C4 @ 10d8 @ 21c8< B8> @ 10d16d16d16d16d16r16 C4 @ 10d8 @ 21c8< B8> C8 @ 10d8 @ 21c8c4 @ 10d8 @ 21c8< B8> C8 @ 10d8 @ 21c8c4 @ 10d8 @ 21c8< B8> C8 @ 10d8 @ 21c8c4 @ 10d8 @ 21c8< B8> C8 @ 10d8 @ 21c8c4 @ 10d8 @ 21c8< B8> C8 @ 10d8 @ 21c8c4 @ 10d8 @ 21c8< B8> C8 @ 10d8 @ 21c8c4 @ 10d8 @ 21c8< B8> C8 @ 10d8 @ 21c8c4 @ 10d8 @ 21c8< B8> C8 @ 10d8 @ 21c8c4 @ 10d8 @ 21c8< B8> C8 @ 10d8 @ 21c8c4 @ 10d8 @ 21c8< B8> C8 @ 10d8 @ 21c8c4 @ 10d8 @ 21c8< B8> C8 @ 10d8 @ 21c8c4 @ 10d8 @ 21c8< B8> C8 @ 10d8 @ 21c8c4 @ 10d8 @ 21c8< B8> C8 @ 10d8 @ 21c8c4 @ 10d8 @ 21c8< B8> C8 @ 10d8 @ 21c8c4 @ 10d8 @ 21c8< B8> C8 @ 10d8 @ 21c8c4 @ 10d8 @ 21c8< B8> C8 @ 10d8 @ 21c8c4 @ 10d8 @ 21c8< B8> C8 @ 10d8 @ 21c8c4 @ 10d8 @ 21c8< B8> C8 @ 10d8 @ 21c8c4 @ 10d8 @ 21c8< B8> C8 @ 10d8 @ 21c8c4 @ 10d8 @ 21c8< B8> C8 @ 10d8 @ 21c8c4 @ 10d8 @ 21c8< B8> C8 @ 10d8 @ 21c8c4 @ 10d8 @ 21c8< B8> C8 @ 10d8 @ 21c8c4 @ 10d8 @ 21c8< B8> C8 @ 10d8 @ 21c8 C4 @ 10d8 @ 21c8 @ 10b16b16> c16c16< b16b16a16a16

#7 @ 16 v230 y10

L16 $ ED $ EF $ A9 cceeggbbggeeccee < BB> d + d + F + F + A + A + F + F + d + d + LT; BB> d + d + < AA> cceeggeecc< AA> CC < G + G + BB> d + d + + FFD d + LT; BBG + G + BB / r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1

#5 @ 4 v155 y10

L8 $ ED $ F8 $ 8F O4 r1r1r1r1 d + 4R1 ^ 2 ^ 4 / <一个+ 4 ^> CR2 C + 4 ^ CR2<一个+ 4 ^> CR2 C + 4 ^ CR2<一个+ 4 ^> CR2 C + 4 ^ CR2<一个+ 4 ^> CR2 C + 4 ^ CR2 一个+ 4 ^> CR2 C + 4 ^ CR2 <一个+ 4 ^> CR2 C + 4 ^ C r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1 R2 F + 4 ^ FR2 d + 1F + 4 ^ FR2 d + 1 C + 4 ^ CR2 < A + 1 &以及c + 4 ^ CR2 < A + 2 ^一个+ 4 ^ r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1

4 Answers:

Answer 0 (score: 2)

You could use the Smith-Waterman algorithm for local alignment, comparing the string against itself.

http://en.wikipedia.org/wiki/Smith-Waterman_algorithm

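Edit: To adapt the algorithm for self-alignment, you need to force the values on the diagonal to zero, that is, penalize the trivial solution of aligning the whole string with itself. The "second-best" alignment will then pop out instead. It will be the longest pair of matching substrings. Repeat the same procedure to find progressively shorter matching substrings.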
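
As a rough illustration of the self-alignment idea, here is a minimal C++ sketch. It fills only the cells above the main diagonal, so the diagonal stays at zero and the trivial whole-string alignment can never score. The `SelfAlignment` type, the function name `bestSelfAlignment`, and the scoring values (match +2, mismatch -1, gap -1) are assumptions made for this example, not part of the answer. It reports only where the best alignment ends and omits the traceback.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical result type: where the best off-diagonal local alignment ends.
struct SelfAlignment {
    int score;  // best Smith-Waterman score found above the diagonal
    int endA;   // one past the last character of the first copy
    int endB;   // one past the last character of the second copy
};

// Smith-Waterman of `s` against itself. Only cells with j > i are filled,
// so the diagonal and lower triangle remain zero.
// Scoring values are illustrative assumptions, not taken from the answer.
SelfAlignment bestSelfAlignment(const std::string& s) {
    const int match = 2, mismatch = -1, gap = -1;
    const int n = static_cast<int>(s.size());
    std::vector<std::vector<int>> h(n + 1, std::vector<int>(n + 1, 0));
    SelfAlignment best{0, 0, 0};
    for (int i = 1; i <= n; ++i) {
        for (int j = i + 1; j <= n; ++j) {
            int diag = h[i - 1][j - 1] + (s[i - 1] == s[j - 1] ? match : mismatch);
            int up = h[i - 1][j] + gap;    // gap in the second copy
            int left = h[i][j - 1] + gap;  // gap in the first copy
            h[i][j] = std::max({0, diag, up, left});
            if (h[i][j] > best.score) {
                best.score = h[i][j];
                best.endA = i;
                best.endB = j;
            }
        }
    }
    return best;
}
```

For an exact adjacent repeat, `endB - endA` gives the distance between the two copies, which is a candidate loop length. The table is quadratic in the input length, so a full file would probably need to be processed in pieces.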

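Answer 1 (score: 2)

Not sure if this is what you are looking for.

I took the string "testtesttesttest4notaduped+c+d+f+d+c+d+f+d+c+d+f+d+c+d+f+testtesttest" and converted it to "[test]4 4notadupe[d+c+d+f+]4 [test]3 ".

I am sure someone will come up with a more efficient solution, since this will be a bit slow when processing a full file. I look forward to the other answers.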

        string stringValue = "testtesttesttest4notaduped+c+d+f+d+c+d+f+d+c+d+f+d+c+d+f+testtesttest";

        for (int i = 0; i < stringValue.Length; i++)
        {
            // Try each candidate unit length k that leaves room for a second copy.
            for (int k = 1; (k * 2) + i <= stringValue.Length; k++)
            {
                string unit = stringValue.Substring(i, k);

                // Count how many copies of the unit follow one another from position i.
                int count = 1;
                while (i + (count + 1) * k <= stringValue.Length
                       && stringValue.Substring(i + count * k, k) == unit)
                {
                    count++;
                }

                if (count > 1)
                {
                    // The trailing space keeps the count from running into a
                    // following digit, e.g. "[test]4" then "4" reading as "[test]44".
                    string addString = "[" + unit + "]" + count + " ";

                    // Only replace if doing so actually saves space.
                    if (addString.Length < unit.Length * count)
                    {
                        stringValue = stringValue.Remove(i, count * unit.Length);
                        stringValue = stringValue.Insert(i, addString);
                        i = i + addString.Length - 1; // resume after the inserted text
                        break;
                    }
                    // Otherwise keep trying longer units: a short repeat such as
                    // "cc" must not hide a longer one that starts at the same place.
                }
            }
        }
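
One way to sanity-check output in this notation is to expand the loops again and compare the result with the original text. The C++ sketch below does that under stated assumptions: groups look like `[unit]N ` with one trailing space, as produced by the code in this answer, groups are not nested, and the input contains no literal `[`. The name `expandLoops` is made up for this example.

```cpp
#include <cctype>
#include <string>

// Expands "[unit]N " groups back into N copies of unit.
// Assumptions: no nested groups, no literal '[' in the text, and a single
// separator space after each count (the format used in the answer above).
std::string expandLoops(const std::string& text) {
    std::string out;
    std::size_t i = 0;
    while (i < text.size()) {
        if (text[i] != '[') {            // ordinary character: copy it
            out += text[i++];
            continue;
        }
        std::size_t close = text.find(']', i);
        if (close == std::string::npos) { // unterminated group: copy the rest
            out.append(text, i, std::string::npos);
            break;
        }
        std::string unit = text.substr(i + 1, close - i - 1);
        std::size_t p = close + 1;
        int count = 0;
        while (p < text.size() && std::isdigit(static_cast<unsigned char>(text[p]))) {
            count = count * 10 + (text[p] - '0');
            ++p;
        }
        if (p == close + 1) {             // no count follows: copy literally
            out.append(text, i, close + 1 - i);
            i = close + 1;
            continue;
        }
        for (int n = 0; n < count; ++n) {
            out += unit;
        }
        if (p < text.size() && text[p] == ' ') {
            ++p;                          // drop the separator space
        }
        i = p;
    }
    return out;
}
```

Calling `expandLoops("[test]4 4notadupe[d+c+d+f+]4 [test]3 ")` reproduces the input string from this answer. A space that originally followed a group would also be dropped, so real data would need an unambiguous separator.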

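Answer 2 (score: 1)

LZW may help: it uses a dictionary of prefixes to search for repeating patterns and replaces such data with references to earlier entries. I think it should not be hard to adapt it to your needs.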
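
For reference, a minimal textbook-style LZW encoder in C++ might look like the sketch below. It only shows how the dictionary grows as repeated patterns are seen and how later occurrences become references to earlier entries. It emits integer codes, not the `[...]N` loop notation from the question, so adapting it is left open. The function name `lzwEncode` and the initial dictionary of 256 single-byte entries are assumptions of this example.

```cpp
#include <string>
#include <unordered_map>
#include <vector>

// Textbook LZW encoding: returns a list of dictionary codes.
// Codes 0..255 stand for single bytes; longer patterns get codes from 256 up.
std::vector<int> lzwEncode(const std::string& input) {
    std::unordered_map<std::string, int> dict;
    for (int c = 0; c < 256; ++c) {
        dict[std::string(1, static_cast<char>(c))] = c;
    }
    int nextCode = 256;

    std::vector<int> out;
    std::string w;  // longest match found so far
    for (char ch : input) {
        std::string wc = w + ch;
        if (dict.count(wc)) {
            w = wc;                      // keep extending the current match
        } else {
            out.push_back(dict[w]);      // emit a reference to the known prefix
            dict[wc] = nextCode++;       // remember the new, longer pattern
            w = std::string(1, ch);
        }
    }
    if (!w.empty()) {
        out.push_back(dict[w]);          // flush the final match
    }
    return out;
}
```

For the input `"abababab"` this produces the codes 97, 98, 256, 258, 98: the later `ab` and `aba` runs are replaced by references to dictionary entries created earlier in the same pass.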

Answer 3 (score: 0)

Why not use System.IO.Compression?