Question

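I need to detect the presence of multiple blocks of columnar data given only their headings. Nothing else is known about the data except the heading words, which are different for every set of data.

Importantly, it is not known beforehand how many words are in each block, nor, therefore, how many blocks there are.

Equally important, the word list is always relatively short: fewer than 20 words.

So, given a list or array of heading words such as: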

Opt
Object
Type
Opt
Object
Type
Opt
Object
Type

what is the most processing-efficient way to determine that it consists entirely of the repeating sequence:

Opt
Object
Type

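It must be an exact match, so my first thought is to search [1+] for matches to [0], calling their indexes n, m, ... Then, if they are equidistant, check that [1] == [n + 1] == [m + 1], [2] == [n + 2] == [m + 2], and so on.

Edit: It must also work for word sets in which some words are themselves repeated within a block, so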

Opt
Opt
Object
Opt
Opt
Object

is a set of two blocks of

Opt
Opt
Object

Answer 1

Can the unit sequence contain repetitions of its own? Do you know the length of the unit sequence?

For example:

ABCABCABCDEFABCABCABCDEFABCABCABCDEF

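Here the unit sequence is ABCABCABCDEF.

If the answer is yes, I think you have a difficult problem, unless you know the length of the unit sequence. In that case the solution is trivial: build a state machine that first stores the unit sequence, then verifies that each element of the rest of the sequence matches the corresponding element of the unit sequence.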
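For the known-length case, a minimal sketch might look like the following. The class and method names (KnownLengthCheck, isRepetitionOf) are made up for illustration, and the check is written as a plain loop over an in-memory list rather than as an explicit state machine.

```java
import java.util.List;
import java.util.Objects;

final class KnownLengthCheck {

    /**
     * Returns true if the list consists of one or more complete copies
     * of its first unitLength elements.
     */
    static <T> boolean isRepetitionOf(List<T> words, int unitLength) {
        if (unitLength <= 0 || words.isEmpty() || words.size() % unitLength != 0) {
            return false; // the unit must divide the list evenly
        }
        // Every element after the first unit must equal the element
        // exactly one unit length before it.
        for (int i = unitLength; i < words.size(); i++) {
            if (!Objects.equals(words.get(i), words.get(i - unitLength))) {
                return false;
            }
        }
        return true;
    }
}
```

Called with the six headings Opt, Object, Type, Opt, Object, Type and a unit length of 3, this returns true; with a unit length of 2 it returns false.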

If the answer is no, use this variant of Floyd's cycle-finding algorithm to identify the unit sequence:

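Initialize pointers P1 and P2 to the beginning of the sequence.
For each new element, increment P1 every time and increment P2 every other time (keep a counter to do this).
If P1 and P2 point to identical elements, you have found a unit sequence.
Now run through the rest of the sequence to verify that it consists of repetitions of that unit.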
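The steps above could be sketched roughly as follows for a list that is already in memory. This is one reading of the description, not a definitive implementation: it assumes the words inside a unit are all distinct, takes the distance between the two pointers at their first match as the candidate unit length, and uses invented names (FloydStyleUnitFinder, findUnitLength).

```java
import java.util.List;
import java.util.Objects;

final class FloydStyleUnitFinder {

    /**
     * Returns the unit length if the list is a whole number of copies of a
     * unit whose words are all distinct; returns the list size if the two
     * pointers never meet on equal words; returns -1 if they do meet but
     * the list is not an exact repetition of the candidate unit.
     */
    static <T> int findUnitLength(List<T> words) {
        int n = words.size();
        if (n == 0) {
            return 0;
        }
        int candidate = n; // no repeat seen yet: the whole list is one block
        for (int p1 = 1; p1 < n; p1++) {
            int p2 = p1 / 2; // P2 advances on every other step of P1
            if (Objects.equals(words.get(p1), words.get(p2))) {
                candidate = p1 - p2; // distance between the pointers
                break;
            }
        }
        if (n % candidate != 0) {
            return -1; // the candidate unit does not divide the list evenly
        }
        // Verification pass: each element must equal the one a unit earlier.
        for (int i = candidate; i < n; i++) {
            if (!Objects.equals(words.get(i), words.get(i - candidate))) {
                return -1;
            }
        }
        return candidate;
    }
}
```

For Opt, Object, Type repeated three times this returns 3. For Opt, Opt, Object, Opt, Opt, Object it returns -1, because the two adjacent Opt entries produce a candidate length of 1; that is the case the simple variant is not meant to handle.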

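Update: You have clarified your question to state that the unit sequence may contain repetitions of its own. In that case, still use the cycle-finding algorithm, but note that it is only guaranteed to find potential cycles. Keep it running over the whole length of the sequence, and use the following state machine, starting in state 1:

State 1: No working cycle found; keep looking. When the cycle-finding algorithm finds a potential cycle, verify that you have two copies of a preliminary unit sequence from P, and go to state 2. If you reach the end of the input, go to state 4.

State 2: Preliminary unit sequence found. Run through the input for as long as the cycle repeats identically. If you reach the end of the input, go to state 3. If you find an input element that differs from the corresponding element of the unit sequence, go back to state 1.

State 3: The input is a repetition of the unit sequence if the end of the input consists of complete repetitions of the unit sequence. (If the input ends midway through a unit sequence, e.g. ABCABCABCABCAB, then a unit sequence has been found, but the input does not consist of complete repetitions.)

State 4: No unit sequence found.

In my example (repeating ABCABCABCDEF), the algorithm starts by finding ABCABC, which puts it in state 2. It stays there until it hits the first DEF, which puts it back in state 1. It then probably jumps back and forth between states 1 and 2 until it reaches the second ABCABCABCDEF, at which point it re-enters state 2, and at the end of the input it is in state 3.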

Answer 2

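If the list is made up of x repeating groups, each containing n elements...

We know there is at least one group, so we first check whether there are two repeating groups by comparing the first half of the list with the second half.

1) If the two halves are equal, we know the solution has 2 as a factor.

2) If they are not, we move on to the next larger prime that evenly divides the total number of words...

At each step we check the sublists for equality; if they are equal, we know we have a solution with that factor in it.

We want to return the shortest list of words that repeats, so after finding the first prime for which the sublists are equal we keep factoring as far as possible.

We therefore apply the same procedure to a single sublist, knowing that all the sublists are equal... so the problem is best solved recursively. That is, we only need to consider the current sublist in isolation.

The solution would be extremely efficient if loaded with a short table of primes... Beyond that table it would be necessary to compute primes, but the list would have to be far from trivial to exhaust even a table of only a few dozen primes.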
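The recursive idea could be sketched like this. The names (PrimeFactorUnitFinder, unitOf, SMALL_PRIMES) are invented for illustration, and the hard-coded prime table is an assumption: it covers every list of fewer than 23 words, which is enough for the "fewer than 20" limit in the question but not for arbitrary input.

```java
import java.util.List;
import java.util.Objects;

final class PrimeFactorUnitFinder {

    // Assumed short prime table; sufficient while the list has fewer than 23 words.
    private static final int[] SMALL_PRIMES = {2, 3, 5, 7, 11, 13, 17, 19};

    /** Returns the shortest prefix whose exact repetition makes up the whole list. */
    static <T> List<T> unitOf(List<T> words) {
        int n = words.size();
        for (int p : SMALL_PRIMES) {
            if (p > n) {
                break;
            }
            if (n % p == 0 && splitsIntoEqualParts(words, p)) {
                // All p parts are equal, so only the first part needs further study.
                return unitOf(words.subList(0, n / p));
            }
        }
        return words; // no prime splits the list into equal parts
    }

    /** True if the list is made of the given number of identical consecutive parts. */
    private static <T> boolean splitsIntoEqualParts(List<T> words, int parts) {
        int len = words.size() / parts;
        for (int i = len; i < words.size(); i++) {
            if (!Objects.equals(words.get(i), words.get(i - len))) {
                return false;
            }
        }
        return true;
    }
}
```

The returned list is a subList view of the input, so nothing is copied. For Opt, Opt, Object, Opt, Opt, Object the result is Opt, Opt, Object; for a list with no repetition the result is the list itself.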

Answer 3

My solution, which works as required, is perhaps naive. It does have the advantage of being simple.

String[]                            wta;                                    // word text array
...
int                                 ivl=wta.length;                         // block length; assumed default: the whole list is one block
INTERVAL:
for(int xa=1,max=(wta.length/2); xa<=max; xa++) {
    if((wta.length%xa)!=0) { continue; }                                    // ignore intervals which don't divide evenly into the words
    for(int xb=0; xb<xa; xb++) {                                            // iterate the words within the current interval
        for(int xc=xb+xa; xc<wta.length; xc+=xa) {                          // iterate the corresponding words in each section
            if(!wta[xb].equalsIgnoreCase(wta[xc])) { continue INTERVAL; }   // not a cycle (case-insensitive; use equals() for a strict match)
            }
        }
    ivl=xa;                                                                 // smallest interval that repeats across the whole list
    break;
    }
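For anyone who wants to try the loop directly, here is a hedged, self-contained variant wrapped in a method. The class and method names (IntervalFinder, findInterval) are made up, and it compares with equals() rather than equalsIgnoreCase() on the assumption that the question's "exact match" is meant to be case-sensitive.

```java
final class IntervalFinder {

    /**
     * Returns the length of the shortest block that, repeated a whole number
     * of times, reproduces the array; returns the array length if there is
     * no shorter repeating block.
     */
    static int findInterval(String[] words) {
        INTERVAL:
        for (int len = 1; len <= words.length / 2; len++) {
            if (words.length % len != 0) {
                continue; // the block length must divide the word count
            }
            for (int i = len; i < words.length; i++) {
                if (!words[i].equals(words[i - len])) {
                    continue INTERVAL; // not a cycle at this length
                }
            }
            return len;
        }
        return words.length;
    }
}
```

With the question's second example (Opt, Opt, Object, Opt, Opt, Object) this returns 3, and with ABABA as five separate words it returns 5, since no shorter block divides the list evenly.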

Answer 4

A better answer than my other one: a working Java implementation that should be easy to understand, and is generic:

package com.example.algorithms;

import java.util.ArrayList;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;

interface Processor<T> {
    public void process(T element);
}

public class RepeatingListFinder<T> implements Processor<T> {

    private List<T> unit_sequence = new ArrayList<T>();
    private int repeat_count = 0;
    private int partial_matches = 0;
    private Iterator<T> iterator = null;

    /* Class invariant:
     * 
     * The sequence of elements passed through process()
     * can be expressed as the concatenation of
     *        the unit_sequence repeated "repeat_count" times,
     *   plus the first "partial_matches" elements of the unit_sequence.
     * 
     * The iterator points to the remaining elements of the unit_sequence,
     * or null if there have not been any elements processed yet.
     */

    public void process(T element) {
        if (unit_sequence.isEmpty() || !iterator.next().equals(element))
        {
            revise_unit_sequence(element);
            iterator = unit_sequence.iterator();
            repeat_count = 1;
            partial_matches = 0;
        }
        else if (!iterator.hasNext())
        {
            iterator = unit_sequence.iterator();
            ++repeat_count;
            partial_matches = 0;
        }
        else
        {
            ++partial_matches;
        }
    }

    /* Unit sequence has changed. 
     * Restructure and add the new non-matching element. 
     */
    private void revise_unit_sequence(T element) {
        if (repeat_count > 1 || partial_matches > 0)
        {
            List<T> new_sequence = new ArrayList<T>();
            for (int i = 0; i < repeat_count; ++i)
                new_sequence.addAll(unit_sequence);
            new_sequence.addAll(
                    unit_sequence.subList(0, partial_matches));

            unit_sequence = new_sequence;
        }
        unit_sequence.add(element);
    }

    public List<T> getUnitSequence() { 
        return Collections.unmodifiableList(unit_sequence);
    }
    public int getRepeatCount() { return repeat_count; }
    public int getPartialMatchCount() { return partial_matches; }
    public String toString()
    {
        return "("+getRepeatCount()
        +(getPartialMatchCount() > 0 
            ? (" "+getPartialMatchCount()
                +"/"+unit_sequence.size())
            : "")
        +") x "+unit_sequence;
    }

    /********** static methods below for testing **********/

    static public List<Character> stringToCharList(String s)
    {
        List<Character> result = new ArrayList<Character>();
        for (char c : s.toCharArray())
            result.add(c);
        return result;
    }

    static public <T> void test(List<T> list)
    {
        RepeatingListFinder<T> listFinder 
            = new RepeatingListFinder<T>();
        for (T element : list)
            listFinder.process(element);
        System.out.println(listFinder);
    }

    static public void test(String testCase)
    {
        test(stringToCharList(testCase));
    }

    static public void main(String[] args)
    {
        test("ABCABCABCABC");
        test("ABCDFTBAT");
        test("ABABA");
        test("ABACABADABACABAEABACABADABACABAEABACABADABAC");
        test("ABCABCABCDEFABCABCABCDEFABCABCABCDEF");
        test("ABABCABABCABABDABABDABABC");      
    }
}

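This is a stream-oriented approach (with O(N) execution time and O(N) worst-case space requirements). If the List<T> to be processed already exists in memory, it should be possible to rewrite this class to process the List<T> without any additional space requirements: just keep track of the repeat count and the partial match count, and use List.subList() to create a unit sequence that is a view of the first K elements of the input list.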
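As a rough sketch of an in-memory version, the following works directly on a List<T> and returns the unit sequence as a subList view. Note that it swaps in a different technique: the border (prefix) table from Knuth-Morris-Pratt string matching, not the restart logic of the class above. It therefore still needs an int array of length N, so it does not reach the "no additional space" goal. The names (ListPeriodFinder, smallestPeriod, unitSequence, isWholeRepetition) are invented, and get(i) is assumed to be cheap, as it is for ArrayList. Part of the motivation is that, tracing the class above by hand on ABAABA, it appears to report one partial repetition of ABAAB rather than 2 x ABA; the border table handles that input.

```java
import java.util.List;
import java.util.Objects;

final class ListPeriodFinder {

    /**
     * Length of the shortest prefix whose repetition (possibly cut short
     * at the end) reproduces the whole list; 0 for an empty list.
     */
    static <T> int smallestPeriod(List<T> list) {
        int n = list.size();
        if (n == 0) {
            return 0;
        }
        // border[i] = length of the longest proper prefix of list[0..i]
        // that is also a suffix of list[0..i].
        int[] border = new int[n];
        int k = 0;
        for (int i = 1; i < n; i++) {
            while (k > 0 && !Objects.equals(list.get(i), list.get(k))) {
                k = border[k - 1]; // fall back to the next shorter border
            }
            if (Objects.equals(list.get(i), list.get(k))) {
                k++;
            }
            border[i] = k;
        }
        return n - border[n - 1];
    }

    /** View of the first K elements, where K is the smallest period. */
    static <T> List<T> unitSequence(List<T> list) {
        return list.subList(0, smallestPeriod(list));
    }

    /** True if the list is a whole number of copies of its unit sequence. */
    static <T> boolean isWholeRepetition(List<T> list) {
        int period = smallestPeriod(list);
        return period > 0 && list.size() % period == 0;
    }
}
```

The repeat count is list.size() / smallestPeriod(list) and the partial match count is the remainder of that division, so both can be derived without storing a copy of the unit sequence.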
