In Java

Date: 2015-11-15 19:34:16

Tags: java algorithm sequence subsequence

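I ran into this problem in an unrelated program I'm writing, and I spent several hours trying to solve it because I found it interesting. However, I couldn't get it to work all the way: my code only finds the right sequence for some sets of subsequences. This also looks like a general mathematical problem that has probably been solved in various ways over the decades, but I lack the math skills and terminology to find a solution, or indeed anything about this particular problem, online.

I have a set of subsequences that I know are part of a larger, unknown (super?)sequence. I don't think these subsequences are sets in the mathematical sense, because they are ordered, but they are similar to sets in that they contain no duplicate elements. The same goes for the master/super/whatever sequence. (For clarity, I'll refer to it as the supersequence.)

The subsequences all contain the same type of data, but the data is not sorted alphabetically, in ascending order, or anything like that. In a sense, the data is in an arbitrary order: the order it has in the supersequence. That order is what I'm interested in. I want to find the unknown supersequence of these subsequences.

For simplicity, I tried to solve this problem using letters of the alphabet; I can refactor the code later to suit my needs. Since I'm still trying to solve the problem, I first came up with a suitable word for the supersequence that contains no duplicate elements: FLOWCHARTS

Then I came up with the following six subsequences: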

F,W,C,R
L,H,A
L,O,H,A,R,S
C,S
R,T,S
F,O,W,H,A,S

Here is my sequence ordering method:

// LinkedHashMappedKeyValueList keeps the data in the order it was inserted and allows one key to have multiple values.
private static LinkedHashSet<Character> orderSequence(final Set<Character> unorderedSequence, final LinkedHashMappedKeyValueList ruleMap)
{
    List<Character> orderedSequence = new ArrayList<Character>(unorderedSequence);

    // Order the sequence according to the rules.
    System.out.println("---- ORDERING SEQUENCE ----");

    for (Map.Entry<Character, LinkedHashSet<Character>> rule : ruleMap.entrySet())
    {
        char currentChar = rule.getKey();
        LinkedHashSet<Character> ruleChars = rule.getValue();

        System.out.println("Processing rule " + currentChar + "<" + ruleChars.toString());

        if (orderedSequence.contains(currentChar))
        {
            int ruleCharIndex = -1;
            int smallestRuleCharIndex = Integer.MAX_VALUE;
            Iterator<Character> it = ruleChars.iterator();

            // Find the rule character with the smallest index.
            while (it.hasNext())
            {
                char ruleChar = it.next();
                ruleCharIndex = orderedSequence.indexOf(ruleChar);
                System.out.println("\tChecking for rule character: " + ruleChar + " (" + ruleCharIndex + ")");

                if (ruleCharIndex > -1 && smallestRuleCharIndex > ruleCharIndex)
                    smallestRuleCharIndex = ruleCharIndex;
            }

            if (smallestRuleCharIndex != Integer.MAX_VALUE)
                System.out.println("\tMoving '" + currentChar + "' before '"
                        + orderedSequence.get(smallestRuleCharIndex) + "'.");
            else
                System.out.println("\tMoving '" + currentChar + "' to the end of the sequence.");

            if (!moveBefore(orderedSequence.indexOf(currentChar), smallestRuleCharIndex, orderedSequence))
                System.out.println("\tAlready in correct position.");
            else
                System.out.println("\tCurrent sequence: " + listToString(orderedSequence));
        }
        else
            throw new ArithmeticException("Element of a subsequence not a part of the sequence.");
    }

    return new LinkedHashSet<Character>(orderedSequence);
}

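In the end, my code found the supersequence F,L,O,W,H,C,A,R,T,S for these subsequences, which is very close but not perfect. I also need to run my ordering method multiple times, so the "algorithm" I came up with isn't perfect either. The "rule map" is a hash map in which each key is a Character and its value is a set of the Character objects that come after the key Character in the subsequences (and thus in the supersequence).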
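The `LinkedHashMappedKeyValueList` class is custom and not shown here. As a rough sketch only, assuming it behaves like an insertion-ordered map from a character to the set of characters seen directly after it, the rule map could be built with standard collections as follows. The names `RuleMapSketch` and `buildRuleMap` are made up for this illustration.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class RuleMapSketch {

    // Maps each element to the elements seen directly after it in any subsequence.
    // Hypothetical stand-in for the custom LinkedHashMappedKeyValueList.
    static Map<Character, Set<Character>> buildRuleMap(List<String> subsequences) {
        Map<Character, Set<Character>> rules = new LinkedHashMap<>();
        for (String sub : subsequences) {
            for (int i = 0; i < sub.length(); i++) {
                // Every element gets an entry, even if nothing follows it.
                Set<Character> followers =
                        rules.computeIfAbsent(sub.charAt(i), k -> new LinkedHashSet<>());
                if (i + 1 < sub.length()) {
                    followers.add(sub.charAt(i + 1));
                }
            }
        }
        return rules;
    }

    public static void main(String[] args) {
        List<String> subs = Arrays.asList("FWCR", "LHA", "LOHARS", "CS", "RTS", "FOWHAS");
        buildRuleMap(subs).forEach((k, v) -> System.out.println(k + "<" + v));
    }
}
```

For the six subsequences above this yields ten keys in first-seen order (F, W, C, R, L, H, A, O, S, T), with S mapped to an empty set.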

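Is there some kind of Java library I could use for this sort of sequence finding? Could someone point me in the right direction by telling me what this is called and/or helping me find the right algorithm?

Also, here is the abbreviated console output of my program: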

---- BUILDING RULE MAP ----
Subsequences:   F,W,C,R
        L,H,A
        L,O,H,A,R,S
        C,S
        R,T,S
        F,O,W,H,A,S

All subsequences processed. Number of ordering rules: 10
Rule map: (F<[W, O]),(W<[C, H]),(C<[R, S]),(R<[, S, T]),(L<[H, O]),(H<[A]),(A<[, R, S]),(O<[H, W]),(S<[]),(T<[S])

---- BUILDING UNORDERED SEQUENCE ----
Sequence size is 10.
Unordered sequence: F,W,C,R,L,H,A,O,S,T

---- ORDERING SEQUENCE ----
Processing rule F<[W, O]
    Moving 'F' before 'W'.
    Already in correct position.
Processing rule W<[C, H]
    Moving 'W' before 'C'.
    Already in correct position.
Processing rule C<[R, S]
    Moving 'C' before 'R'.
    Already in correct position.
Processing rule R<[, S, T]
    Moving 'R' before 'S'.
    Current sequence: F,W,C,L,H,A,O,R,S,T
Processing rule L<[H, O]
    Moving 'L' before 'H'.
    Already in correct position.
Processing rule H<[A]
    Moving 'H' before 'A'.
    Already in correct position.
Processing rule A<[, R, S]
    Moving 'A' before 'R'.
    Current sequence: F,W,C,L,H,O,A,R,S,T
Processing rule O<[H, W]
    Moving 'O' before 'W'.
    Current sequence: F,O,W,C,L,H,A,R,S,T
Processing rule S<[]
    Moving 'S' to the end of the sequence.
    Current sequence: F,O,W,C,L,H,A,R,T,S
Processing rule T<[S]
    Moving 'T' before 'S'.
    Already in correct position.
Previous sequence:  F,W,C,R,L,H,A,O,S,T
Ordered sequence:   F,O,W,C,L,H,A,R,T,S
Sequences match:    false

---- ORDERING SEQUENCE ----
Processing rule F<[W, O]
    Moving 'F' before 'O'.
    Already in correct position.
Processing rule W<[C, H]
    Moving 'W' before 'C'.
    Already in correct position.
Processing rule C<[R, S]
    Moving 'C' before 'R'.
    Current sequence: F,O,W,L,H,A,C,R,T,S
Processing rule R<[, S, T]
    Moving 'R' before 'T'.
    Already in correct position.
Processing rule L<[H, O]
    Moving 'L' before 'O'.
    Current sequence: F,L,O,W,H,A,C,R,T,S
Processing rule H<[A]
    Moving 'H' before 'A'.
    Already in correct position.
Processing rule A<[, R, S]
    Moving 'A' before 'R'.
    Current sequence: F,L,O,W,H,C,A,R,T,S
Processing rule O<[H, W]
    Moving 'O' before 'W'.
    Already in correct position.
Processing rule S<[]
    Moving 'S' to the end of the sequence.
    Already in correct position.
Processing rule T<[S]
    Moving 'T' before 'S'.
    Already in correct position.
Previous sequence:  F,O,W,C,L,H,A,R,T,S
Ordered sequence:   F,L,O,W,H,C,A,R,T,S
Sequences match:    false

---- ORDERING SEQUENCE ----
Processing rule F<[W, O]
    Moving 'F' before 'O'.
    Current sequence: L,F,O,W,H,C,A,R,T,S
Processing rule W<[C, H]
    Moving 'W' before 'H'.
    Already in correct position.
Processing rule C<[R, S]
    Moving 'C' before 'R'.
    Current sequence: L,F,O,W,H,A,C,R,T,S
Processing rule R<[, S, T]
    Moving 'R' before 'T'.
    Already in correct position.
Processing rule L<[H, O]
    Moving 'L' before 'O'.
    Current sequence: F,L,O,W,H,A,C,R,T,S
Processing rule H<[A]
    Moving 'H' before 'A'.
    Already in correct position.
Processing rule A<[, R, S]
    Moving 'A' before 'R'.
    Current sequence: F,L,O,W,H,C,A,R,T,S
Processing rule O<[H, W]
    Moving 'O' before 'W'.
    Already in correct position.
Processing rule S<[]
    Moving 'S' to the end of the sequence.
    Already in correct position.
Processing rule T<[S]
    Moving 'T' before 'S'.
    Already in correct position.
Previous sequence:  F,L,O,W,H,C,A,R,T,S
Ordered sequence:   F,L,O,W,H,C,A,R,T,S
Sequences match:    true
Sequence ordered according to the limits of the rule map.
Sequence found after 2 tries.

Expected sequence:  F,L,O,W,C,H,A,R,T,S FLOWCHARTS
Found sequence:     F,L,O,W,H,C,A,R,T,S FLOWHCARTS
Sequences match:    false

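2 answers:

Answer 0 (score: 1)

What you are asking for is computing a total order from partial orders. I couldn't find much work on this, but we can discuss the problem a bit here.

Consider A<B<C<D. If we are given the sequences A<C, B<D and C<D, we will never be able to compute the total order. We only get A<C<D and B<D.

I think it can be shown that we need all N-1 relations of the form X<Y, where X and Y appear consecutively in the final chain, to reconstruct the total order (other relations may be present, but they are extra information). As a non-rigorous demonstration, suppose we have A1<A2<A3<...<AN and suppose we are able to reconstruct A_begin through A_end from the partial orders. Now, to fit that piece into its correct position in the total order, we need to know A_(begin-1)<A_begin. No other relation will make it fit into the total order. Continuing down into A_begin..A_end, we should be able to show by some kind of induction/infinite descent that we need all the relations given by consecutive characters of the word in order to reconstruct it.

The information missing from the above set of sequences is F<L and C<H. The sequences that can be obtained are W->C->R->T->S, F->O->W->H->A->R->T->S and L->O->W->H->A->R->T->S. Computing the rest requires more information.

In the above case, after decomposition and duplicate elimination we have the following relations: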

A,R
A,S <-- redundant since R<S and A<R
C,R
C,S <-- redundant since R<S and C<R
F,O 
F,W <-- redundant since O<W and F<O
H,A
L,H <-- redundant since O<H and L<O
L,O
O,H <-- redundant since O<W and W<H
O,W
R,S <-- redundant since T<S and R<T
R,T
T,S
W,C
W,H

There are 16 relations, 6 of which are immediately redundant. Removing the redundant ones leaves the following 10 relations:

A,R <-- consecutive letters in actual word
C,R
F,O 
H,A <-- consecutive letters in actual word
L,O <-- consecutive letters in actual word
O,W <-- consecutive letters in actual word
R,T <-- consecutive letters in actual word
T,S <-- consecutive letters in actual word
W,C <-- consecutive letters in actual word
W,H 
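The redundancy elimination above is what graph theory calls a transitive reduction. The following is a minimal sketch, assuming the relations are duplicate-free and contain no cycles; each relation X<Y is written as the two-character string "XY". The class and method names are made up for this illustration. A relation is dropped when its right-hand side can still be reached from its left-hand side through some other relation.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Deque;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class TransitiveReductionSketch {

    // Returns the relations "XY" (meaning X<Y) that are not implied by a longer chain.
    static List<String> reduce(List<String> relations) {
        Map<Character, Set<Character>> successors = new LinkedHashMap<>();
        for (String r : relations) {
            successors.computeIfAbsent(r.charAt(0), k -> new LinkedHashSet<>())
                      .add(r.charAt(1));
        }
        List<String> kept = new ArrayList<>();
        for (String r : relations) {
            if (!reachableIndirectly(r.charAt(0), r.charAt(1), successors)) {
                kept.add(r);
            }
        }
        return kept;
    }

    // True if 'to' can be reached from 'from' without using the direct edge from->to.
    private static boolean reachableIndirectly(char from, char to,
                                               Map<Character, Set<Character>> successors) {
        Deque<Character> stack = new ArrayDeque<>();
        Set<Character> seen = new HashSet<>();
        for (char s : successors.getOrDefault(from, Collections.<Character>emptySet())) {
            if (s != to) {
                stack.push(s); // start from every other direct successor
            }
        }
        while (!stack.isEmpty()) {
            char current = stack.pop();
            if (current == to) {
                return true;
            }
            if (!seen.add(current)) {
                continue; // already explored
            }
            for (char next : successors.getOrDefault(current, Collections.<Character>emptySet())) {
                stack.push(next);
            }
        }
        return false;
    }
}
```

Fed the 16 relations listed earlier, in the same order, this sketch keeps exactly the 10 relations shown above.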

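The only relations missing for the original sequence are F<L and C<H. The extra given relations C<R, F<O and W<H repeat a LHS or RHS and provide information that cannot be acted on (basically, they link two partially ordered chains, but not at the endpoints, so you know one chain has to be merged into the other or is less than the other, but not where).

Once the missing relations are added, there are multiple ways to accomplish this. Your code may well work on its own.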
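One standard way to do it, which this answer does not name explicitly, is a topological sort. Below is a minimal sketch using Kahn's algorithm; the class name `TopoSortSketch` and its `Result` type are made up for this illustration. It also reports whether the order is forced: if at any step more than one element has no remaining predecessor, the given relations do not determine a single total order.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class TopoSortSketch {

    // One valid order, plus whether it is the only one the input allows.
    static final class Result {
        final String order;
        final boolean unique;

        Result(String order, boolean unique) {
            this.order = order;
            this.unique = unique;
        }
    }

    static Result sort(List<String> subsequences) {
        Map<Character, Set<Character>> successors = new LinkedHashMap<>();
        Map<Character, Integer> inDegree = new LinkedHashMap<>();
        for (String sub : subsequences) {
            for (int i = 0; i < sub.length(); i++) {
                char c = sub.charAt(i);
                successors.computeIfAbsent(c, k -> new LinkedHashSet<>());
                inDegree.putIfAbsent(c, 0);
                // Count each distinct "previous < current" relation once.
                if (i > 0 && successors.get(sub.charAt(i - 1)).add(c)) {
                    inDegree.merge(c, 1, Integer::sum);
                }
            }
        }

        Deque<Character> ready = new ArrayDeque<>();
        inDegree.forEach((c, d) -> { if (d == 0) ready.add(c); });

        StringBuilder order = new StringBuilder();
        boolean unique = true;
        while (!ready.isEmpty()) {
            if (ready.size() > 1) {
                unique = false; // more than one candidate: the order is not forced
            }
            char c = ready.poll();
            order.append(c);
            for (char next : successors.get(c)) {
                if (inDegree.merge(next, -1, Integer::sum) == 0) {
                    ready.add(next);
                }
            }
        }
        if (order.length() != successors.size()) {
            throw new IllegalArgumentException("Subsequences contradict each other (cycle).");
        }
        return new Result(order.toString(), unique);
    }
}
```

With the six subsequences from the question, `unique` comes back `false`: F and L can be swapped, and so can C and H, so any order returned is just one of several valid ones. Adding the two missing relations as extra subsequences "FL" and "CH" gives the order FLOWCHARTS with `unique` set to `true`.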

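Answer 1 (score: 0)

The problem I described in the question is the shortest common supersequence problem. I didn't search for it online because I was too focused on mathematical sets while writing the code, and it was only after I started writing out the question that I realized I was working with sequences. In fact, I thought I had just made up the word "supersequence" on the spot. It turns out that it is exactly the word I needed to search for online to find several pages of material related to this problem.

Oliver Charlesworth's comment is absolutely right. The given subsequences lack the information needed to produce the correct supersequence, so in my example there is no way to find the actual supersequence. Coffee's comment about bioinformatics most likely refers to the shortest common superstring problem, which can be used to reconstruct DNA sequences and is discussed in this question. The shortest common supersequence and superstring problems look very similar at first glance. In the superstring problem, however, the substrings must consist of adjacent elements of the superstring, unlike in the supersequence problem. (Illustrated in the figure below.)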

Difference between the shortest common supersequence and superstring problems.
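The distinction can also be shown in a few lines of code. This is a small sketch with made-up helper names, using the FLOWCHARTS example from the question: a subsequence may skip elements of the longer sequence, while a substring must be one contiguous run.

```java
class SubsequenceVsSubstring {

    // True if every element of 'candidate' appears in 'text' in the same order,
    // with gaps allowed between the matched elements.
    static boolean isSubsequence(String candidate, String text) {
        int matched = 0;
        for (int i = 0; i < text.length() && matched < candidate.length(); i++) {
            if (text.charAt(i) == candidate.charAt(matched)) {
                matched++;
            }
        }
        return matched == candidate.length();
    }

    // True only if 'candidate' appears in 'text' as one contiguous run.
    static boolean isSubstring(String candidate, String text) {
        return text.contains(candidate);
    }

    public static void main(String[] args) {
        System.out.println(isSubsequence("FWCR", "FLOWCHARTS")); // true
        System.out.println(isSubstring("FWCR", "FLOWCHARTS"));   // false
        System.out.println(isSubstring("CHART", "FLOWCHARTS"));  // true
    }
}
```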

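Both of these problems are also NP-complete or "NP-hard". The way I understand it, there is no known efficient algorithm that always finds an optimal solution to them, so there is no magic algorithm you can just copy and paste into your code. There are only approximations and "good enough" solutions.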
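The hardness applies to an arbitrary number of input sequences. For just two sequences, the length of a shortest common supersequence can be computed with ordinary dynamic programming, because it equals the sum of both lengths minus the length of their longest common subsequence. A minimal sketch, with a made-up class name, follows; it does not extend efficiently to many sequences.

```java
class ScsTwoSequences {

    // Length of a shortest common supersequence of exactly two strings.
    static int scsLength(String a, String b) {
        // lcs[i][j] = length of the longest common subsequence of the
        // first i characters of a and the first j characters of b.
        int[][] lcs = new int[a.length() + 1][b.length() + 1];
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                if (a.charAt(i - 1) == b.charAt(j - 1)) {
                    lcs[i][j] = lcs[i - 1][j - 1] + 1;
                } else {
                    lcs[i][j] = Math.max(lcs[i - 1][j], lcs[i][j - 1]);
                }
            }
        }
        // Shared elements only need to appear once in the supersequence.
        return a.length() + b.length() - lcs[a.length()][b.length()];
    }

    public static void main(String[] args) {
        System.out.println(scsLength("RTS", "CS")); // 4, e.g. R,T,C,S
    }
}
```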

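Surprisingly, I couldn't find any complete Java code that comes close to this problem, but I did come across some sites with code in other languages. Below I've listed some useful resources I found while researching the problem. I've also included resources on the shortest common superstring problem, since it is related.

Other resources for further research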