Question

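Given a list of words, list = {w1, w2, w3, w1, w2}

Find all permutations of the above word list in a long text.

Long text = {This is long text w1 w2 w3 w4 and w1 w2 w1 w2 w3. This is yet another long text which does not have a permutation because it does not contain all the words w1,w2,w2,w2,w2, but this is a space-separated permutation w2 w2 w3 w1 w1}

What is the most efficient algorithm for this problem?

My idea is to first assign a tuple (unique #, unique prime #) to each unique word in the list, {w1 = [101, 5], w2 = [103, 7], w3 = [205, 11]}, and then use the assigned tuples to compute a sum for the whole list: w1 [101 * 5] + w2 [103 * 7] + w3 [205 * 11] + w1 [101 * 5] + w2 [103 * 7] = 4707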
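
The arithmetic can be checked with a few lines of Python. This is only a sketch: the names `tuples`, `words` and `target_sum` are made up for illustration, and the values are the ones from the example.

```python
# Hypothetical names; the (unique #, unique prime #) values are those from the example.
tuples = {"w1": (101, 5), "w2": (103, 7), "w3": (205, 11)}
words = ["w1", "w2", "w3", "w1", "w2"]

# Sum of (unique # * unique prime #) over every word in the list.
target_sum = sum(number * prime for number, prime in (tuples[w] for w in words))
print(target_sum)  # 4707
```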

Here is the pseudo-code:

targetSum = 4707;
long sum = 0;
for (int i = 0; i < text.size(); i++) {
    // look up (unique #, unique prime #) for the word at index i
    sum += (unique # * unique prime #);
    if (i >= list.size()) {
        // the word at index (i - list.size()) has left the window:
        // look up its tuple and subtract its product
        sum -= (unique # * unique prime #) for index (i - list.size());
    }

    if (targetSum == sum) {
        // this is a possible match, so verify again with a hashMap lookup
        // that this region is an actual match
    }
}
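
For what it is worth, here is how that rolling sum might look as runnable Python. It is a sketch under assumptions: the function name `find_permutations_by_sum` and its parameters are hypothetical, words outside the list are assumed to contribute 0, and a `Counter` comparison stands in for the hashMap verification step.

```python
from collections import Counter

def find_permutations_by_sum(text_words, pattern, tuples):
    """Return start indices of windows that are permutations of pattern.

    tuples maps each pattern word to (unique #, unique prime #).
    Illustrative helper, not code from the original post.
    """
    n = len(pattern)
    # Per-word contribution; words outside the pattern are assumed to add 0.
    weight = {w: number * prime for w, (number, prime) in tuples.items()}
    target_sum = sum(weight[w] for w in pattern)
    target_counts = Counter(pattern)

    matches = []
    window_sum = 0
    for i, word in enumerate(text_words):
        window_sum += weight.get(word, 0)
        if i >= n:
            # The word at index i - n has just left the window.
            window_sum -= weight.get(text_words[i - n], 0)
        if i >= n - 1 and window_sum == target_sum:
            start = i - n + 1
            # Equal sums are only a hint, so confirm with an exact count comparison.
            if Counter(text_words[start:i + 1]) == target_counts:
                matches.append(start)
    return matches

tuples = {"w1": (101, 5), "w2": (103, 7), "w3": (205, 11)}
text = "this is long text w1 w2 w3 w4 and w1 w2 w1 w2 w3".split()
result = find_permutations_by_sum(text, "w1 w2 w3 w1 w2".split(), tuples)
print(result)  # [9]
```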

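Is there any better logic or algorithm?

Update:

I have been reading further about the Z algorithm for pattern matching (Z-boxes), but I cannot see how a Z-box or Z-array would make this any better unless all the permutations are known in advance. Is there perhaps a better approach?

Thanks, everyone, for sharing your knowledge.

Thanks,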

Bhavesh

Answer 1

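If the words must be contiguous, then start by building a dictionary of the words you're looking for, along with their counts. For your example of [w1, w2, w3, w1, w2], the dictionary would contain: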

{w1, 2}
{w2, 2}
{w3, 1}

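Call that the incoming dictionary.

Also create an empty dictionary of the same type (i.e. word, count). Call that the outgoing dictionary.

Then, build a queue whose capacity equals the number of words you're looking for. The queue is initially empty.

Then, you start going through the text word by word: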

for each text_word in text
    if queue.count == number of words
        queue_word = remove word from queue
        if queue_word is in outgoing dictionary
            remove from outgoing
            add to incoming
        end if
    end if

    add text_word to queue
    if text_word is in incoming dictionary
        remove text_word from incoming dictionary
        add text_word to outgoing dictionary
        if incoming dictionary is empty
            you found a permutation
        end if
    end if
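
A minimal Python sketch of the same sliding-window idea follows. It is a variation rather than the exact bookkeeping above: instead of moving entries between two dictionaries, it keeps one signed count per word, so a surplus copy of a word that is still inside the window stays accounted for when an earlier copy leaves. The names `find_permutations`, `need` and `missing` are made up for illustration.

```python
from collections import Counter, deque

def find_permutations(text_words, pattern):
    """Yield the start index of every window that is a permutation of pattern."""
    n = len(pattern)
    # need[w] > 0: the window still lacks copies of w; need[w] < 0: it has a surplus.
    need = Counter(pattern)
    missing = n          # how many pattern words the window still lacks in total
    window = deque()

    for i, word in enumerate(text_words):
        if len(window) == n:
            old = window.popleft()
            if old in need:
                need[old] += 1
                if need[old] > 0:
                    missing += 1   # the window lost a copy it actually needed
        window.append(word)
        if word in need:
            if need[word] > 0:
                missing -= 1       # this copy fills a real gap
            need[word] -= 1
        if missing == 0:
            yield i - n + 1

text = "x w1 w2 w1 w2 w3 y".split()
print(list(find_permutations(text, "w1 w2 w3 w1 w2".split())))  # [1]
```

Each text word is touched a constant number of times, so the scan is linear in the length of the text.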

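The only complication here is that "add a word to the dictionary" and "remove a word from the dictionary" have to take into account the possibility of the same word occurring multiple times. So your remove is really something like: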

count = dictionary[word].Count - 1
if (count == 0)
    dictionary.Remove(word)
else
    dictionary[word].Count = count

Adding is similar.
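
In Python, with a plain dict mapping each word to its count, the pair of operations could be sketched like this. The helper names `remove_word` and `add_word` are hypothetical.

```python
def remove_word(dictionary, word):
    """Decrement word's count, dropping the entry once it reaches zero."""
    count = dictionary[word] - 1
    if count == 0:
        del dictionary[word]
    else:
        dictionary[word] = count

def add_word(dictionary, word):
    """Increment word's count, creating the entry if it is absent."""
    dictionary[word] = dictionary.get(word, 0) + 1

incoming = {"w1": 2, "w2": 2, "w3": 1}
outgoing = {}
remove_word(incoming, "w1")
add_word(outgoing, "w1")
print(incoming, outgoing)  # {'w1': 1, 'w2': 2, 'w3': 1} {'w1': 1}
```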

Answer 2

The idea of identifying your pattern with prime numbers is good, but it is the product of primes that is unique, not their sum.
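
A tiny example of why: by unique factorization, two different multisets of primes can never have the same product, but they can easily have the same sum. The words and prime values below are made up purely for illustration and are simpler than the tuples in the question.

```python
from math import prod

# Hypothetical assignment of primes to three words, for illustration only.
value = {"a": 2, "b": 3, "c": 5}

window_1 = ["a", "a", "c"]
window_2 = ["b", "b", "b"]   # not a permutation of window_1

sums = [sum(value[w] for w in win) for win in (window_1, window_2)]
products = [prod(value[w] for w in win) for win in (window_1, window_2)]
print(sums)      # [9, 9]   -> the sums collide
print(products)  # [20, 27] -> the products differ
```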

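You can then use a moving-window approach. Calculate the product for your pattern and for the first five words of the text. Then move the window by dividing the product by the value of the word leaving on the left and multiplying by the value of the word entering on the right. All words that are not in your pattern have the neutral value 1.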

def permindex(text, pattern, start = 0):
    """Index of first permutation of the pattern in text"""

    if len(text) - start < len(pattern):
        return -1

    # One prime per distinct pattern word, so at most 10 distinct words here
    primes = [2, 3, 5, 7, 11, 13, 17, 19, 23, 29]

    value = {}
    goal = 1
    for p in pattern:
        if p not in value:
            value[p] = primes.pop(0)

        goal *= value[p]

    prod = 1
    for t in text[start:start + len(pattern)]:
        prod *= value.get(t, 1)

    i = start

    for j in range(start + len(pattern), len(text)):

        if goal == prod:
            return i

        # Integer division keeps prod exact (the leaving value always divides it)
        prod //= value.get(text[i], 1)
        prod *= value.get(text[j], 1)

        i += 1

    if goal == prod:
        return len(text) - len(pattern)

    return -1

Applying this to your problem:

import re

patt = "w1 w2 w3 w1 w2".split()

text = re.split(r"\W+",
        """This is long text w1 w2 w3 w4 and w1 w2 w1 w2 w3. This 
        yet another long text which does not have permutation because 
        it does not contain all words w1,w2,w2,w2,w2 , but this is 
        permutation w2 w2 w3 w1 w1""")

p = permindex(text, patt)
while p >= 0:
    print(p, text[p: p + len(patt)])
    p = permindex(text, patt, p + 1)

yields:

9 ['w1', 'w2', 'w1', 'w2', 'w3']
40 ['w2', 'w2', 'w3', 'w1', 'w1']
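
If you want all matches in one pass rather than restarting the search after each hit, something like the following might do. It is a sketch, not a drop-in replacement: `primes` and `permindices` are names I made up, and the primes are generated by simple trial division so the number of distinct pattern words is not capped at ten.

```python
from itertools import count

def primes():
    """Yield primes by trial division (fine for a handful of distinct words)."""
    found = []
    for n in count(2):
        if all(n % p for p in found):
            found.append(n)
            yield n

def permindices(text, pattern):
    """Yield every index at which a permutation of pattern starts in text."""
    gen = primes()
    value = {}
    goal = 1
    for p in pattern:
        if p not in value:
            value[p] = next(gen)
        goal *= value[p]

    n = len(pattern)
    prod = 1
    for j, word in enumerate(text):
        prod *= value.get(word, 1)
        if j >= n:
            # Exact integer division: the leaving word's value is a factor of prod.
            prod //= value.get(text[j - n], 1)
        if j >= n - 1 and prod == goal:
            yield j - n + 1

words = "x w1 w2 w1 w2 w3 y w2 w2 w3 w1 w1".split()
print(list(permindices(words, "w1 w2 w3 w1 w2".split())))  # [1, 7]
```

Python integers are arbitrary precision, so the running product cannot overflow, although it does grow with the pattern length.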
