Question

编程珍珠的第15.2节

可以在此处查看C代码：http://www.cs.bell-labs.com/cm/cs/pearls/longdup.c

当我使用suffix-array在Python中实现它时：

example = open("iliad10.txt").read()
def comlen(p, q):
    i = 0
    for x in zip(p, q):
        if x[0] == x[1]:
            i += 1
        else:
            break
    return i

suffix_list = []
example_len = len(example)
idx = list(range(example_len))
idx.sort(cmp = lambda a, b: cmp(example[a:], example[b:]))  #VERY VERY SLOW

max_len = -1
for i in range(example_len - 1):
    this_len = comlen(example[idx[i]:], example[idx[i+1]:])
    print this_len
    if this_len > max_len:
        max_len = this_len
        maxi = i

我发现idx.sort步骤非常慢。我认为它很慢，因为Python需要通过值而不是指针传递子字符串（如上面的C代码）。

可以从here

下载测试文件

C代码只需0.3秒即可完成。

time cat iliad10.txt |./longdup 
On this the rest of the Achaeans with one voice were for
respecting the priest and taking the ransom that he offered; but
not so Agamemnon, who spoke fiercely to him and sent him roughly
away. 

real    0m0.328s
user    0m0.291s
sys 0m0.006s

但是对于Python代码，它永远不会在我的计算机上结束（我等了10分钟并将其杀死）

有没有人有想法如何使代码高效？（例如，不到10秒）

Answer 1

我的解决方案基于后缀数组。它由 最长公共前缀的前缀加倍构成。最坏情况的复杂度是O（n（log n）^ 2）。我的笔记本电脑上的任务“iliad.mb.txt”需要4秒钟。函数suffix_array和longest_common_substring中的代码很好。后一种功能很短，可以很容易地修改，例如用于搜索10个最长的非重叠子串。如果重复的字符串超过10000个字符，则此Python代码比问题中的original C code (copy here)快。

from itertools import groupby from operator import itemgetter def longest_common_substring(text): """Get the longest common substrings and their positions. >>> longest_common_substring('banana') {'ana': [1, 3]} >>> text = "not so Agamemnon, who spoke fiercely to " >>> sorted(longest_common_substring(text).items()) [(' s', [3, 21]), ('no', [0, 13]), ('o ', [5, 20, 38])] This function can be easy modified for any criteria, e.g. for searching ten longest non overlapping repeated substrings. """ sa, rsa, lcp = suffix_array(text) maxlen = max(lcp) result = {} for i in range(1, len(text)): if lcp[i] == maxlen: j1, j2, h = sa[i - 1], sa[i], lcp[i] assert text[j1:j1 + h] == text[j2:j2 + h] substring = text[j1:j1 + h] if not substring in result: result[substring] = [j1] result[substring].append(j2) return dict((k, sorted(v)) for k, v in result.items()) def suffix_array(text, _step=16): """Analyze all common strings in the text. Short substrings of the length _step a are first pre-sorted. The are the results repeatedly merged so that the garanteed number of compared characters bytes is doubled in every iteration until all substrings are sorted exactly. Arguments: text: The text to be analyzed. _step: Is only for optimization and testing. It is the optimal length of substrings used for initial pre-sorting. The bigger value is faster if there is enough memory. Memory requirements are approximately (estimate for 32 bit Python 3.3): len(text) * (29 + (_size + 20 if _size > 2 else 0)) + 1MB Return value: (tuple) (sa, rsa, lcp) sa: Suffix array for i in range(1, size): assert text[sa[i-1]:] < text[sa[i]:] rsa: Reverse suffix array for i in range(size): assert rsa[sa[i]] == i lcp: Longest common prefix for i in range(1, size): assert text[sa[i-1]:sa[i-1]+lcp[i]] == text[sa[i]:sa[i]+lcp[i]] if sa[i-1] + lcp[i] < len(text): assert text[sa[i-1] + lcp[i]] < text[sa[i] + lcp[i]] >>> suffix_array(text='banana') ([5, 3, 1, 0, 4, 2], [3, 2, 5, 1, 4, 0], [0, 1, 3, 0, 0, 2]) Explanation: 'a' < 'ana' < 'anana' < 'banana' < 'na' < 'nana' The Longest Common String is 'ana': lcp[2] == 3 == len('ana') It is between tx[sa[1]:] == 'ana' < 'anana' == tx[sa[2]:] """ tx = text size = len(tx) step = min(max(_step, 1), len(tx)) sa = list(range(len(tx))) sa.sort(key=lambda i: tx[i:i + step]) grpstart = size * [False] + [True] # a boolean map for iteration speedup. # It helps to skip yet resolved values. The last value True is a sentinel. rsa = size * [None] stgrp, igrp = '', 0 for i, pos in enumerate(sa): st = tx[pos:pos + step] if st != stgrp: grpstart[igrp] = (igrp < i - 1) stgrp = st igrp = i rsa[pos] = igrp sa[i] = pos grpstart[igrp] = (igrp < size - 1 or size == 0) while grpstart.index(True) < size: # assert step <= size nextgr = grpstart.index(True) while nextgr < size: igrp = nextgr nextgr = grpstart.index(True, igrp + 1) glist = [] for ig in range(igrp, nextgr): pos = sa[ig] if rsa[pos] != igrp: break newgr = rsa[pos + step] if pos + step < size else -1 glist.append((newgr, pos)) glist.sort() for ig, g in groupby(glist, key=itemgetter(0)): g = [x[1] for x in g] sa[igrp:igrp + len(g)] = g grpstart[igrp] = (len(g) > 1) for pos in g: rsa[pos] = igrp igrp += len(g) step *= 2 del grpstart # create LCP array lcp = size * [None] h = 0 for i in range(size): if rsa[i] > 0: j = sa[rsa[i] - 1] while i != size - h and j != size - h and tx[i + h] == tx[j + h]: h += 1 lcp[rsa[i]] = h if h > 0: h -= 1 if size > 0: lcp[0] = 0 return sa, rsa, lcp

我比more complicated O(n log n)更喜欢这个解决方案，因为Python有一个非常快速的列表排序（list.sort），可能比该文章的方法中必要的线性时间操作更快，应该是非常的O（n）随机字符串的特殊推定和小字母表（典型的DNA基因组分析）。我在Gog 2011中读到，我的算法的恶例O（n log n）实际上比许多O（n）算法更快，不能使用CPU内存缓存。

如果文本包含8 kB长的重复字符串，则基于grow_chains的另一个答案中的代码比问题中的原始示例慢19倍。长期重复的文本对于古典文献来说并不典型，但它们经常出现在在“独立”学校的家庭作业集合。该计划不应该冻结它。

我为Python 2.7,3.3 - 3.6编写了an example and tests with the same code。

Answer 2

主要问题似乎是python通过副本切片：https://stackoverflow.com/a/5722068/538551

您必须使用memoryview来获取引用而不是副本。当我这样做时，程序挂起 idx.sort函数（非常快）。

我确信通过一些工作，你可以让其余的工作。

修改

~~上述更改不能作为替代品使用，因为cmp与strcmp的工作方式不同。例如，尝试以下C代码：~~

#include <stdio.h> #include <string.h> int main() { char* test1 = "ovided by The Internet Classics Archive"; char* test2 = "rovided by The Internet Classics Archive."; printf("%d\n", strcmp(test1, test2)); }

将结果与此python进行比较：

test1 = "ovided by The Internet Classics Archive"; test2 = "rovided by The Internet Classics Archive." print(cmp(test1, test2))

C代码在我的机器上打印-3，而python版本打印-1。看起来示例C代码滥用strcmp的返回值（毕竟它在qsort中使用）。我找不到有关strcmp何时会返回[-1, 0, 1]以外的内容的任何文档，但在原始代码中向printf添加pstrcmp会显示很多值以外的内容该范围（3，-31,5是前3个值）。

~~要确保-3不是某些错误代码，如果我们反向test1和test2，我们将获得3。~~

修改

以上是有趣的琐事，但在影响任何代码块方面实际上并不正确。当我关闭笔记本电脑并离开wifi区域时，我意识到这一点......在我点击Save之前，我应该仔细检查一切。

FWIW，cmp肯定适用于memoryview个对象（按预期打印-1）：

print(cmp(memoryview(test1), memoryview(test2)))

我不确定为什么代码没有按预期工作。在我的机器上打印出列表看起来并不像预期的那样。我会调查一下，并尝试找到一个更好的解决方案，而不是抓住稻草。

Answer 3

将算法转换为Python：

from itertools import imap, izip, starmap, tee
from os.path   import commonprefix

def pairwise(iterable): # itertools recipe
    a, b = tee(iterable)
    next(b, None)
    return izip(a, b)

def longest_duplicate_small(data):
    suffixes = sorted(data[i:] for i in xrange(len(data))) # O(n*n) in memory
    return max(imap(commonprefix, pairwise(suffixes)), key=len)

buffer()允许在不复制的情况下获取子字符串：

def longest_duplicate_buffer(data):
    n = len(data)
    sa = sorted(xrange(n), key=lambda i: buffer(data, i)) # suffix array
    def lcp_item(i, j):  # find longest common prefix array item
        start = i
        while i < n and data[i] == data[i + j - start]:
            i += 1
        return i - start, start
    size, start = max(starmap(lcp_item, pairwise(sa)), key=lambda x: x[0])
    return data[start:start + size]

iliad.mb.txt我的机器需要5秒钟。

原则上，可以使用suffix array增加lcp array在O（n）时间和O（n）内存中找到副本。

^{注意：*_memoryview()版本}

已弃用*_buffer()

更高效的内存版本（与longest_duplicate_small（）相比）：

def cmp_memoryview(a, b):
    for x, y in izip(a, b):
        if x < y:
            return -1
        elif x > y:
            return 1
    return cmp(len(a), len(b))

def common_prefix_memoryview((a, b)):
    for i, (x, y) in enumerate(izip(a, b)):
        if x != y:
            return a[:i]
    return a if len(a) < len(b) else b

def longest_duplicate(data):
    mv = memoryview(data)
    suffixes = sorted((mv[i:] for i in xrange(len(mv))), cmp=cmp_memoryview)
    result = max(imap(common_prefix_memoryview, pairwise(suffixes)), key=len)
    return result.tobytes()

iliad.mb.txt在我的机器上需要17秒。结果是：

On this the rest of the Achaeans with one voice were for respecting
the priest and taking the ransom that he offered; but not so Agamemnon,
who spoke fiercely to him and sent him roughly away.

我必须定义自定义函数来比较memoryview个对象，因为memoryview比较要么在Python 3中引发异常，要么在Python 2中产生错误的结果：

>>> s = b"abc"
>>> memoryview(s[0:]) > memoryview(s[1:])
True
>>> memoryview(s[0:]) < memoryview(s[1:])
True

找到Python最长重复字符串的有效方法（From Programming Pearls）

4 个答案: