这是将“string contains substring”问题概括为(更多)任意类型。
给定一个序列(例如列表或元组),确定另一个序列是否在其中的最佳方法是什么?作为奖励,它应该返回子序列开始的元素的索引:
示例用法(序列中的序列):
>>> seq_in_seq([5,6], [4,'a',3,5,6])
3
>>> seq_in_seq([5,7], [4,'a',3,5,6])
-1 # or None, or whatever
到目前为止,我只是依靠蛮力,看起来既缓慢,丑陋又笨拙。
答案 0 :(得分:20)
我是第二个Knuth-Morris-Pratt算法。顺便说一句,你的问题(以及KMP解决方案)正是Python Cookbook第2版中的配方5.13。您可以在http://code.activestate.com/recipes/117214/
找到相关代码它在给定的序列中找到 all 正确的子序列,并且应该用作迭代器:
>>> for s in KnuthMorrisPratt([4,'a',3,5,6], [5,6]): print s
3
>>> for s in KnuthMorrisPratt([4,'a',3,5,6], [5,7]): print s
(nothing)
答案 1 :(得分:9)
这是一种蛮力方法O(n*m)
(类似于@mcella's answer)。对于小输入序列,它可能比纯Python O(n+m)
中的Knuth-Morris-Pratt算法实现更快(参见@Gregg Lind answer)。
#!/usr/bin/env python
def index(subseq, seq):
"""Return an index of `subseq`uence in the `seq`uence.
Or `-1` if `subseq` is not a subsequence of the `seq`.
The time complexity of the algorithm is O(n*m), where
n, m = len(seq), len(subseq)
>>> index([1,2], range(5))
1
>>> index(range(1, 6), range(5))
-1
>>> index(range(5), range(5))
0
>>> index([1,2], [0, 1, 0, 1, 2])
3
"""
i, n, m = -1, len(seq), len(subseq)
try:
while True:
i = seq.index(subseq[0], i + 1, n - m + 1)
if subseq == seq[i:i + m]:
return i
except ValueError:
return -1
if __name__ == '__main__':
import doctest; doctest.testmod()
我想知道在这种情况下小有多大?
答案 2 :(得分:5)
与字符串匹配先生... Knuth-Morris-Pratt string matching
相同答案 3 :(得分:3)
一种简单的方法:转换为字符串并依赖字符串匹配。
使用字符串列表的示例:
>>> f = ["foo", "bar", "baz"]
>>> g = ["foo", "bar"]
>>> ff = str(f).strip("[]")
>>> gg = str(g).strip("[]")
>>> gg in ff
True
使用字符串元组的示例:
>>> x = ("foo", "bar", "baz")
>>> y = ("bar", "baz")
>>> xx = str(x).strip("()")
>>> yy = str(y).strip("()")
>>> yy in xx
True
使用数字列表的示例:
>>> f = [1 , 2, 3, 4, 5, 6, 7]
>>> g = [4, 5, 6]
>>> ff = str(f).strip("[]")
>>> gg = str(g).strip("[]")
>>> gg in ff
True
答案 4 :(得分:2)
>>> def seq_in_seq(subseq, seq):
... while subseq[0] in seq:
... index = seq.index(subseq[0])
... if subseq == seq[index:index + len(subseq)]:
... return index
... else:
... seq = seq[index + 1:]
... else:
... return -1
...
>>> seq_in_seq([5,6], [4,'a',3,5,6])
3
>>> seq_in_seq([5,7], [4,'a',3,5,6])
-1
对不起,我不是算法专家,这只是我脑海里想到的最快的事情,至少我觉得它看起来不错(对我来说),我很开心编码。 ; - )
很可能你的蛮力方法正在做同样的事情。
答案 5 :(得分:1)
对于小图案,蛮力可能没问题。
对于较大的,请查看Aho-Corasick algorithm。
答案 6 :(得分:1)
这是另一个KMP实施:
from itertools import tee
def seq_in_seq(seq1,seq2):
'''
Return the index where seq1 appears in seq2, or -1 if
seq1 is not in seq2, using the Knuth-Morris-Pratt algorithm
based heavily on code by Neale Pickett <neale@woozle.org>
found at: woozle.org/~neale/src/python/kmp.py
>>> seq_in_seq(range(3),range(5))
0
>>> seq_in_seq(range(3)[-1:],range(5))
2
>>>seq_in_seq(range(6),range(5))
-1
'''
def compute_prefix_function(p):
m = len(p)
pi = [0] * m
k = 0
for q in xrange(1, m):
while k > 0 and p[k] != p[q]:
k = pi[k - 1]
if p[k] == p[q]:
k = k + 1
pi[q] = k
return pi
t,p = list(tee(seq2)[0]), list(tee(seq1)[0])
m,n = len(p),len(t)
pi = compute_prefix_function(p)
q = 0
for i in range(n):
while q > 0 and p[q] != t[i]:
q = pi[q - 1]
if p[q] == t[i]:
q = q + 1
if q == m:
return i - m + 1
return -1
答案 7 :(得分:0)
我参加聚会有点晚了,但是这里有一些简单的使用字符串:
>>> def seq_in_seq(sub, full):
... f = ''.join([repr(d) for d in full]).replace("'", "")
... s = ''.join([repr(d) for d in sub]).replace("'", "")
... #return f.find(s) #<-- not reliable for finding indices in all cases
... return s in f
...
>>> seq_in_seq([5,6], [4,'a',3,5,6])
True
>>> seq_in_seq([5,7], [4,'a',3,5,6])
False
>>> seq_in_seq([4,'abc',33], [4,'abc',33,5,6])
True
正如 Ilya V. Schurov 所述,在这种情况下, find 方法不会返回带有多字符字符串或多位数字的正确索引。
答案 8 :(得分:0)
就其价值而言,我尝试使用这样的双端队列:
from collections import deque
from itertools import islice
def seq_in_seq(needle, haystack):
"""Generator of indices where needle is found in haystack."""
needle = deque(needle)
haystack = iter(haystack) # Works with iterators/streams!
length = len(needle)
# Deque will automatically call deque.popleft() after deque.append()
# with the `maxlen` set equal to the needle length.
window = deque(islice(haystack, length), maxlen=length)
if needle == window:
yield 0 # Match at the start of the haystack.
for index, value in enumerate(haystack, start=1):
window.append(value)
if needle == window:
yield index
deque 实现的一个优点是它只在干草堆上进行一次线性传递。因此,如果 haystack 正在流式传输,那么它仍然可以工作(与依赖切片的解决方案不同)。
解决方案仍然是蛮力,O(n*m)。一些简单的本地基准测试表明,它比 str.index
中字符串搜索的 C 实现慢约 100 倍。
答案 9 :(得分:-1)
另一种方法,使用集合:
typedef vector<int> minValues;