Question

I have a (very large) list, something like:

a = ['A', 'B', 'A', 'B', 'A', 'C', 'D', 'E', 'D', 'E', 'D', 'F', 'G', 'A', 'B']

I would like to extract from it a list of lists, such as:

result = [['A', 'B', 'A', 'B', 'A'], ['D', 'E', 'D', 'E', 'D']]

The repeating pattern can differ; for example, there can also be some gaps, such as:

['A', 'B', 'C', 'A', 'D', 'E', 'A'] (with a 'jump' over two elements)

I wrote some very simple code that seems to work:

tolerance = 2
counter = 0
start, stop = 0, 0
for idx in range(len(a) - 1):
    if a[idx] == a[idx+1] and counter == 0:
        start = idx
        counter += 1
    elif a[idx] == a[idx+1] and counter != 0:
        if tolerance <= 0: 
            stop = idx
        tolerance = 2
    elif a[idx] != a[idx+1]:
        tolerance -= 1
    if start != 0 and stop != 0:
        result = [a[start::stop]]

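But 1) it is very cumbersome, and 2) I need to apply this to very large lists, so is there a more concise and faster way to implement it?

Edit: as @Kasramvd correctly pointed out, I need the largest set that satisfies the requirement (with at most a certain tolerance of jumps between the start and end elements), so I would take: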

['A', 'B', 'A', 'B', 'A'] instead of [ 'B', 'A', 'B' ]

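because the former includes the latter.

It would also be nice if the code could select elements up to a certain tolerance. For example, if the tolerance (the maximum number of elements not equal to the start or end element) is 2, it should also return sets such as: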

['A', 'A', 'A', 'B', 'A', 'B', 'A', 'C', 'D', 'A']

with tolerances of 0, 1 and 2.

Answer 1

A solution that makes no copies of any list other than the resulting sublists:

def sublists(a, tolerance):
    result = []
    index = 0

    while index < len(a):
        curr = a[index]

        for i in range(index, len(a)):
            if a[i] == curr:
                end = i
            elif i - end > tolerance:
                break

        if index != end:
            result.append(a[index:end+1])
        index += end - index + 1

    return result

Usage is simple:

a = ['A', 'B', 'A', 'B', 'A', 'C', 'D', 'E', 'D', 'E', 'D', 'F', 'G', 'A', 'B']

sublists(a, 0)  # []
sublists(a, 1)  # [['A', 'B', 'A', 'B', 'A'], ['D', 'E', 'D', 'E', 'D']]
sublists(a, 2)  # [['A', 'B', 'A', 'B', 'A'], ['D', 'E', 'D', 'E', 'D']]

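A possible modification, as discussed in the comments, that also ends a run where the same element appears twice in a row (it replaces the body of the for loop):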

if i > index and a[i] == a[i-1] == curr:
    end = i - 1
    break
elif a[i] == curr:
    end = i
elif i - end > tolerance:
    break

Note: I have not tested this thoroughly.
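To make the effect of that change easier to see, here is a sketch with the extra branch folded into the full function. The name sublists_split_doubles is made up for this illustration, and, like the snippet above, it has only been checked on small inputs.

```python
def sublists_split_doubles(a, tolerance):
    """Variant of sublists() that also ends a run at two adjacent equal elements.

    The function name is hypothetical; the extra branch is the one
    suggested in the comments.
    """
    result = []
    index = 0
    while index < len(a):
        curr = a[index]
        end = index
        for i in range(index, len(a)):
            if i > index and a[i] == a[i - 1] == curr:
                end = i - 1  # close the run before the doubled element
                break
            elif a[i] == curr:
                end = i
            elif i - end > tolerance:
                break
        if index != end:
            result.append(a[index:end + 1])
        index = end + 1
    return result


print(sublists_split_doubles(['A', 'B', 'A', 'A', 'B', 'A'], 1))
# [['A', 'B', 'A'], ['A', 'B', 'A']]
```

On input without adjacent duplicates, such as the list from the question, it should behave like the original sublists().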

Answer 2

It may be easier to write this recursively.

def rep_sublist(x):
    global thisrun, collection
    if len(x) == 0:
        return None
    try: # find the next value in x that is same as x[0]
        nextidx = x[1:].index(x[0])
    except ValueError: # not found, set nextidx to something larger than tol
        nextidx = tol + 1

    if nextidx <= tol: # there is repetition within tol, add to thisrun, restart at the next repetition
        thisrun += x[:nextidx+1]
        rep_sublist(x[nextidx+1:])
    else: # no rep within tol, add in the last element, restart afresh from the next element
        thisrun.append(x[0])  # append, so that multi-character items are not split into letters
        if len(thisrun)>1:
            collection.append(thisrun)
        thisrun = []
        rep_sublist(x[1:])


tol = 2
collection = []
thisrun = []
x = ['A', 'B', 'A', 'B', 'A', 'C', 'D', 'E', 'D', 'E', 'D', 'F', 'G', 'A', 'B', 'A', 'A', 'A', 'B', 'A', 'B', 'A', 'C', 'D', 'A']
rep_sublist(x)
print(collection)

#[['A', 'B', 'A', 'B', 'A'], ['D', 'E', 'D', 'E', 'D'], ['A', 'B', 'A', 'A', 'A', 'B', 'A', 'B', 'A', 'C', 'D', 'A']]


tol = 1 # now change tolerance to 1
collection = []
thisrun = []
rep_sublist(x)
print(collection) # last sublist is shorter

#[['A', 'B', 'A', 'B', 'A'], ['D', 'E', 'D', 'E', 'D'], ['A', 'B', 'A', 'A', 'A', 'B', 'A', 'B', 'A']]

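This uses global variables, but it is easy to wrap it in a function. Note that it recurses once per element, so a very large list will exceed Python's default recursion limit.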
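One way such a wrapper could look is sketched below. It is not the code from this answer: the recursion is replaced by a loop so that long lists do not hit the recursion limit, the slice-and-search is replaced by a bounded list.index() call, and the name rep_sublists is invented for the example. The logic is otherwise meant to follow the same steps.

```python
def rep_sublists(x, tol):
    """Loop-based sketch of the recursive idea, without global state."""
    collection, thisrun = [], []
    i = 0
    while i < len(x):
        # Look for the next occurrence of x[i] with at most `tol` items in between.
        try:
            nxt = x.index(x[i], i + 1, i + tol + 2)
        except ValueError:
            nxt = None

        if nxt is not None:
            # Repetition within tolerance: extend the run and restart at the repeat.
            thisrun.extend(x[i:nxt])
            i = nxt
        else:
            # No repetition: close the current run with this element.
            thisrun.append(x[i])
            if len(thisrun) > 1:
                collection.append(thisrun)
            thisrun = []
            i += 1
    return collection


x = list('ABABACDEDEDFGABAAABABACDA')
print([''.join(s) for s in rep_sublists(x, 2)])
# ['ABABA', 'DEDED', 'ABAAABABACDA']
print([''.join(s) for s in rep_sublists(x, 1)])
# ['ABABA', 'DEDED', 'ABAAABABA']
```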

Answer 3

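list.index() actually accepts up to three arguments, which is very useful here. You can simply use l.index(item, start + 1, start + tolerance + 2) to find the next item, and catch the ValueError it raises when there is none.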
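As a quick illustration of the bounded search (the variable names here are only for the example), the stop argument is exclusive, so start + tolerance + 2 allows at most tolerance other elements between two matches:

```python
letters = ['A', 'B', 'C', 'A', 'D', 'E', 'A']
start = 0

# tolerance 2: only indices 1, 2 and 3 are searched
tolerance = 2
next_pos = letters.index(letters[start], start + 1, start + tolerance + 2)
print(next_pos)  # 3

# tolerance 1: indices 1 and 2 hold no 'A', so index() raises ValueError
tolerance = 1
try:
    next_pos = letters.index(letters[start], start + 1, start + tolerance + 2)
except ValueError:
    next_pos = None
print(next_pos)  # None
```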

l = list("aaa,..a/,a../a,.aaa.a,..a/,.aaa.,..aaa.,..a/.,a..a,./a.aaa.,a.a..a/.aa..a,.a/a.,a../.,a/..a..a/.a..,a/.,.a/a.")

def find_sublist(l, start, tol, found):
    # a is the value to check, i_l and i_r stand for "index_left" and "index_right", respectively
    a = l[start]
    i_l, i_r = start, start
    try:
        while True:
            i_r = l.index(a, i_r + 1, i_r + tol + 2)
    except ValueError:
        pass

    if i_l < i_r:
        found.append(l[i_l:i_r + 1])
    return i_r + 1

def my_split(l):
    found = []
    i = 0
    while i < len(l):
        i = find_sublist(l, i, 2, found)
    return found

print([ "".join(s) for s in my_split(l) ])

Output (the join at the end is for illustration only: strings are easier to read than lists of single characters):

['aaa', '..', 'a/,a', '..', 'a,.aaa.a', '..', 'aaa', '.,..', 'aaa', '.,..a/.,a..a,./a.', 'aaa.,a.a..a/.aa..a,.a/a.,a', '../.', '..a..a/.a..', '.,.', 'a/a']

For the example input (the first block) with tol = 2, it gives the following result:

['ABABA', 'DEDED']

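The main function, find_sublist, is about 10 (non-blank) lines, and my_split, which drives it, is only a few lines more. I don't like recursion when a plain loop does the job.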

Answer 4

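You can define a custom iterator for this. There is no need to create lots of sublists.

The idea is simple:

Slice the list with a step size (what you call the 'jump').
Iterate over the sliced list and check whether the previous element equals the current one:
- Yes: remember that you are currently inside a sublist and continue
- No: check whether you were inside a sublist:
  - Yes: you are at the end of a sublist, so yield the list slice corresponding to that sublist and continue.
  - No: move on and keep looking

One minor complication: you need to run this procedure for every starting index between 0 and step, otherwise you would miss repeating patterns of the form l[x+i] == l[x+step+i] for any 0 < i < step.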

This is what the iterator looks like:

def get_sec_it(a_list, step=1):
    # this is the minor complication (min() guards lists shorter than step)
    for _start in range(min(step, len(a_list))):
        prev_el = a_list[_start]  # as we compare previous and current element,
        prev_idx = _start         # we store the first element here and iterate from the second on
        insec = False
        # iteration from the second element of the sliced list
        for idx in range(_start + step, len(a_list), step):
            el = a_list[idx]  # get the element
            if el == prev_el:  # compare it with the previous one (step 2, first check)
                insec = True
                continue
            # now we are in the "no" branch of step 2
            if insec:  # step 2 - no - yes
                insec = False
                yield a_list[prev_idx: idx - step + 1]
            prev_el = el    # continue the iteration by
            prev_idx = idx  # updating the previous element
        # at the very end of a slice we won't necessarily encounter an element
        # different from the previous one, so yield the sequence if we were in one
        if insec:
            yield a_list[prev_idx:idx + 1]
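The usage example that belonged here did not survive, so the following is only a sketch of how the generator might be called on the list from the question. The function body is repeated in compact form so the snippet runs on its own; step=2 means "equal elements two positions apart".

```python
def get_sec_it(a_list, step=1):
    """Compact copy of the iterator above, repeated so this snippet is self-contained."""
    for _start in range(min(step, len(a_list))):
        prev_el, prev_idx, insec = a_list[_start], _start, False
        for idx in range(_start + step, len(a_list), step):
            el = a_list[idx]
            if el == prev_el:
                insec = True
                continue
            if insec:
                insec = False
                yield a_list[prev_idx: idx - step + 1]
            prev_el, prev_idx = el, idx
        if insec:
            yield a_list[prev_idx:idx + 1]


a = ['A', 'B', 'A', 'B', 'A', 'C', 'D', 'E', 'D', 'E', 'D', 'F', 'G', 'A', 'B']
for section in get_sec_it(a, step=2):
    print(section)
# ['A', 'B', 'A', 'B', 'A']
# ['D', 'E', 'D', 'E', 'D']
# ['B', 'A', 'B']
# ['E', 'D', 'E']
```

Because every starting offset is scanned separately, the shorter inner patterns ('B', 'A', 'B' and 'E', 'D', 'E') are reported as well; filtering them out, if only the largest sections are wanted, is left to the caller.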

Fast. Memory efficient. Easy to use.

Et voilà, you're welcome! :)

Answer 5

I think this implements the sequence-finding logic you want. I am fairly sure it can be improved, but I hope it is still useful.


Answer 6

Somewhat similar to @RadhikeJCJ's answer:

a = ['A', 'B', 'A', 'B', 'A', 'C', 'D', 'E', 'D', 'E', 'D', 'F', 'G', 'A', 'B', 'A', 'A', 'A', 'B', 'A', 'B', 'A', 'C', 'D', 'A']
tol = 1
a_str = ''.join(a)

idx_to_split = 0
output = []
while idx_to_split < len(a_str):
    a_str = a_str[idx_to_split:]
    split_char = a_str[0]
    # Pieces between consecutive occurrences of split_char. The first piece
    # (before the leading char) and the last one (the tail after the final
    # occurrence) are dropped, because neither is enclosed by two occurrences.
    gaps = a_str.split(split_char)[1:-1]
    out = []
    for gap in gaps:
        if len(gap) <= tol:
            out.append(gap)
        else:
            break

    if out:
        # Rebuild the run: the first char, then each accepted gap plus the char again
        final = split_char + ''.join(gap + split_char for gap in out)
        idx_to_split = len(final)
        output.append(final)
    else:
        idx_to_split = 1

#For tolerance 2,
#output = ['ABABA', 'DEDED', 'ABAAABABACDA']

#For tolerance 1,
#output = ['ABABA', 'DEDED', 'ABAAABABA']

Answer 7

You need to set the desired length here: if len(tmp) > 2:

If you want a length of 5:

len(tmp) == 5, or whatever you need...

a = ['A', 'B', 'A', 'B', 'A', 'C', 'D', 'E', 'D', 'E', 'D', 'F', 'G', 'A', 'B']
start = -1
stop = -1
result = []
for i,c in enumerate(a):
    start = i
    for idx in range(i,len(a)-1,2):
        if c == a[idx]:
            stop = idx+1
        else:
            break
    tmp = a[start:stop]
    if len(tmp) == 5:
        result.append(tmp)
        print(tmp)
    start = -1
    stop = -1
print(result)
#[['A', 'B', 'A', 'B', 'A'], ['D', 'E', 'D', 'E', 'D']]

Answer 8

If speed is your goal and your data can easily be converted to an array, I would suggest a numpy solution.

Suppose you have

import numpy as np

a = np.array(['A', 'B', 'A', 'B', 'A', 'C', 'D', 'E', 'D', 'E', 'D', 'F', 'G', 'A', 'B'])
tolerance = 1

To check whether any elements are equal at exactly the tolerance distance, you can do something like diff, but testing for equality instead:

tolerance += 1
mask = a[:-tolerance] == a[tolerance:]
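To see what this shifted comparison produces, here is a small self-contained run on the example array. The variable name shift is only used for this illustration and stands for tolerance after the increment.

```python
import numpy as np

a = np.array(list('ABABACDEDEDFGAB'))
shift = 2  # a tolerance of 1, plus one
mask = a[:-shift] == a[shift:]  # mask[j] is True where a[j] == a[j + shift]
print(mask.astype(int))
# [1 1 1 0 0 0 1 1 1 0 0 0 0]
```

Note that the mask is shift elements shorter than a, and each True marks only the left end of a matching pair.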

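If you smear this boolean mask tolerance elements to the right, each contiguous run of True values marks the elements you are interested in. A short way to do this is with np.lib.stride_tricks.as_strided: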

def smear(mask, n):
    view = np.lib.stride_tricks.as_strided(mask, shape=(n + 1, mask.size - n),
                                           strides=mask.strides * 2)
    view[1:, view[0]] = True

Since it operates in place, you can even turn it into a one-liner:

np.lib.stride_tricks.as_strided(mask, shape=(n + 1, mask.size - n),
                                strides=mask.strides * 2)[1:, mask[:-n]] = True

Then you apply it:

smear(mask, tolerance)

With a combination of np.diff, np.flatnonzero and np.split (see References), it is easy to find and extract the contiguous runs:

result = np.split(a, np.flatnonzero(np.diff(mask)) + 1)[1 - mask[0]::2]

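The only thing missing from this solution is that it does not pick up matching elements that are closer together than tolerance. To handle that, we can use np.lib.stride_tricks.as_strided to build the mask in a way that takes the tolerance into account (using np.any):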

b = np.lib.stride_tricks.as_strided(np.r_[a, np.zeros(tolerance, dtype=a.dtype)],
                                    shape=(tolerance + 1, a.size),
                                    strides=a.strides * 2)

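b is now a 3x15 array (15 being the length of a), in which each successive row is the sequence shifted one character further along. Remember that this is just a view of the original data. For large arrays this operation is essentially free.

Now you can apply np.any along the first dimension to find out which characters repeat within tolerance of each other: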

mask = np.any(b[0] == b[1:], axis=0)
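For readers who would rather not hand-write strides, the same mask can probably be built with np.lib.stride_tricks.sliding_window_view (available in NumPy 1.20 and later), which checks the bounds for you. This is a sketch of an alternative, not the code used in this answer; repeat_mask and reach are names made up for the example.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def repeat_mask(a, tol):
    """True at j if a[j] occurs again with at most `tol` elements in between."""
    a = np.asanyarray(a)
    reach = tol + 1  # largest index distance that still counts
    # Pad with "empty" values so that every position has a full window.
    padded = np.r_[a, np.zeros(reach, dtype=a.dtype)]
    windows = sliding_window_view(padded, reach + 1)  # shape (a.size, reach + 1)
    # Compare the first element of each window with the rest of that window.
    return (windows[:, :1] == windows[:, 1:]).any(axis=1)


print(repeat_mask(list('ABABACDEDEDFGAB'), 1).astype(int))
# [1 1 1 0 0 0 1 1 1 0 0 0 0 0 0]
```

As with the padded as_strided version, an element that equals the padding value (an empty string here) near the end of the array would be reported as a match, so this assumes the data contains no such elements.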

From here we proceed as before. This makes for a fairly small function:

TL;DR

def find_patterns(a, tol):
    a = np.asanyarray(a)
    tol += 1
    b = np.lib.stride_tricks.as_strided(np.r_[a, np.zeros(tol, dtype=a.dtype)],
                                        shape=(tol + 1, a.size),
                                        strides=a.strides * 2)
    mask = np.any(b[0] == b[1:], axis=0)
    np.lib.stride_tricks.as_strided(mask, shape=(tol + 1, mask.size - tol),
                                    strides=mask.strides * 2)[1:, mask[:-tol]] = True
    return np.split(a, np.flatnonzero(np.diff(mask)) + 1)[1 - mask[0]::2]

>>> find_patterns(['A', 'B', 'A', 'B', 'A', 'C', 'D', 'E', 'D', 'E', 'D', 'F', 'G', 'A', 'B'], 1)
[array(['A', 'B', 'A', 'B', 'A'], dtype='<U1'),
 array(['D', 'E', 'D', 'E', 'D'], dtype='<U1')]
>>> find_patterns(['A', 'B', 'C', 'A', 'D', 'E', 'A'], 1)
[]
>>> find_patterns(['A', 'B', 'C', 'A', 'D', 'E', 'A'], 2)
[array(['A', 'B', 'C', 'A', 'D', 'E', 'A'], dtype='<U1')]

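Appendix

If you read the references below carefully, you will see that the approaches shown here for smearing the mask and for finding the masked portions of the array were chosen for brevity rather than speed. A faster way to smear the mask, taken from one of those references, is: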

def smear(mask, n):
    n += 1
    mask1 = mask.copy()
    len0, len1 = 1, 1
    while len0 + len1 < n:
        mask[len0:] |= mask1[:-len0]
        mask, mask1 = mask1, mask
        len0, len1 = len1, len0 + len1
    mask1[n - len0:] |= mask[:-n + len0]
    return mask1

Similarly, a faster way to extract the contiguous masked regions from an array (also taken from the references) is:

def extract_masked(a, mask):
    mask = np.concatenate(([False], mask, [False]))
    idx = np.flatnonzero(mask[1:] != mask[:-1])
    return [a[idx[i]:idx[i + 1]] for i in range(0, len(idx), 2)]
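As a small usage sketch, extract_masked can be tried on a hand-written mask. The mask below is simply typed in to mark the two runs of the example array; in practice it would come from the smearing step. The function is repeated so the snippet runs on its own.

```python
import numpy as np


def extract_masked(a, mask):
    # Pad with False so that runs touching either end still produce two edges.
    mask = np.concatenate(([False], mask, [False]))
    idx = np.flatnonzero(mask[1:] != mask[:-1])  # alternating start/stop positions
    return [a[idx[i]:idx[i + 1]] for i in range(0, len(idx), 2)]


a = np.array(list('ABABACDEDEDFGAB'))
mask = np.array([True] * 5 + [False] + [True] * 5 + [False] * 4)
for run in extract_masked(a, mask):
    print(''.join(run))
# ABABA
# DEDED
```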

References

Select sublists from a Python list that begin and end with the same element
