
Time: 2020-01-27 15:54:10

Tags: python algorithm list numpy graph


x = [[1, 2, 3, 4, 5, 6, 7],  # sequence 1
     [6, 5, 10, 11],  # sequence 2
     [9, 8, 2, 3, 4, 5],  # sequence 3
     [12, 12, 6, 5],  # sequence 4
     [5, 8, 3, 4, 2],  # sequence 5
     [1, 5],  # sequence 6
     [2, 8, 8, 3, 5, 9, 1, 4, 12, 5, 6],  # sequence 7
     [7, 1, 7, 3, 4, 1, 2],  # sequence 8
     [9, 4, 12, 12, 6, 5, 1]]  # sequence 9



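  1. If target does not appear in a list, we ignore that list entirely
  2. If a list is shorter than M (here M=4), we ignore that list entirely
  3. If a list has exactly length M but target is not in the Mth position, we ignore it (but we count it if target is in the Mth position)
  4. If the list length, L, is greater than M and target is in the i=M position (or the i=M+1 position, or the i=M+2 position, ..., or the i=L position), then we count the subsequence of length M in which target is in the final position

So, for the list of lists above with target=5 and M=4, we would count the following subsequences:

subseqs = [[2, 3, 4, 5],  # taken from sequence 1
           [2, 3, 4, 5],  # taken from sequence 3
           [12, 12, 6, 5],  # taken from sequence 4
           [8, 8, 3, 5],  # taken from sequence 7
           [1, 4, 12, 5],  # taken from sequence 7
           [12, 12, 6, 5],  # taken from sequence 9
          ]

Of course, what we want are the N most frequent subsequences. By count, [2, 3, 4, 5] and [12, 12, 6, 5] are the two most frequent. If N=3, all of the subsequences (subseqs) would be returned, since there is a tie for third.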


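  1. The input consists of billions of lists of positive integers (between 1 and 10,000)
  2. Each list can be as short as 1 element or as long as 500 elements
  3. M and N can be as small as 1 or as large as 100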


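  1. Assuming M and N are always less than 100, is there an efficient data structure that allows fast queries?
  2. Are there efficient algorithms, or a relevant area of research, for performing this kind of analysis for various combinations of M and N?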

4 answers:

Answer 0 (score: 0)

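Here is an idea based on a generalized suffix tree structure. Your list of lists can be seen as a list of strings whose alphabet is made of integers (so roughly 10k characters in the alphabet, given the information you provided).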


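Then you would also want a lookup table from (i, d) (where i is the integer you are looking for, the target, and d is the depth in your tree, i.e. M) to the set of nodes reached through suffix links that are labeled with the letter i at depth d (a letter of the alphabet here being an integer, not a character). This lookup table can be built by traversing the suffix links (BFS or DFS). You could even store only the node that corresponds to the highest counter value.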
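To make the (i, d) lookup table more concrete, here is a minimal sketch that swaps the suffix tree for a plain hash index: a dictionary from (i, d) to a Counter of the length-d windows that end with i. It is not a suffix tree and it stores every window explicitly, so memory grows with the total number of elements times max_depth. The names build_index and max_depth are made up for this illustration.

```python
from collections import Counter, defaultdict

def build_index(lists, max_depth):
    """Map (i, d) -> Counter of the length-d windows that end with integer i."""
    index = defaultdict(Counter)
    for lst in lists:
        for end in range(1, len(lst) + 1):        # window is lst[end-d:end]
            last = lst[end - 1]                   # the integer the window ends with
            for d in range(1, min(end, max_depth) + 1):
                index[(last, d)][tuple(lst[end - d:end])] += 1
    return index

x = [[1, 2, 3, 4, 5, 6, 7],
     [6, 5, 10, 11],
     [9, 8, 2, 3, 4, 5],
     [12, 12, 6, 5],
     [5, 8, 3, 4, 2],
     [1, 5],
     [2, 8, 8, 3, 5, 9, 1, 4, 12, 5, 6],
     [7, 1, 7, 3, 4, 1, 2],
     [9, 4, 12, 12, 6, 5, 1]]

index = build_index(x, max_depth=4)

# query: target i=5, depth d=4, top N=2
print(index[(5, 4)].most_common(2))
# [((2, 3, 4, 5), 2), ((12, 12, 6, 5), 2)]
```

Once the index is built, each (target, M) query is a single dictionary lookup followed by most_common(N). A real suffix tree would share storage between overlapping windows instead of copying them.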



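For the suffix tree implementation, I recommend reading only the original papers until you have a deep and real understanding of them (e.g. this or that; sc*-h*b can be your friend), rather than online "explanations", which are riddled with approximations and errors (even this post can help you get a first idea, but it will mislead you at some point if your goal is to implement a correct version).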

Answer 1 (score: 0)


import numpy as np

x = [[1, 2, 3, 4, 5, 6, 7],  # sequence 1
     [6, 5, 10, 11],  # sequence 2
     [9, 8, 2, 3, 4, 5],  # sequence 3
     [12, 12, 6, 5],  # sequence 4
     [5, 8, 3, 4, 2],  # sequence 5
     [1, 5],  # sequence 6
     [2, 8, 8, 3, 5, 9, 1, 4, 12, 5, 6],  # sequence 7
     [7, 1, 7, 3, 4, 1, 2],  # sequence 8
     [9, 4, 12, 12, 6, 5, 1]]  # sequence 9

# pad the ragged lists with zeros into one 2-D array
lens = np.fromiter(map(len, x), dtype=int)  # np.int was removed in NumPy 1.24
n1, n2 = len(lens), lens.max()
arr = np.zeros((n1, n2), dtype=int)

mask = np.arange(n2) < lens[:,None]  # True where a row holds real data
arr[mask] = np.concatenate(x)
>> [[ 1  2  3  4  5  6  7  0  0  0  0]
 [ 6  5 10 11  0  0  0  0  0  0  0]
 [ 9  8  2  3  4  5  0  0  0  0  0]
 [12 12  6  5  0  0  0  0  0  0  0]
 [ 5  8  3  4  2  0  0  0  0  0  0]
 [ 1  5  0  0  0  0  0  0  0  0  0]
 [ 2  8  8  3  5  9  1  4 12  5  6]
 [ 7  1  7  3  4  1  2  0  0  0  0]
 [ 9  4 12 12  6  5  1  0  0  0  0]]


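M = 4       # subsequence length
target = 5  # value that must end each subsequence
# only columns M-1 and later can end a subsequence of length M
r, c = np.where(arr[:, M-1:]==target)
# c is relative to the slice, so it is also the start column of the subsequence
arr[r[:,None], (c[:,None] + np.arange(M))]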
>>array([[ 2,  3,  4,  5],
   [ 2,  3,  4,  5],
   [12, 12,  6,  5],
   [ 8,  8,  3,  5],
   [ 1,  4, 12,  5],
   [12, 12,  6,  5]])
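The array above still has to be reduced to the N most frequent rows. One possible way, sketched here as an assumption rather than as part of the original answer, is np.unique with axis=0 and return_counts=True, followed by a stable sort on the counts. The name top_n is introduced only for this example. np.unique returns rows in lexicographic order, so tied rows come out in that order rather than in order of first appearance.

```python
import numpy as np

x = [[1, 2, 3, 4, 5, 6, 7],
     [6, 5, 10, 11],
     [9, 8, 2, 3, 4, 5],
     [12, 12, 6, 5],
     [5, 8, 3, 4, 2],
     [1, 5],
     [2, 8, 8, 3, 5, 9, 1, 4, 12, 5, 6],
     [7, 1, 7, 3, 4, 1, 2],
     [9, 4, 12, 12, 6, 5, 1]]
M, target, top_n = 4, 5, 2

# pad the ragged lists with zeros (inputs are positive, so 0 never matches target)
lens = np.fromiter(map(len, x), dtype=int)
arr = np.zeros((len(x), lens.max()), dtype=int)
arr[np.arange(lens.max()) < lens[:, None]] = np.concatenate(x)

# every length-M window that ends with target
r, c = np.where(arr[:, M - 1:] == target)
subseqs = arr[r[:, None], c[:, None] + np.arange(M)]

# count identical rows, then order them by descending count
uniq, counts = np.unique(subseqs, axis=0, return_counts=True)
order = np.argsort(-counts, kind="stable")

for row, cnt in zip(uniq[order][:top_n], counts[order][:top_n]):
    print(row.tolist(), int(cnt))
# [2, 3, 4, 5] 2
# [12, 12, 6, 5] 2
```

This slices off exactly top_n rows and does not extend the result to include ties at the cutoff.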

Answer 2 (score: 0)



def gen_m(lst, m, val):
    """Yield each length-m slice of lst that ends with val, as a tuple.

    lst: sub_list to parse
    m:   length required
    val: target value
    """
    found = m - 1                              # skip positions too early to end a slice of length m
    for _ in range(lst[m-1:].count(val)):      # repeat by the count of val
        found = lst.index(val, found) + 1      # find the next index of val, then step just past it
        yield tuple(lst[found-m: found])       # yield the sliced sub_list of m length as a tuple


from collections import Counter
target = 5
req_len = 4

# the yielded sub_lists need to be tuples to be hashable for the Counter
counter = Counter(sub_tup for lst in x for sub_tup in gen_m(lst, req_len, target))


req_N = 2

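def gen_common(counter, n):
    s = set()                                  # counts that have already been yielded
    for i, (item, count) in enumerate(counter.most_common()):
        if i < n or count in s:                # the top n, plus anything tied with them
            s.add(count)
            yield item
        else:
            return                             # counts only decrease from here on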

result = list(gen_common(counter, req_N))

Result with N == 2

[(2, 3, 4, 5), (12, 12, 6, 5)]

Result with N == 3

[(2, 3, 4, 5), (12, 12, 6, 5), (8, 8, 3, 5), (1, 4, 12, 5)]


x = [[1, 2, 3, 4, 5, 6, 7],  
     [6, 5, 10, 11],  
     [9, 8, 2, 3, 4, 5],  
     [12, 12, 6, 5],  
     [5, 8, 3, 4, 2],  
     [1, 5],  
     [2, 8, 8, 3, 5, 9, 1, 4, 12, 5, 6],  
     [7, 1, 7, 3, 4, 1, 2],  
     [9, 4, 12, 12, 6, 5, 1],  
     [9, 4, 12, 12, 6, 5, 1],  
     [9, 4, 2, 3, 4, 5, 1],  
     [9, 4, 8, 8, 3, 5, 1],  
     [9, 4, 7, 8, 9, 5, 1],     
     [9, 4, 1, 2, 2, 5, 1],  
     [9, 4, 12, 12, 6, 5, 1],  
     [9, 4, 12, 12, 6, 5, 1],  
     [9, 4, 1, 4, 12, 5],  
     [9, 1, 4, 12, 5, 1]]


Counter({(12, 12, 6, 5): 5, (2, 3, 4, 5): 3, (1, 4, 12, 5): 3, (8, 8, 3, 5): 2, (7, 8, 9, 5): 1, (1, 2, 2, 5): 1})


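# rebuild the counter for the larger x
c = Counter(sub_tup for lst in x for sub_tup in gen_m(lst, req_len, target))
for i in range(6):
    # testing req_N from 0 to 5
    print(f"# req_N = {i}:", list(gen_common(c, i)))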

# req_N = 0: []
# req_N = 1: [(12, 12, 6, 5)]
# req_N = 2: [(12, 12, 6, 5), (2, 3, 4, 5), (1, 4, 12, 5)]
# req_N = 3: [(12, 12, 6, 5), (2, 3, 4, 5), (1, 4, 12, 5)]
# req_N = 4: [(12, 12, 6, 5), (2, 3, 4, 5), (1, 4, 12, 5), (8, 8, 3, 5)]
# req_N = 5: [(12, 12, 6, 5), (2, 3, 4, 5), (1, 4, 12, 5), (8, 8, 3, 5), (7, 8, 9, 5), (1, 2, 2, 5)]

Answer 3 (score: 0)

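Since there is not just one N, M and target, I assume there are chunks of lists of lists. Here is an approach with O(N + M) time complexity (where N is the number of lists in a chunk and M is the total number of elements):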

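from collections import Counter

def get_seq(x, M, target):
    """Yield every length-M window of the lists in x that ends with target."""
    last = M - 1  # earliest index at which a length-M window can end
    for lst in x:
        if len(lst) < M:
            continue  # too short to contain a window of length M
        for i in range(last, len(lst)):
            if lst[i] == target:
                # convert to str to be hashable
                yield str(lst[i - last : i + 1])

def process_chunk(x, M, N, target):
    """Return the N most common windows in this chunk with their counts."""
    return Counter(get_seq(x, M, target)).most_common(N)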


process_chunk(x, M, 2, target)


[('[2, 3, 4, 5]', 2), ('[12, 12, 6, 5]', 2)]


%timeit process_chunk(x, M, 2, target)
# 25 µs ± 713 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
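If the data really does arrive in chunks, the per-chunk results still have to be combined. The sketch below is an assumption about how that could look, not part of the original answer: it keeps one running Counter across all chunks and only calls most_common(N) at the very end, because truncating each chunk to its own top N first can drop a subsequence that is frequent overall but never in the top N of any single chunk. The names iter_subseqs and top_n_over_chunks, and the split of the example data into chunk_a and chunk_b, are made up for this illustration.

```python
from collections import Counter

def iter_subseqs(chunk, M, target):
    """Yield each length-M window (as a tuple) that ends with target."""
    for lst in chunk:
        for end in range(M, len(lst) + 1):       # window is lst[end-M:end]
            if lst[end - 1] == target:
                yield tuple(lst[end - M:end])

def top_n_over_chunks(chunks, M, N, target):
    """Accumulate full counts over all chunks, then take the top N once."""
    total = Counter()
    for chunk in chunks:
        total.update(iter_subseqs(chunk, M, target))
    return total.most_common(N)

# the nine example sequences, split arbitrarily into two chunks
chunk_a = [[1, 2, 3, 4, 5, 6, 7],
           [6, 5, 10, 11],
           [9, 8, 2, 3, 4, 5],
           [12, 12, 6, 5]]
chunk_b = [[5, 8, 3, 4, 2],
           [1, 5],
           [2, 8, 8, 3, 5, 9, 1, 4, 12, 5, 6],
           [7, 1, 7, 3, 4, 1, 2],
           [9, 4, 12, 12, 6, 5, 1]]

result = top_n_over_chunks([chunk_a, chunk_b], M=4, N=2, target=5)
print(result)
# [((2, 3, 4, 5), 2), ((12, 12, 6, 5), 2)]
```

Here (12, 12, 6, 5) occurs once in chunk_a and once in chunk_b, so it only reaches a count of 2 after the two counters are merged.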