Efficient way to compare two lists of different lengths - Python

Time: 2019-08-28 21:06:56

Tags: python for-loop recursion

I am trying to compare list_A (60 elements) against list_B (~300,000 elements) and return, as a list, a count of how many elements of list_A occur in each element of list_B.

My lists look like this:

list_A = ['CAT - cats are great', 'DOG - dogs are great too'] 
list_B = ['CAT - cats are great(A)DOG - dogs are great too(B)', 'DOG - dogs are great too(B)']

I expect my count to return: [2, 1]

My implementation works, but it contains a nested for loop, which makes the running time long.

def count_matches(list_A, list_B):
    counts = []  # renamed from `list` to avoid shadowing the built-in
    for sentence in list_B:
        count = 0
        for phrase in list_A:
            if phrase in sentence:  # substring test
                count += 1
        counts.append(count)
    return counts

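Any help would be greatly appreciated! Thanks :)

3 Answers:

Answer 0: (score: 0)

Since you are looking for substrings, I don't think there is any way to optimize it. You can, however, simplify the code with a list comprehension and sum().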

result = [sum(phrase in sentence for phrase in list_A) for sentence in list_B]
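
As a quick check, the one-liner can be run against the sample data from the question. This is only a minimal sketch; the wrapper name count_phrases is introduced here for illustration and is not part of the original answer.

```python
def count_phrases(phrases, sentences):
    """For each sentence, count how many phrases occur in it as substrings."""
    # `phrase in sentence` is a bool; sum() adds True as 1 and False as 0.
    return [sum(phrase in sentence for phrase in phrases) for sentence in sentences]


list_A = ['CAT - cats are great', 'DOG - dogs are great too']
list_B = ['CAT - cats are great(A)DOG - dogs are great too(B)',
          'DOG - dogs are great too(B)']

result = count_phrases(list_A, list_B)
print(result)
```

The amount of work is the same as in the nested loop (every phrase is still tested against every sentence); only the bookkeeping is shorter.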

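Answer 1: (score: 0)

@Barmar's answer is fast and correct if you know list_A beforehand or only need to run it once. If that is not the case, you could consider the approach below (it should also be fast, but it has more steps).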

import collections 

def count(target, summaries):
    return [sum(s[t] for t in target) for s in summaries]

mines = ['aa', 'ab', 'abc', 'aabc']
summaries = [collections.Counter(m) for m in mines]
gold = ['a', 'b']
silver = ['c']
assert count(gold, summaries) == [2, 2, 2, 3]
assert count(silver, summaries) == [0, 0, 1, 1]

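It is also worth noting that, if you are looking at 60 against 300,000, this toy example may be missing some speed-ups and simplifications that your real data allows, e.g. if the 60 are the numbers 1-60, or are alphanumeric, etc. It may also be that the number of non-matching values is so small that it is easier to count those and subtract them from the length.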
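
One caveat about the Counter-based summaries above may be worth illustrating: collections.Counter applied to a string counts individual characters, so a lookup such as s[t] only finds single-character targets, and it returns the number of occurrences rather than a yes/no match. A small sketch (the variable names are illustrative only):

```python
import collections

summary = collections.Counter('aabc')

single = summary['a']   # the character 'a' occurs twice in 'aabc'
multi = summary['ab']   # 'ab' is not a key; Counter returns 0 for missing keys

print(single, multi)
```

For multi-character phrases like those in the question, the summaries would therefore need to be built from something other than single characters.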

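Answer 2: (score: 0)

I have actually answered an almost identical question before, which can be found here: https://stackoverflow.com/a/55914487/2284490 The only difference is that in the algorithm you want len(matches) rather than any(matches).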

This can be solved efficiently as a variation of the Aho Corasick algorithm.

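It is an efficient dictionary-matching algorithm that locates all patterns within a text simultaneously in O(p + q + r), where p = length of the patterns, q = length of the text, and r = length of the returned matches.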

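You may want to run two separate state machines simultaneously, and you would need to modify them so that they terminate at the first match.

I took a stab at the modifications, starting from this python implementation: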

class AhoNode(object):
    def __init__(self):
        self.goto = {}
        self.count = 0
        self.fail = None

def aho_create_forest(patterns):
    root = AhoNode()
    for path in patterns:
        node = root
        for symbol in path:
            node = node.goto.setdefault(symbol, AhoNode())
        node.count += 1
    return root

def aho_create_statemachine(patterns):
    root = aho_create_forest(patterns)
    queue = []
    for node in root.goto.values():  # .values() works on both Python 2 and 3
        queue.append(node)
        node.fail = root
    while queue:
        rnode = queue.pop(0)
        for key, unode in rnode.goto.items():  # .items() works on both Python 2 and 3
            queue.append(unode)
            fnode = rnode.fail
            while fnode is not None and key not in fnode.goto:
                fnode = fnode.fail
            unode.fail = fnode.goto[key] if fnode else root
            unode.count += unode.fail.count
    return root

def aho_count_all(s, root):
    total = 0
    node = root
    for i, c in enumerate(s):
        while node is not None and c not in node.goto:
            node = node.fail
        if node is None:
            node = root
            continue
        node = node.goto[c]
        total += node.count
    return total

def pattern_counter(patterns):
    ''' Returns an efficient counter function that takes a string
    and returns the number of patterns matched within it
    '''
    machine = aho_create_statemachine(patterns)
    def counter(text):
        return aho_count_all(text, machine)
    return counter

And to use it:

patterns = ['CAT - cats are great', 'DOG - dogs are great too'] 
counter = pattern_counter(patterns)
text_list = ['CAT - cats are great(A)DOG - dogs are great too(B)',
             'DOG - dogs are great too(B)']
for text in text_list:
    print('%r - %s' % (text, counter(text)))

This displays:

'CAT - cats are great(A)DOG - dogs are great too(B)' - 2
'DOG - dogs are great too(B)' - 1

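Note that this solution counts every match separately, so looking for 'a' and 'b' in 'aba' gives 3. If you want at most one match per pattern, you need to keep track of all the patterns seen, with a minor modification that turns the integer into a set: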

- self.count = 0
+ self.seen = set()
...
- node.count += 1
+ node.seen.add(path)
...
- unode.count += unode.fail.count
+ unode.seen |= unode.fail.seen
...
- total = 0
+ all_seen = set()
- total += node.count
+ all_seen |= node.seen
- return total
+ return len(all_seen)
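
For reference, here is a sketch of what the code might look like with the changes above applied, written for Python 3. The function names build_machine and count_distinct are hypothetical and chosen for this illustration, the trie and failure-link steps are merged into one function, and collections.deque replaces the list-based queue. None of these choices come from the original answer.

```python
from collections import deque


class AhoNode:
    def __init__(self):
        self.goto = {}      # symbol -> child node
        self.seen = set()   # patterns ending here or at any suffix of this state
        self.fail = None    # failure link (None only for the root)


def build_machine(patterns):
    """Build the trie and failure links (hypothetical name, for illustration)."""
    root = AhoNode()
    for path in patterns:
        node = root
        for symbol in path:
            node = node.goto.setdefault(symbol, AhoNode())
        node.seen.add(path)

    # Breadth-first pass, so a node's failure target is complete before it is used.
    queue = deque()
    for node in root.goto.values():
        node.fail = root
        queue.append(node)
    while queue:
        rnode = queue.popleft()
        for key, unode in rnode.goto.items():
            queue.append(unode)
            fnode = rnode.fail
            while fnode is not None and key not in fnode.goto:
                fnode = fnode.fail
            unode.fail = fnode.goto[key] if fnode is not None else root
            unode.seen |= unode.fail.seen
    return root


def count_distinct(text, root):
    """Return how many distinct patterns occur in text (hypothetical name)."""
    all_seen = set()
    node = root
    for c in text:
        while node is not None and c not in node.goto:
            node = node.fail
        if node is None:
            node = root
            continue
        node = node.goto[c]
        all_seen |= node.seen
    return len(all_seen)


machine = build_machine(['a', 'b'])
print(count_distinct('aba', machine))
```

With the patterns 'a' and 'b', the text 'aba' now yields 2 instead of 3, because the second 'a' is already in the set of seen patterns.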