Fastest way to determine whether an ordered sublist is in a large list of lists?

Asked: 2018-03-27 00:48:04

Tags: python list nested-lists

Suppose I have a my_huge_list_of_lists containing 2,000,000 lists, each about 50 elements long.

I want to shorten the 2,000,000-entry my_huge_list_of_lists by discarding the sublists that do not contain two elements in sequence.

So far I have:

# https://stackoverflow.com/questions/3313590/check-for-presence-of-a-sliced-list-in-python
def check_if_list_is_sublist(lst, sublst):
    #checks if a list appears in order in another larger list.
    n = len(sublst)
    return any((sublst == lst[i:i + n]) for i in xrange(len(lst) - n + 1))

my_huge_list_of_lists = [x for x in my_huge_list_of_lists
                            if not check_if_list_is_sublist(x, [a,b])]
my_huge_list_of_lists = [x for x in my_huge_list_of_lists
                            if not check_if_list_is_sublist(x, [b,a])]

The contiguity of the search terms [a,b] or [b,a] matters, so I cannot use set.issubset().

I find this slow and would like to speed it up. I have considered a few options, such as using an "early exit" with an and statement:

my_huge_list_of_lists = [x for x in my_huge_list_of_lists
                            if (a in x and not check_if_list_is_sublist(x, [a,b]))]

and going through the for loop fewer times by using an or statement:

my_huge_list_of_lists = [x for x in my_huge_list_of_lists
                            if not (check_if_list_is_sublist(x, [a,b])
                                    or check_if_list_is_sublist(x, [b,a]))]

and I am also working on speeding up the function itself (WIP):

# https://stackoverflow.com/questions/48232080/the-fastest-way-to-check-if-the-sub-list-exists-on-the-large-list
def check_if_list_is_sublist(lst, sublst):
    # checks if a list appears in order in another larger list.
    set_of_sublists = {tuple(sublst) for sublist in lst}

I have also done some searching on Stack Overflow, but I can't think of a way around it, because check_if_list_is_sublist() is called len(my_huge_list) * 2 times.

Edit: adding some sample data as requested

from random import randint
from string import ascii_lowercase
my_huge_list_of_lists = [[ascii_lowercase[randint(0, 25)] for x in range(50)] for y in range(2000000)]
my_neighbor_search_fwd = ['i', 'c']
my_neighbor_search_rev = my_neighbor_search_fwd[::-1]  # list.reverse() works in place and returns None

4 answers:

Answer 0: (score: 2)

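Unpack the items of the n-sized subsequence into n variables. Then write a list comprehension to filter the list, checking each sublist for a, b or b, a. For example: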

a, b = sublst

def checklst(lst, a, b):
    l = len(lst)
    start = 0
    while True:
        try:
            a_index = lst.index(a, start)
        except ValueError:
            return False
        try:
            return a_index > -1 and lst[a_index+1] == b
        except IndexError:
            try:
                return a_index > -1 and lst[a_index-1] == b
            except IndexError:
                start = a_index + 1
                if start == l:
                    return False
                continue # keep looking at the next a

%timeit found = [l for l in lst if checklst(l, a, b)]
1.88 s ± 31.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit found = [x for x in lst if (a in x and not check_if_list_is_sublist(x, [a,b]))]
22.1 s ± 1.67 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
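
Note that checklst as written returns as soon as it has inspected the first occurrence of a, and it only looks at the preceding element when that a is the last item of the list, so a later a that sits next to b is missed. The following is a minimal sketch of a variant that keeps scanning. The name contains_adjacent_pair is hypothetical, and the timings above were not measured with it.

```python
def contains_adjacent_pair(lst, a, b):
    """Return True if a and b sit next to each other in lst, in either order."""
    start = 0
    last = len(lst) - 1
    while True:
        try:
            i = lst.index(a, start)  # next occurrence of a at or after start
        except ValueError:
            return False  # no more occurrences of a
        if i < last and lst[i + 1] == b:
            return True  # found [a, b]
        if i > 0 and lst[i - 1] == b:
            return True  # found [b, a]
        start = i + 1  # keep looking at the next a


rows = [list("xaybz"), list("xabz"), list("bax"), list("aab"), list("a")]
kept = [row for row in rows if not contains_adjacent_pair(row, "a", "b")]
print(kept)  # [['x', 'a', 'y', 'b', 'z'], ['a']]
```

The search still relies on list.index, so the scan for a runs in C; only the neighbour checks happen in Python.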

Answer 1: (score: 1)

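So, I can't think of any clever algorithmic check that would really reduce the amount of work here. However, you are doing a lot of allocations in your code and iterating too much. So, just moving some declarations out of the function got me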

sublst = [a, b]
l = len(sublst)
indices = range(len(sublst))
def check_if_list_is_sublist(lst):
    for i in range(len(lst) - (l -1)):
        if lst[i] == sublst[0] and lst[i+1] == sublst[1]:
            return True
        if lst[i] == sublst[1] and lst[i + 1] == sublst[0]:
            return True
    return False

my_huge_list_of_lists = [x for x in my_huge_list_of_lists
                           if not check_if_list_is_sublist(x)]

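which cut the run time of your sample code above by about 50%. With a list of this size, spawning more processes and dividing up the work would probably yield a performance gain as well. I can't think of any way to really reduce the number of comparisons, though...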
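
One way to trim the per-element Python work a little further might be to compare adjacent pairs as tuples against a precomputed set, letting zip produce the pairs. This is only a sketch, assuming the elements are hashable; has_adjacent_pair and drop_matching are hypothetical names, and it has not been timed against the code above.

```python
def has_adjacent_pair(lst, targets):
    """Return True if any two neighbouring items of lst form a tuple in targets."""
    # zip(lst, lst[1:]) yields (lst[0], lst[1]), (lst[1], lst[2]), ...
    return any(pair in targets for pair in zip(lst, lst[1:]))


def drop_matching(list_of_lists, a, b):
    """Keep only the sublists in which a and b are never adjacent."""
    targets = {(a, b), (b, a)}  # built once, outside the loop
    return [x for x in list_of_lists if not has_adjacent_pair(x, targets)]


sample = [list("abc"), list("acb"), list("cba"), list("a")]
print(drop_matching(sample, "a", "b"))  # [['a', 'c', 'b'], ['a']]
```

Both orders are handled in a single pass over each sublist, at the cost of one temporary slice (lst[1:]) per call.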

Answer 2: (score: 1)

For searching for matches in one large list, I believe computing hash(element) and then building an index would be a good solution.

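The benefit you get: the index is built once, which saves time on later use (no need to loop over the data again for every search). We could even build the index when the program starts and release it when the program exits.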

The code below uses two ways of getting the hash value: hash() and str(). Sometimes you should write a custom hash function for your specific scenario.

With str(), the code looks simple and hash collisions do not need to be considered, but memory use may blow up.

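With hash(), I use a list to hold all sub_lst that share the same hash value. You can also use hash(sub_lst) % designed_length to control the size of the hash table (but that increases the collision rate).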

Output of the code below:

By Hash: 0.00023986603994852955
By str(): 0.00022884208565612796
By OP's: 0.3001317172469765
[Finished in 1.781s]

Test code:

from random import randint
from string import ascii_lowercase
import timeit

#Generate Test Data
my_huge_list_of_lists = [[ascii_lowercase[randint(0, 25)] for x in range(50)] for y in range(10000)]
#print(my_huge_list_of_lists)
test_lst = [['a', 'b', 'c' ], ['a', 'b', 'c'] ]
#Solution 1: By using built-in hash function
def prepare1(huge_list, interval=1): #use built-in hash function
    hash_db = {}
    for index in range(len(huge_list) - interval + 1):
        hash_sub = hash(str(huge_list[index:index+interval]))
        if hash_sub in hash_db:
            hash_db[hash_sub].append(huge_list[index:index+interval])
        else:
            hash_db[hash_sub] = [huge_list[index:index+interval]]
    return hash_db

hash_db = prepare1(my_huge_list_of_lists, interval=2)
def check_sublist1(hash_db, sublst): #use built-in hash function
    hash_sub = hash(str(sublst))
    if hash_sub in hash_db:
        return any([sublst == item for item in hash_db[hash_sub]])
    return False

print('By Hash:', timeit.timeit("check_sublist1(hash_db, test_lst)", setup="from __main__ import check_sublist1, my_huge_list_of_lists, test_lst, hash_db ", number=100))

#Solution 2: By using str() as hash function
def prepare2(huge_list, interval=1): #use str() as hash function
    return { str(huge_list[index:index+interval]):huge_list[index:index+interval] for index in range(len(huge_list) - interval + 1)}

hash_db = prepare2(my_huge_list_of_lists, interval=2)
def check_sublist2(hash_db, sublst): #use str() as hash function
    hash_sub = str(sublst)
    if hash_sub in hash_db:
        return sublst == hash_db[hash_sub]
    return False

print('By str():', timeit.timeit("check_sublist2(hash_db, test_lst)", setup="from __main__ import check_sublist2, my_huge_list_of_lists, test_lst, hash_db ", number=100))

#Solution 3: OP's current solution
def check_if_list_is_sublist(lst, sublst):
    #checks if a list appears in order in another larger list.
    n = len(sublst)
    return any((sublst == lst[i:i + n]) for i in range(len(lst) - n + 1))

print('By OP\'s:', timeit.timeit("check_if_list_is_sublist(my_huge_list_of_lists, test_lst)", setup="from __main__ import check_if_list_is_sublist, my_huge_list_of_lists, test_lst ", number=100))

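If you want to remove the matched elements from a list, that is doable, but the side effect is that you may have to rebuild the index for the new list, unless the list is a linked list and the index stores a pointer to each element. I googled "Python how to get the pointer for one element of a list" but could not find anything helpful. If anyone knows how to do it, please don't hesitate to share your solution. Thanks.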

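Below is a sample. It builds a new list instead of modifying the original one, since sometimes we still need to filter something from the original list.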

from random import randint
from string import ascii_lowercase
import timeit

#Generate Test Data
my_huge_list_of_lists = [[ascii_lowercase[randint(0, 1)] for x in range(2)] for y in range(100)]
#print(my_huge_list_of_lists)
test_lst = [[['a', 'b'], ['a', 'b'] ], [['b', 'a'], ['a', 'b']]]
#Solution 1: By using built-in hash function
def prepare(huge_list, interval=1): #use built-in hash function
    hash_db = {}
    for index in range(len(huge_list) - interval + 1):
        hash_sub = hash(str(huge_list[index:index+interval]))
        if hash_sub in hash_db:
            hash_db[hash_sub].append({'beg':index, 'end':index+interval, 'data':huge_list[index:index+interval]})
        else:
            hash_db[hash_sub] = [{'beg':index, 'end':index+interval, 'data':huge_list[index:index+interval]}]
    return hash_db

hash_db = prepare(my_huge_list_of_lists, interval=2)

def check_sublist(hash_db, sublst): #use built-in hash function
    hash_sub = hash(str(sublst))
    if hash_sub in hash_db:
        return [ item for item in hash_db[hash_sub] if sublst == item['data'] ]
    return []

def remove_if_match_sublist(target_list, hash_db, sublsts):
    matches = []
    for sublst in sublsts:
        matches += check_sublist(hash_db, sublst)
    #make sure delete elements from end to begin
    sorted_match = sorted(matches, key=lambda item:item['beg'], reverse=True)
    new_list = list(target_list)
    for item in sorted_match:
        del new_list[item['beg']:item['end']]
    return new_list

print('Removed By Hash:', timeit.timeit("remove_if_match_sublist(my_huge_list_of_lists, hash_db, test_lst)", setup="from __main__ import check_sublist, my_huge_list_of_lists, test_lst, hash_db, remove_if_match_sublist ", number=1))
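
The same build-once idea can also be applied to adjacent pairs inside each sublist, which is closer to what the question asks. The sketch below is an assumption about how such an index could look, not the code of this answer: build_pair_index and filter_out are hypothetical names, elements are assumed to be hashable, and the index holds up to one entry per distinct pair per sublist, so memory use grows accordingly.

```python
from collections import defaultdict


def build_pair_index(list_of_lists):
    """Map each adjacent pair (x, y) to the set of sublist positions containing it."""
    index = defaultdict(set)
    for pos, sub in enumerate(list_of_lists):
        for pair in zip(sub, sub[1:]):
            index[pair].add(pos)
    return index


def filter_out(list_of_lists, index, a, b):
    """Return the sublists that contain neither [a, b] nor [b, a]."""
    # .get avoids inserting empty sets into the defaultdict on a miss
    hits = index.get((a, b), set()) | index.get((b, a), set())
    return [sub for pos, sub in enumerate(list_of_lists) if pos not in hits]


data = [list("abc"), list("acb"), list("cba"), list("xyz")]
pair_index = build_pair_index(data)
print(filter_out(data, pair_index, "a", "b"))  # [['a', 'c', 'b'], ['x', 'y', 'z']]
print(filter_out(data, pair_index, "y", "z"))  # [['a', 'b', 'c'], ['a', 'c', 'b'], ['c', 'b', 'a']]
```

Building the index costs one full pass over the data, so it is likely to pay off only when many different pairs are queried against the same list.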

Answer 3: (score: 1)

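While this isn't what you'd call an "answer" per se, it is a benchmarking framework that can help you determine the fastest way to accomplish what you want, because it allows relatively easy modification as well as the addition of different approaches.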

I have put the currently posted answers into it, along with the results of running them.

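Caveats: note that it has not been verified that all the tested answers are "correct" in the sense that they actually do what you want, nor how much memory they consume in the process, which might be another consideration.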

Currently it appears that @Oluwafemi Sule's answer is the fastest, by an order of magnitude (10x) over the closest competitor.

from __future__ import print_function
from collections import namedtuple
import sys
from textwrap import dedent
import timeit
import traceback

N = 10  # Number of executions of each "algorithm".
R = 3  # Number of repetitions of those N executions.

from random import randint, randrange, seed
from string import ascii_lowercase

a, b = 'a', 'b'
NUM_SUBLISTS = 1000
SUBLIST_LEN = 50
PERCENTAGE = 50  # Percentage of sublist that should get removed.

seed(42)  # Initialize random number so the results are reproducible.
my_huge_list_of_lists = [[ascii_lowercase[randint(0, 25)] for __ in range(SUBLIST_LEN)]
                                for __ in range(NUM_SUBLISTS)]

# Put the target sequence in percentage of the sublists so they'll be removed.
for __ in range(NUM_SUBLISTS*PERCENTAGE // 100):
    list_index = randrange(NUM_SUBLISTS)
    sublist_index = randrange(SUBLIST_LEN)
    my_huge_list_of_lists[list_index][sublist_index:sublist_index+2] = [a, b]

# Common setup for all testcases (executed before any algorithm specific setup).
COMMON_SETUP = dedent("""
    from __main__ import a, b, my_huge_list_of_lists, NUM_SUBLISTS, SUBLIST_LEN, PERCENTAGE
""")

class TestCase(namedtuple('CodeFragments', ['setup', 'test'])):
    """ A test case is composed of separate setup and test code fragments. """
    def __new__(cls, setup, test):
        """ Dedent code fragment in each string argument. """
        return tuple.__new__(cls, (dedent(setup), dedent(test)))

testcases = {
    "OP (Nas Banov)": TestCase("""
        # https://stackoverflow.com/questions/3313590/check-for-presence-of-a-sliced-list-in-python
        def check_if_list_is_sublist(lst, sublst):
            ''' Checks if a list appears in order in another larger list. '''
            n = len(sublst)
            return any((sublst == lst[i:i+n]) for i in range(len(lst) - n + 1))
        """, """
        shortened = [x for x in my_huge_list_of_lists
                        if not check_if_list_is_sublist(x, [a, b])]
        """
    ),
    "Sphinx Solution 1 (hash)": TestCase("""
        # https://stackoverflow.com/a/49518843/355230

        # Solution 1: By using built-in hash function.
        def prepare1(huge_list, interval=1): # Use built-in hash function.
            hash_db = {}
            for index in range(len(huge_list) - interval + 1):
                hash_sub = hash(str(huge_list[index:index+interval]))
                if hash_sub in hash_db:
                    hash_db[hash_sub].append(huge_list[index:index+interval])
                else:
                    hash_db[hash_sub] = [huge_list[index:index+interval]]
            return hash_db

        def check_sublist1(hash_db, sublst): # Use built-in hash function.
            hash_sub = hash(str(sublst))
            if hash_sub in hash_db:
                return any([sublst == item for item in hash_db[hash_sub]])
            return False
        """, """
        hash_db = prepare1(my_huge_list_of_lists, interval=2)
        shortened = [x for x in my_huge_list_of_lists
                        if check_sublist1(hash_db, x)]
        """
    ),
    "Sphinx Solution 2 (str)": TestCase("""
        # https://stackoverflow.com/a/49518843/355230

        #Solution 2: By using str() as hash function
        def prepare2(huge_list, interval=1): # Use str() as hash function.
            return {str(huge_list[index:index+interval]):huge_list[index:index+interval]
                        for index in range(len(huge_list) - interval + 1)}


        def check_sublist2(hash_db, sublst): #use str() as hash function
            hash_sub = str(sublst)
            if hash_sub in hash_db:
                return sublst == hash_db[hash_sub]
            return False
        """, """
        hash_db = prepare2(my_huge_list_of_lists, interval=2)
        shortened = [x for x in my_huge_list_of_lists
                        if check_sublist2(hash_db, x)]
        """
    ),
    "Paul Becotte": TestCase("""
        # https://stackoverflow.com/a/49504792/355230
        sublst = [a, b]
        l = len(sublst)
        indices = range(len(sublst))

        def check_if_list_is_sublist(lst):
            for i in range(len(lst) - (l -1)):
                if lst[i] == sublst[0] and lst[i+1] == sublst[1]:
                    return True
                if lst[i] == sublst[1] and lst[i + 1] == sublst[0]:
                    return True
            return False
        """, """
        shortened = [x for x in my_huge_list_of_lists
                        if not check_if_list_is_sublist(x)]
        """
    ),
    "Oluwafemi Sule": TestCase("""
        # https://stackoverflow.com/a/49504440/355230
        def checklst(lst, a, b):
            try:
                a_index = lst.index(a)
            except ValueError:
                return False
            try:
                return a_index > -1 and lst[a_index+1] == b
            except IndexError:
                try:
                    return a_index > -1 and lst[a_index-1] == b
                except IndexError:
                    return False
        """, """
        shortened = [x for x in my_huge_list_of_lists
                        if not checklst(x, a, b)]
        """
    ),
}

# Collect timing results of executing each testcase multiple times.
try:
    results = [
        (label,
         min(timeit.repeat(testcases[label].test,
                           setup=COMMON_SETUP + testcases[label].setup,
                           repeat=R, number=N)),
        ) for label in testcases
    ]
except Exception:
    traceback.print_exc(file=sys.stdout)  # direct output to stdout
    sys.exit(1)

# Display results.
print('Results for {:,d} sublists of length {:,d} with {}% percent of them matching.'
        .format(NUM_SUBLISTS, SUBLIST_LEN, PERCENTAGE))
major, minor, micro = sys.version_info[:3]
print('Fastest to slowest execution speeds using Python {}.{}.{}\n'
      '({:,d} executions, best of {:d} repetitions)'.format(major, minor, micro, N, R))
print()

longest = max(len(result[0]) for result in results)  # length of longest label
ranked = sorted(results, key=lambda t: t[1]) # ascending sort by execution time
fastest = ranked[0][1]
for result in ranked:
    print('{:>{width}} : {:9.6f} secs, rel speed {:5.2f}x, {:6.2f}% slower '
          ''.format(
                result[0], result[1], round(result[1]/fastest, 2),
                round((result[1]/fastest - 1) * 100, 2),
                width=longest))
print()

Output:

Results for 1,000 sublists of length 50 with 50% percent of them matching
Fastest to slowest execution speeds using Python 3.6.4
(10 executions, best of 3 repetitions)

          Oluwafemi Sule :  0.006441 secs, rel speed  1.00x,   0.00% slower
            Paul Becotte :  0.069462 secs, rel speed 10.78x, 978.49% slower
          OP (Nas Banov) :  0.082758 secs, rel speed 12.85x, 1184.92% slower
 Sphinx Solution 2 (str) :  0.152119 secs, rel speed 23.62x, 2261.84% slower
Sphinx Solution 1 (hash) :  0.154562 secs, rel speed 24.00x, 2299.77% slower
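
Because the caveat above says the answers were not checked for correctness, a small cross-check against a brute-force reference could be run before trusting the timings. This is a sketch with hypothetical names (reference, first_mismatch). It assumes that both [a, b] and [b, a] count as matches, which is one reading of the question, and the candidate shown is a copy of the checklst used in the benchmark.

```python
from random import choice, seed


def reference(lst, a, b):
    """Brute force: True if [a, b] or [b, a] occurs contiguously in lst."""
    return any({lst[i], lst[i + 1]} == {a, b} for i in range(len(lst) - 1))


def checklst(lst, a, b):
    # Copy of the function timed in the benchmark above.
    try:
        a_index = lst.index(a)
    except ValueError:
        return False
    try:
        return a_index > -1 and lst[a_index+1] == b
    except IndexError:
        try:
            return a_index > -1 and lst[a_index-1] == b
        except IndexError:
            return False


def first_mismatch(candidate, trials=2000, length=6, alphabet="abc"):
    """Return the first random list on which candidate disagrees with reference."""
    for _ in range(trials):
        lst = [choice(alphabet) for _ in range(length)]
        if candidate(lst, "a", "b") != reference(lst, "a", "b"):
            return lst
    return None  # no disagreement found in the sampled inputs


seed(42)
print(first_mismatch(reference))  # None: the reference agrees with itself
print(first_mismatch(checklst))   # a counterexample list, if one is found
```

For instance, on ['b', 'a', 'c'] the reference reports a match because of [b, a], while checklst returns False, so a fast timing alone does not show that a candidate filters the same sublists.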