Fast way to find the indices of duplicates in a list of > 2,000,000 items

Time: 2019-01-01 20:23:14

Tags: python list duplicates

I have a list in which each item is a combination of two event IDs (this is just a small excerpt of a much larger list of pairs):

  

['10000381 10007121','10000381 10008989','10005169 10008989',   '10008989 10023817','10005169 10043265','10008989 10043265',   '10023817 10043265','10047097 10047137','10047097 10047265',   '10047137 10047265','10000381 10056453','10047265 10056453',   '10000381 10060557','10007121 10060557','10056453 10060557',   '10000381 10066013','10007121 10066013','10008989 10066013',   '10026233 10066013','10056453 10066013','10056453 10070153',   '10060557 10070153','10066013 10070153','10000381 10083798',   '10047265 10083798','10056453 10083798','10066013 10083798',   '10000381 10099969','10056453 10099969','10066013 10099969',   '10070153 10099969','10083798 10099969','10056453 10167029',   '10066013 10167029','10083798 10167029','10099969 10167029',   '10182073 10182085','10182073 10182177','10182085 10182177',   '10000381 10187233','10056453 10187233','10060557 10187233',   '10066013 10187233','10083798 10187233','10099969 10187233',   '10167029 10187233','10007121 10200685','10099969 10200685',   '10066013 10218005','10223905 10224013']

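I need to find every instance of each pair of IDs and collect its indices into a new list. Right now I have a few lines of code that do this for me. However, my list is over 2,000,000 lines long and will only get bigger as I process more data.

Currently, the estimated time to completion is about 2 days.

I really just need a faster way to do this.

I'm using a Jupyter Notebook (on a Mac laptop).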

def compiler(idlist):
    groups = []
    for i in idlist:
        groups.append([index for index, x in enumerate(idlist) if x == i])
    return(groups)

I have also tried:

def compiler(idlist):
    groups = []
    for k,i in enumerate(idlist):
        position = []
        for c,j in enumerate(idlist):
            if i == j:
                position.append(c)
        groups.append(position)
    return(groups)

What I want is something like this:

'10000381 10007121':[0]
'10000381 10008989':[1]
'10005169 10008989':[2,384775,864173,1297105,1321798,1555094,1611064,2078015]
'10008989 10023817':[3,1321800]
'10005169 10043265':[4,29113,864195,1297106,1611081]
[5,864196,2078017]
'10008989 10043265':[6,29114,384777,864198,1611085,1840733,2078019]
'10023817 10043265':[7,86626,384780,504434,792690,864215,1297108,1321801,1489784,1524527,1555096,1595763,1611098,1840734,1841280,1929457,1943701,1983362,2093820,2139917,2168437] and so on.

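Each number in the square brackets is an index of that pair in idlist.

Essentially, I want it to take a pair of ID values (e.g. '10000381 10007121'), go through the list, find every instance of that pair, and record every index in the list where the pair occurs. I need to do this for every item in the list, in much less time.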

3 answers:

Answer 0: (score: 1)

Use a dictionary instead of a list; a dictionary can check whether a key exists in O(1) time on average:

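def compiler(idlist):
    groups = {}
    for idx, val in enumerate(idlist):
        if val in groups:
            # Pair seen before: record this additional index
            groups[val].append(idx)
        else:
            # First occurrence: start a new index list
            groups[val] = [idx]
    return groups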
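
As a quick sanity check, here is a minimal sketch of the same single-pass idea on a tiny made-up list. The function name `index_groups` and the sample data are assumptions for illustration only, and `collections.defaultdict` is used as shorthand for the if/else above:

```python
from collections import defaultdict


def index_groups(idlist):
    """Map each distinct item to the list of indices where it occurs."""
    groups = defaultdict(list)
    for idx, val in enumerate(idlist):
        # One average-case O(1) dictionary lookup per item
        groups[val].append(idx)
    return dict(groups)


sample = ['a b', 'c d', 'a b', 'e f', 'c d', 'a b']  # hypothetical data
result = index_groups(sample)
print(result)  # {'a b': [0, 2, 5], 'c d': [1, 4], 'e f': [3]}
```

The list is scanned once, so the work grows linearly with its length instead of quadratically as in the nested-loop versions from the question.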

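Answer 1: (score: 1)

You can use collections.OrderedDict to reduce the time complexity to O(n). Because it remembers insertion order, the values come out in the order in which the various IDs first appear: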

from collections import OrderedDict

groups = OrderedDict()
for i, v in enumerate(idlist):
    try:
        groups[v].append(i)
    except KeyError:
        groups[v] = [i]

Then list(groups.values()) contains your final result.
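
Note that list(groups.values()) has one entry per distinct pair, whereas the `compiler` function in the question returns one entry per input position. If that original shape is needed, a sketch along these lines could rebuild it. `compiler_fast`, `compiler_slow`, and the sample data are hypothetical names for illustration; a plain `dict` is used because it preserves insertion order on Python 3.7 and later:

```python
def compiler_slow(idlist):
    """The quadratic approach from the question, kept here for comparison."""
    groups = []
    for i in idlist:
        groups.append([index for index, x in enumerate(idlist) if x == i])
    return groups


def compiler_fast(idlist):
    """Single pass to build the groups, then one lookup per input position."""
    groups = {}
    for i, v in enumerate(idlist):
        groups.setdefault(v, []).append(i)
    # Equal items share the same list object here rather than separate copies.
    return [groups[v] for v in idlist]


sample = ['a b', 'c d', 'a b', 'e f', 'c d']  # hypothetical data
print(compiler_fast(sample))  # [[0, 2], [1, 4], [0, 2], [3], [1, 4]]
```

Because repeated items share one list, mutating an entry of the result would affect every position that holds the same pair; copy the lists if that matters.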

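Answer 2: (score: 0)

If you have a lot of data, I'd suggest using PyPy3 instead of the CPython interpreter; you'll get code that runs about 5x to 7x faster.

Here is a time-based benchmark of several implementations, run under CPython and PyPy3 with 10000 iterations:

Code: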

from time import time
from collections import OrderedDict, defaultdict


def timeit(func, iteration=10000):
    def wraps(*args, **kwargs):
        start = time()
        for _ in range(iteration):
            result = func(*args, **kwargs)
        end = time()
        print("func: {name} [{iteration} iterations] took: {elapsed:2.4f} sec".format(
            name=func.__name__,
            iteration=iteration,
            args=args,
            kwargs=kwargs,
            elapsed=(end - start)
        ))
        return result
    return wraps


@timeit
def op_implementation(data):
    groups = []
    for k in data:
        groups.append([index for index, x in enumerate(data) if x == k])
    return groups


@timeit
def ordreddict_implementation(data):
    groups = OrderedDict()
    for k, v in enumerate(data):
        groups.setdefault(v, []).append(k)
    return groups


@timeit
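def defaultdict_implementation(data):
    # Note: this variant splits each pair and indexes the individual IDs
    # in the flattened list, so its keys and indices differ from the
    # pair-based versions.
    groups = defaultdict(list)
    for k, v in enumerate([x for elm in data for x in elm.split()]):
        groups[v].append(k)
    return groups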


@timeit
def defaultdict_implementation_2(data):
    groups = defaultdict(list)
    for k, v in enumerate(map(lambda x: tuple(x.split()), data)):
        groups[v].append(k)
    return groups


@timeit
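def dict_implementation(data):
    # Note: like defaultdict_implementation, this indexes individual IDs
    # in the flattened list rather than whole pairs.
    groups = {}
    for k, v in enumerate([x for elm in data for x in elm.split()]):
        if v in groups:
            groups[v].append(k)
        else:
            groups[v] = [k]
    return groups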



if __name__ == '__main__':
    data = [
        '10000381 10007121', '10000381 10008989', '10005169 10008989', '10008989 10023817', 
        '10005169 10043265', '10008989 10043265', '10023817 10043265', '10047097 10047137', 
        '10047097 10047265', '10047137 10047265', '10000381 10056453', '10047265 10056453', 
        '10000381 10060557', '10007121 10060557', '10056453 10060557', '10000381 10066013', 
        '10007121 10066013', '10008989 10066013', '10026233 10066013', '10056453 10066013', 
        '10056453 10070153', '10060557 10070153', '10066013 10070153', '10000381 10083798', 
        '10047265 10083798', '10056453 10083798', '10066013 10083798', '10000381 10099969', 
        '10056453 10099969', '10066013 10099969', '10070153 10099969', '10083798 10099969', 
        '10056453 10167029', '10066013 10167029', '10083798 10167029', '10099969 10167029', 
        '10182073 10182085', '10182073 10182177', '10182085 10182177', '10000381 10187233', 
        '10056453 10187233', '10060557 10187233', '10066013 10187233', '10083798 10187233', 
        '10099969 10187233', '10167029 10187233', '10007121 10200685', '10099969 10200685', 
        '10066013 10218005', '10223905 10224013'
    ]
    op_implementation(data)
    ordreddict_implementation(data)
    defaultdict_implementation(data)
    defaultdict_implementation_2(data)
    dict_implementation(data)

CPython:

func: op_implementation [10000 iterations] took: 1.3096 sec
func: ordreddict_implementation [10000 iterations] took: 0.1866 sec
func: defaultdict_implementation [10000 iterations] took: 0.3311 sec
func: defaultdict_implementation_2 [10000 iterations] took: 0.3817 sec
func: dict_implementation [10000 iterations] took: 0.3231 sec

Pypy3:

func: op_implementation [10000 iterations] took: 0.2370 sec
func: ordreddict_implementation [10000 iterations] took: 0.0243 sec
func: defaultdict_implementation [10000 iterations] took: 0.1216 sec
func: defaultdict_implementation_2 [10000 iterations] took: 0.1299 sec
func: dict_implementation [10000 iterations] took: 0.1175 sec

PyPy3 with 200000 iterations:

func: op_implementation [200000 iterations] took: 4.6364 sec
func: ordreddict_implementation [200000 iterations] took: 0.3201 sec
func: defaultdict_implementation [200000 iterations] took: 2.2032 sec
func: defaultdict_implementation_2 [200000 iterations] took: 2.4052 sec
func: dict_implementation [200000 iterations] took: 2.2429 sec
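
The timings above repeat the 50-item sample many times, so they do not directly show how an approach scales with list length. A rough sketch for timing the dictionary approach on longer lists might look like the following. The synthetic data, the helper names `make_pairs`, `group_indices`, and `time_once`, and the chosen sizes are all assumptions for illustration, not the asker's real data:

```python
import random
from time import perf_counter


def make_pairs(n, id_count=1000, seed=0):
    """Build n hypothetical 'id id' strings from random values (not real data)."""
    rng = random.Random(seed)
    return ['{} {}'.format(rng.randrange(id_count), rng.randrange(id_count))
            for _ in range(n)]


def group_indices(idlist):
    """Single-pass grouping: map each pair to the indices where it occurs."""
    groups = {}
    for idx, val in enumerate(idlist):
        groups.setdefault(val, []).append(idx)
    return groups


def time_once(func, data):
    """Return (elapsed seconds, result) for one call of func(data)."""
    start = perf_counter()
    result = func(data)
    return perf_counter() - start, result


for n in (10_000, 100_000):
    pairs = make_pairs(n)
    elapsed, groups = time_once(group_indices, pairs)
    print("n={:>7}: {:.4f} sec, {} distinct pairs".format(n, elapsed, len(groups)))
```

Actual timings depend on the machine and interpreter, so none are quoted here; the point is only that the measured time should grow roughly in proportion to n.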