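How can I optimize a loop that iterates over a large dictionary whose values have different sizes?

Time: 2019-03-15 10:20:50

Tags: python-2.7 loops dataframe dictionary list-comprehension

This is my first post here; I hope someone can help me.

I have a huge dictionary called "hello" (more than 2 million keys). Its values have different sizes (some are lists, others are just a single value). I have to go through the whole dictionary to extract the following values: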

portfolios = {k:v for k,v in hello.items() if '.some_list' in k}
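# Each marker is tested explicitly: a bare chain such as 'a' or 'b' in k is always truthy.
hello_deltas = {k:v for k,v in hello.items() if any(m in k for m in ('/delta[', '/fast_spot[', '/composite_delta_fx['))}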

hello_before = {k:v for k,v in hello_deltas.items() if '_0_result' in k}
some_list_before = {}

for some_list in portfolios.values():
    for some in some_list:
        a = [i for i in hello_before.keys() if str(some) in i]
        if len(a) != 0:
            some_list_before[some] = a

hello_after = {k:v for k,v in hello_deltas.items() if '_1_result' in k}
some_list_after = {}

for some_list in portfolios.values():
    for some in some_list:
        a = [i for i in hello_after.keys() if str(some) in i]
        if len(a) != 0:
            some_list_after[some] = a

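I have thought about this a lot and already sped it up by turning it into a big combination of dictionary comprehensions. However, it is still not fast enough.

I also tried doing everything in a pandas DataFrame, but because the dictionary values have different sizes, I could not build the DataFrame!

Can someone help me?

1 answer:

Answer 0: (score: 0)

First, you should use a function to avoid the redundancy: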

def before_after(hello, result):
    """result = '_0_result' (before) or '_1_result' (after)"""
    markers = ('/delta[', '/fast_spot[', '/composite_delta_fx[')
    portfolios = {k: v for k, v in hello.items() if '.some_list' in k}
    # Test each marker explicitly: a bare chain such as 'a' or 'b' in k is always truthy.
    hello_deltas = {k: v for k, v in hello.items() if any(m in k for m in markers)}
    hello_before_after = {k: v for k, v in hello_deltas.items() if result in k}
    some_list_before_after = {}
    for some_list in portfolios.values():
        for some in some_list:
            a = [i for i in hello_before_after.keys() if str(some) in i]
            if a:
                some_list_before_after[some] = a
    return some_list_before_after
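
One detail of the filter deserves a closer look. A condition written as '/delta[' or '/fast_spot[' or '/composite_delta_fx[' in k is parsed by Python as '/delta[' or (...), and a non-empty string is truthy, so that condition accepts every key. A minimal sketch of the difference, using made-up keys (the helper names are only for illustration):

```python
MARKERS = ('/delta[', '/fast_spot[', '/composite_delta_fx[')

def chained_or(k):
    # The first operand is a non-empty string, so the whole expression
    # is truthy before the `in k` part is ever looked at.
    return bool('/delta[' or '/fast_spot[' or '/composite_delta_fx[' in k)

def any_marker(k):
    # Tests each marker against the key.
    return any(m in k for m in MARKERS)

for key in ('x/delta[1]_0_result', 'x/unrelated_0_result'):  # hypothetical keys
    print(key, chained_or(key), any_marker(key))
```

With the chained form, hello_deltas is simply a copy of hello, which also makes every later scan longer than it needs to be.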

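Then, before diving into comprehensions, take a look at the code: you are building intermediate dictionaries, but you could use generators instead: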

portfolios_lists = (v for k,v in hello.items() if '.some_list' in k)
for some_list in portfolios_lists:
    for some in some_list:
        ...

Or better:

# The key test comes before the inner loop, so values that are not lists are never iterated.
portfolios_somes = (s for k, v in hello.items() if '.some_list' in k for s in v)
for some in portfolios_somes:
    ...
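
A note on clause order in a nested generator like this one: the clauses are evaluated left to right, so placing the key test before "for s in v" means that values under non-matching keys are never iterated. That matters here because, according to the question, some values are single numbers rather than lists. A minimal sketch with made-up data (both function names are hypothetical):

```python
hello = {
    'book.some_list': [101, 102],   # hypothetical portfolio entry
    'p/delta[101]_0_result': 0.5,   # scalar value, not iterable
}

def somes_filter_first(d):
    # Key test first: scalar values under other keys are skipped.
    return (s for k, v in d.items() if '.some_list' in k for s in v)

def somes_filter_last(d):
    # Key test last: every value is iterated before the test runs.
    return (s for k, v in d.items() for s in v if '.some_list' in k)

print(list(somes_filter_first(hello)))

try:
    list(somes_filter_last(hello))
except TypeError as exc:
    print('filter-last failed:', exc)
```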

You only ever use the keys of hello_deltas:

hello_deltas_before_after = [k for k in hello.keys() if result in k and any(m in k for m in ('/delta[', '/fast_spot[', '/composite_delta_fx['))]

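Note: with a function, you might think that the test for the three substrings ('/delta[', '/fast_spot[', '/composite_delta_fx[') now runs twice for every key: once for before and once for after. That is not the case: result in k (that is, '_0_result' in k or '_1_result' in k) is tested first, and the more expensive test runs only for the keys that pass it.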
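
The short-circuit behaviour of "and" can be checked with a small experiment. This is only a sketch: the keys are made up, and has_marker is a hypothetical helper that records which keys reach the second test.

```python
MARKERS = ('/delta[', '/fast_spot[', '/composite_delta_fx[')
calls = []

def has_marker(k):
    calls.append(k)  # remember every key that reaches the expensive test
    return any(m in k for m in MARKERS)

keys = [
    'a/delta[1]_0_result',
    'a/delta[1]_1_result',
    'b/other_0_result',
    'b/other_1_result',
]

# The cheap test runs first; has_marker is skipped whenever it fails.
selected = [k for k in keys if '_0_result' in k and has_marker(k)]
print(selected)
print(calls)
```

Only the two keys containing '_0_result' appear in calls, so across the before and after runs each key reaches the expensive test at most once.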

The code now looks like this:

def before_after(hello, result):
    markers = ('/delta[', '/fast_spot[', '/composite_delta_fx[')
    # Key test before the inner loop, so non-list values are never iterated.
    portfolios_somes = (s for k, v in hello.items() if '.some_list' in k for s in v)
    # Cheap test first; the marker test only runs for keys that contain result.
    hello_deltas_before_after = [k for k in hello.keys() if result in k and any(m in k for m in markers)]
    some_list_before_after = {}
    for some in portfolios_somes:
        a = [i for i in hello_deltas_before_after if str(some) in i]
        if a:
            some_list_before_after[some] = a
    return some_list_before_after

Now, the dictionary comprehension:

some_list_before_after = {
    some: a 
    for some in portfolios_somes 
    for a in ([i for i in hello_deltas_before_after if str(some) in i], ) 
    if a}

The one-element tuple is a trick to compute a only once. Full code (untested):

def before_after(hello, result):
    markers = ('/delta[', '/fast_spot[', '/composite_delta_fx[')
    # Key test before the inner loop, so non-list values are never iterated.
    portfolios_somes = (s for k, v in hello.items() if '.some_list' in k for s in v)
    hello_deltas_before_after = [k for k in hello.keys() if result in k and any(m in k for m in markers)]
    return {
        some: a
        for some in portfolios_somes
        for a in ([i for i in hello_deltas_before_after if str(some) in i],)
        if a}
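
The one-element tuple idiom is easier to see in isolation. In this sketch, matches is a hypothetical stand-in for the inner list comprehension, and calls records how often it runs:

```python
calls = []

def matches(n):
    calls.append(n)  # count evaluations
    return [m for m in (10, 20, 30) if m % n == 0]

# `for a in (matches(n),)` binds a to the single element of the tuple,
# so matches(n) runs once per n and a can be used as both filter and value.
result = {n: a for n in (3, 5, 7) for a in (matches(n),) if a}

print(result)
print(calls)
```

The tuple form works on Python 2.7 as well. On Python 3.8 and later, an assignment expression could serve the same purpose.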

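This should be faster than the original version, but the problem has a (very roughly) O(n^2) time complexity, so you will not get the result in a second. Try it on a subset of the original data, for example hello_short = dict(itertools.islice(hello.items(), 1000)).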
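
A rough way to try this out is to time the function on a slice of the data. The sketch below is only an illustration under stated assumptions: make_fake_hello builds synthetic data with invented key names, time_on_subset is a hypothetical helper, and before_after follows the untested version above with each marker tested explicitly.

```python
import itertools
import time

MARKERS = ('/delta[', '/fast_spot[', '/composite_delta_fx[')

def before_after(hello, result):
    somes = (s for k, v in hello.items() if '.some_list' in k for s in v)
    keys = [k for k in hello if result in k and any(m in k for m in MARKERS)]
    return {
        some: a
        for some in somes
        for a in ([i for i in keys if str(some) in i],)
        if a}

def make_fake_hello(n_ids):
    # Synthetic stand-in for the real dictionary (key names are invented).
    names = ['pf%05d' % i for i in range(n_ids)]
    hello = {'book.some_list': names}
    for name in names:
        hello[name + '/delta[x]_0_result'] = 0.1
        hello[name + '/delta[x]_1_result'] = 0.2
    return hello

def time_on_subset(hello, n, result='_0_result'):
    # Take the first n items, as suggested with itertools.islice.
    subset = dict(itertools.islice(hello.items(), n))
    start = time.time()
    out = before_after(subset, result)
    return out, time.time() - start

hello = make_fake_hello(200)
for n in (100, 200, 401):
    out, elapsed = time_on_subset(hello, n)
    print(n, len(out), round(elapsed, 4))
```

Increasing n step by step gives a feel for how the run time grows before committing to the full 2 million keys. Note that a slice only gives meaningful results if it contains at least one '.some_list' entry.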