Resolving items in lists that are themselves groups

Time: 2019-10-17 10:52:06

Tags: python python-3.x algorithm

I have a dictionary like this:

{
    "group-1": ["a.b", "c.d", "group-2"],
    "group-2": ["e.f", "c.d"],
    "group-3": ["group-1", "group-2"],
}

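The dictionary is large, but it fits comfortably in memory (a few thousand items).

I am trying to resolve the groups so that each group ends up with a list of all its members.

So in this case, the "solution" dict would be: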

{
    "group-1": ["a.b", "c.d", "e.f"],
    "group-2": ["e.f", "c.d"],
    "group-3": ["a.b", "c.d", "e.f"],
}

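Because each group

  • has a list of all its members
  • has its nested groups resolved

Group names never contain a ".", but items always do.

I'm not sure how to approach this without being very inefficient.


So far I have something like this, which doesn't work and is probably heading in the wrong direction: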

from pprint import pprint
from collections import defaultdict

def normalize(data):
    group_map = defaultdict(set)

    found_some = True
    while found_some:
        found_some = False
        for k, v in data.items():
            for i in v:
                if "." in i:
                    if i not in group_map[k]:
                        group_map[k].add(i)
                        found_some = True
                else:
                    ....

    return group_map

4 answers:

Answer 0: (score: 4)

You can try a recursive function that keeps resolving elements:

def resolve(d, key):
    for x in d[key]:
        if x in d:
            yield from resolve(d, x)
        else:
            yield x

Or as a one-liner:

def resolve(d, key):
    return (y for x in d[key] for y in (resolve(d, x) if x in d else [x]))

Applied to your example:

d = {
    "group-1": ["a.b", "c.d", "group-2"],
    "group-2": ["e.f", "c.d"],
    "group-3": ["group-1", "group-2"],
}
r = {k: sorted(set(resolve(d, k))) for k in d}
# {'group-1': ['a.b', 'c.d', 'e.f'],
#  'group-2': ['c.d', 'e.f'],
#  'group-3': ['a.b', 'c.d', 'e.f']}

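Note that if your dict is large, you should probably add the @functools.lru_cache(None) decorator to memoize the function. In that case you will have to remove the unhashable d parameter (and use d from the enclosing scope instead). Depending on the "depth" of the references, you may also have to increase the recursion limit. Of course, this won't work if there are circular dependencies (but I think the same is true of the other approaches).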
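
A minimal sketch of the memoized variant described above, under a few assumptions: the helper name make_resolver is hypothetical, and the inner function returns a frozenset rather than a generator, because a cached generator would already be exhausted the second time it is handed out. It is still recursive, so the caveats about the recursion limit and circular dependencies apply unchanged.

```python
import functools


def make_resolver(d):
    """Build a memoized resolver bound to `d` (hypothetical helper name)."""

    @functools.lru_cache(maxsize=None)
    def resolve(key):
        members = set()
        for x in d[key]:
            if x in d:
                # x is a group: reuse its (cached) resolved members
                members |= resolve(x)
            else:
                # x is a plain item
                members.add(x)
        # Return an immutable, reusable value so the cache stays valid
        return frozenset(members)

    return resolve


d = {
    "group-1": ["a.b", "c.d", "group-2"],
    "group-2": ["e.f", "c.d"],
    "group-3": ["group-1", "group-2"],
}
resolve = make_resolver(d)
r = {k: sorted(resolve(k)) for k in d}
print(r)
```

Because the cache is tied to one resolver, a fresh resolver should be built whenever the dict changes.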

Answer 1: (score: 2)

How about something like this?

def normalize(mapping):
    result = {}
    for k, v in mapping.items():
        new_v = []
        for x in v:
            if x in mapping:
                for y in mapping[x]:
                    if y not in v and y not in new_v:
                        new_v.append(y)
            else:
                new_v.append(x)
        result[k] = new_v
    return result


src = {
    "group-1": ["a.b", "c.d", "group-2"],
    "group-2": ["e.f", "c.d"],
    "group-3": ["group-1", "group-2"],
}
print(src)
# {'group-1': ['a.b', 'c.d', 'group-2'], 'group-2': ['e.f', 'c.d'], 'group-3': ['group-1', 'group-2']}

tgt = {
    "group-1": ["a.b", "c.d", "e.f"],
    "group-2": ["e.f", "c.d"],
    "group-3": ["a.b", "c.d", "e.f"],
}
print(tgt)
# {'group-1': ['a.b', 'c.d', 'e.f'], 'group-2': ['e.f', 'c.d'], 'group-3': ['a.b', 'c.d', 'e.f']}


print(normalize(src))
# {'group-1': ['a.b', 'c.d', 'e.f'], 'group-2': ['e.f', 'c.d'], 'group-3': ['a.b', 'c.d', 'e.f']}
print(tgt == normalize(src))
# True

Note that this may (will?) break for nesting deeper than one level and for circular dependencies.
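
To make the nesting caveat concrete, here is a small check that repeats the normalize() function from above and runs it on input nested two levels deep. The group and item names are made up for illustration.

```python
def normalize(mapping):
    # Same single-level expansion as above, repeated so this runs on its own
    result = {}
    for k, v in mapping.items():
        new_v = []
        for x in v:
            if x in mapping:
                for y in mapping[x]:
                    if y not in v and y not in new_v:
                        new_v.append(y)
            else:
                new_v.append(x)
        result[k] = new_v
    return result


# Hypothetical input: group-a -> group-b -> group-c (two levels of nesting)
nested = {
    "group-a": ["group-b"],
    "group-b": ["group-c", "x.y"],
    "group-c": ["z.w"],
}
result = normalize(nested)
print(result["group-a"])
# ['group-c', 'x.y']  (group-c is left unresolved, and z.w is missing)
```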


EDIT

A more general, order-preserving approach that overcomes the nesting-depth limitation but is slower (at least for the input provided):

def resolve(mapping, key):
    for k in mapping[key]:
        if k in mapping:
            yield from resolve(mapping, k)
        else:
            yield k


def normalize_r(mapping):
    result = {}
    for k, v in mapping.items():
        new_v = []
        for item in resolve(mapping, k):
            if item not in new_v:
                new_v.append(item)
        result[k] = new_v
    return result


src = {
    "group-1": ["a.b", "c.d", "group-2"],
    "group-2": ["e.f", "c.d"],
    "group-3": ["group-1", "group-2"],
}
print(src)
# {'group-1': ['a.b', 'c.d', 'group-2'], 'group-2': ['e.f', 'c.d'], 'group-3': ['group-1', 'group-2']}

tgt = {
    "group-1": ["a.b", "c.d", "e.f"],
    "group-2": ["e.f", "c.d"],
    "group-3": ["a.b", "c.d", "e.f"],
}
print(tgt)
# {'group-1': ['a.b', 'c.d', 'e.f'], 'group-2': ['e.f', 'c.d'], 'group-3': ['a.b', 'c.d', 'e.f']}


print(normalize_r(src))
# {'group-1': ['a.b', 'c.d', 'e.f'], 'group-2': ['e.f', 'c.d'], 'group-3': ['a.b', 'c.d', 'e.f']}
print(tgt == normalize_r(src))
# True


%timeit normalize_r(src)
# 100000 loops, best of 3: 4.3 µs per loop

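The resolve() function is from @tobias_k. Compared with the approach proposed there, this one preserves the order of appearance. Note that normalize_r() cannot be written as a one-liner, because new_v itself is needed to decide whether to extend it, which ensures that each item is included only once. Using set() for this comes at the cost of losing the strict ordering.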
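
One possible alternative, assuming Python 3.7 or later where dicts preserve insertion order: dict.fromkeys() drops duplicates while keeping first-seen order, and avoids the linear "item not in new_v" scan. The name normalize_ordered is hypothetical, and this is a sketch rather than a drop-in replacement.

```python
def resolve(mapping, key):
    # Same recursive generator as above
    for k in mapping[key]:
        if k in mapping:
            yield from resolve(mapping, k)
        else:
            yield k


def normalize_ordered(mapping):
    # dict.fromkeys keeps the first occurrence of each item, in order
    return {k: list(dict.fromkeys(resolve(mapping, k))) for k in mapping}


src = {
    "group-1": ["a.b", "c.d", "group-2"],
    "group-2": ["e.f", "c.d"],
    "group-3": ["group-1", "group-2"],
}
ordered = normalize_ordered(src)
print(ordered)
# {'group-1': ['a.b', 'c.d', 'e.f'], 'group-2': ['e.f', 'c.d'], 'group-3': ['a.b', 'c.d', 'e.f']}
```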

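Answer 2: (score: 1)

The following approach may be more efficient. However, because it uses set, the order is lost. If order matters, ordered-set implementations do exist.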

d = {"group-1": ["a.b", "c.d", "group-2"],
     "group-2": ["e.f", "c.d"],
     "group-3": ["group-1", "group-2"]}

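for key, value in d.items():
    value_copy = list(value)

    # Expand nested groups one level deep. Deeper nesting is only handled
    # when the referenced group was already resolved earlier in this loop.
    for v in value:
        if v in d:  # v names another group: swap it for that group's members
            value_copy.extend(d[v])
            value_copy.remove(v)

    d[key] = list(set(value_copy))  # set() removes duplicates but loses order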

I encourage you to test the different approaches with %timeit to determine which is best. On this example, this approach takes:

4.87 µs ± 201 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

But since this example does not look much like large data, I think you should test it on a larger dataset.
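
As a rough sketch of such a test outside IPython, the standard timeit module can time any of the approaches on synthetic data. The generator make_chain, the chain shape, and the sizes are assumptions chosen for illustration; real data may nest very differently. The recursive resolve() from the first answer is used here as the function under test.

```python
import timeit


def make_chain(n_groups, items_per_group=3):
    """Synthetic data: each group holds a few items plus the next group."""
    data = {}
    for i in range(n_groups):
        members = [f"item{i}.{j}" for j in range(items_per_group)]
        if i + 1 < n_groups:
            members.append(f"group-{i + 1}")
        data[f"group-{i}"] = members
    return data


def resolve(d, key):
    # Recursive resolver from the first answer
    for x in d[key]:
        if x in d:
            yield from resolve(d, x)
        else:
            yield x


def resolve_all(d):
    return {k: sorted(set(resolve(d, k))) for k in d}


data = make_chain(50)
seconds = timeit.timeit(lambda: resolve_all(data), number=3)
print(f"{seconds / 3:.6f} s per run for {len(data)} groups")
```

Swapping resolve_all for another candidate function allows a like-for-like comparison on the same data.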

Answer 3: (score: 1)

If you want to preserve order, check whether each element is already present before adding it.
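
A minimal sketch of that idea, assuming the same dict layout as in the question. The function name resolve_ordered and the explicit stack are illustrative choices, not necessarily what this answer had in mind. Each item is appended only if it has not been seen yet, and each group is expanded at most once, which also keeps the loop from running forever on circular references.

```python
def resolve_ordered(mapping, key):
    """Members of `key` in first-seen order, without duplicates."""
    members = []          # result, in order of first appearance
    seen_items = set()    # items already added to `members`
    seen_groups = {key}   # groups already expanded (guards against cycles)
    stack = [iter(mapping[key])]
    while stack:
        for name in stack[-1]:
            if name in mapping:
                if name not in seen_groups:
                    # Descend into the nested group, then resume this one
                    seen_groups.add(name)
                    stack.append(iter(mapping[name]))
                    break
            elif name not in seen_items:
                # Check first, then add: this is what preserves order
                seen_items.add(name)
                members.append(name)
        else:
            # Current group exhausted: go back to its parent
            stack.pop()
    return members


d = {
    "group-1": ["a.b", "c.d", "group-2"],
    "group-2": ["e.f", "c.d"],
    "group-3": ["group-1", "group-2"],
}
resolved = {k: resolve_ordered(d, k) for k in d}
print(resolved)
# {'group-1': ['a.b', 'c.d', 'e.f'], 'group-2': ['e.f', 'c.d'], 'group-3': ['a.b', 'c.d', 'e.f']}
```

Using an explicit stack instead of recursion also sidesteps the recursion limit for deeply nested groups.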
