I have a dictionary like this:
{
"group-1": ["a.b", "c.d", "group-2"],
"group-2": ["e.f", "c.d"],
"group-3": ["group-1", "group-2"],
}
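The dictionary is large, but it fits comfortably in memory (a few thousand items).
I am trying to resolve the groups so that each group ends up with a list of all its members.
So in this case, the "solution" dict would be: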
{
"group-1": ["a.b", "c.d", "e.f"],
"group-2": ["e.f", "c.d"],
"group-3": ["a.b", "c.d", "e.f"],
}
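Groups and items can be told apart because group names never contain a ".", whereas items always do. I am not sure how to solve this without it being very inefficient.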
What I have so far is something like this; it does not work and is probably heading in the wrong direction:
from pprint import pprint
from collections import defaultdict

def normalize(data):
    group_map = defaultdict(set)
    found_some = True
    while found_some:
        found_some = False
        for k, v in data.items():
            for i in v:
                if "." in i:
                    if i not in group_map[k]:
                        group_map[k].add(i)
                        found_some = True
                else:
                    ....
    return group_map
Answer 0 (score: 4)
You could use a recursive function that keeps resolving elements:
def resolve(d, key):
    for x in d[key]:
        if x in d:
            yield from resolve(d, x)
        else:
            yield x
Or as a one-liner:
def resolve(d, key):
    return (y for x in d[key] for y in (resolve(d, x) if x in d else [x]))
Applied to your example:
d = {
"group-1": ["a.b", "c.d", "group-2"],
"group-2": ["e.f", "c.d"],
"group-3": ["group-1", "group-2"],
}
r = {k: sorted(set(resolve(d, k))) for k in d}
# {'group-1': ['a.b', 'c.d', 'e.f'],
# 'group-2': ['c.d', 'e.f'],
# 'group-3': ['a.b', 'c.d', 'e.f']}
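Note that if your dictionary is large, you should probably add a @functools.lru_cache(None) decorator to memoize the function. In that case you will have to remove the unhashable d parameter (and use d from the enclosing scope instead). A memoized version also has to return a concrete collection such as a tuple, because a cached generator would already be exhausted on the second lookup. Depending on the "depth" of the references, you may also have to increase the recursion limit. And of course this will not work if there are circular dependencies (but I think that holds for the other approaches as well).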
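As a rough illustration of the memoization idea, here is a minimal sketch. It assumes acyclic data and that d is not modified after the resolver is built. The names make_resolver and resolve_cached are hypothetical and not part of the answer above. The function returns a frozenset rather than a generator so that the cached value can be reused.

```python
import functools

def make_resolver(d):
    """Return a memoized resolver bound to d (hypothetical helper)."""

    @functools.lru_cache(maxsize=None)
    def resolve_cached(key):
        members = set()
        for x in d[key]:
            if x in d:
                # x is a group: reuse its (possibly cached) members
                members |= resolve_cached(x)
            else:
                members.add(x)
        # Return an immutable collection, not a generator, so the
        # cached result stays usable on later lookups.
        return frozenset(members)

    return resolve_cached

d = {
    "group-1": ["a.b", "c.d", "group-2"],
    "group-2": ["e.f", "c.d"],
    "group-3": ["group-1", "group-2"],
}
resolve_cached = make_resolver(d)
r = {k: sorted(resolve_cached(k)) for k in d}
print(r)
# {'group-1': ['a.b', 'c.d', 'e.f'], 'group-2': ['c.d', 'e.f'], 'group-3': ['a.b', 'c.d', 'e.f']}
```

Binding d in a closure avoids both the unhashable argument and a module-level global; each group is then expanded only once no matter how many other groups reference it.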
Answer 1 (score: 2)
How about something like this?
def normalize(mapping):
    result = {}
    for k, v in mapping.items():
        new_v = []
        for x in v:
            if x in mapping:
                for y in mapping[x]:
                    if y not in v and y not in new_v:
                        new_v.append(y)
            else:
                new_v.append(x)
        result[k] = new_v
    return result
src = {
"group-1": ["a.b", "c.d", "group-2"],
"group-2": ["e.f", "c.d"],
"group-3": ["group-1", "group-2"],
}
print(src)
# {'group-1': ['a.b', 'c.d', 'group-2'], 'group-2': ['e.f', 'c.d'], 'group-3': ['group-1', 'group-2']}
tgt = {
"group-1": ["a.b", "c.d", "e.f"],
"group-2": ["e.f", "c.d"],
"group-3": ["a.b", "c.d", "e.f"],
}
print(tgt)
# {'group-1': ['a.b', 'c.d', 'e.f'], 'group-2': ['e.f', 'c.d'], 'group-3': ['a.b', 'c.d', 'e.f']}
print(normalize(src))
# {'group-1': ['a.b', 'c.d', 'e.f'], 'group-2': ['e.f', 'c.d'], 'group-3': ['a.b', 'c.d', 'e.f']}
print(tgt == normalize(src))
# True
Note that this only expands one level of nesting, so it will likely break for deeper nesting and for circular dependencies.
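To make the depth limitation concrete, the sketch below runs the same one-level normalize() on a hypothetical three-level input (nested is made up for illustration) and then shows one possible workaround: re-applying the expansion until nothing changes. normalize_fully is a hypothetical helper, and the sketch assumes there are no circular references.

```python
def normalize(mapping):
    # Same one-level expansion as in the answer above.
    result = {}
    for k, v in mapping.items():
        new_v = []
        for x in v:
            if x in mapping:
                for y in mapping[x]:
                    if y not in v and y not in new_v:
                        new_v.append(y)
            else:
                new_v.append(x)
        result[k] = new_v
    return result

def normalize_fully(mapping):
    """Re-apply the one-level expansion until it stops changing (hypothetical helper)."""
    current = mapping
    # Assumption: acyclic data. Each pass then removes at least one level
    # of nesting, so len(mapping) passes are enough.
    for _ in range(len(mapping)):
        expanded = normalize(current)
        if expanded == current:
            break
        current = expanded
    return current

nested = {"g1": ["a.b", "g2"], "g2": ["c.d", "g3"], "g3": ["e.f"]}
print(normalize(nested)["g1"])
# ['a.b', 'c.d', 'g3']   <- "g3" is left unresolved
print(normalize_fully(nested)["g1"])
# ['a.b', 'c.d', 'e.f']
```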
A more general, order-preserving approach overcomes the depth limitation, but is slower (at least for the input provided):
def resolve(mapping, key):
    for k in mapping[key]:
        if k in mapping:
            yield from resolve(mapping, k)
        else:
            yield k

def normalize_r(mapping):
    result = {}
    for k, v in mapping.items():
        new_v = []
        for item in resolve(mapping, k):
            if item not in new_v:
                new_v.append(item)
        result[k] = new_v
    return result
src = {
"group-1": ["a.b", "c.d", "group-2"],
"group-2": ["e.f", "c.d"],
"group-3": ["group-1", "group-2"],
}
print(src)
# {'group-1': ['a.b', 'c.d', 'group-2'], 'group-2': ['e.f', 'c.d'], 'group-3': ['group-1', 'group-2']}
tgt = {
"group-1": ["a.b", "c.d", "e.f"],
"group-2": ["e.f", "c.d"],
"group-3": ["a.b", "c.d", "e.f"],
}
print(tgt)
# {'group-1': ['a.b', 'c.d', 'e.f'], 'group-2': ['e.f', 'c.d'], 'group-3': ['a.b', 'c.d', 'e.f']}
print(normalize_r(src))
# {'group-1': ['a.b', 'c.d', 'e.f'], 'group-2': ['e.f', 'c.d'], 'group-3': ['a.b', 'c.d', 'e.f']}
print(tgt == normalize_r(src))
# True
%timeit normalize_r(src)
# 100000 loops, best of 3: 4.3 µs per loop
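The resolve() function is from @tobias_k.
Unlike the other approaches proposed here, this preserves the order of appearance. Note that normalize_r() cannot be turned into a one-liner, because new_v itself is needed to decide whether to extend it, which is what guarantees that each item is included only once.
Using set() for this comes at the cost of losing the ordering.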
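If the explicit new_v list is the only obstacle, one possible alternative is to de-duplicate with dict.fromkeys(), which keeps the first occurrence of each key. This is a sketch, not part of the original answer: it relies on dicts preserving insertion order (guaranteed from Python 3.7), and normalize_ordered is a hypothetical name.

```python
def resolve(mapping, key):
    for k in mapping[key]:
        if k in mapping:
            yield from resolve(mapping, k)
        else:
            yield k

def normalize_ordered(mapping):
    # dict.fromkeys drops later duplicates while keeping first-seen order
    return {k: list(dict.fromkeys(resolve(mapping, k))) for k in mapping}

src = {
    "group-1": ["a.b", "c.d", "group-2"],
    "group-2": ["e.f", "c.d"],
    "group-3": ["group-1", "group-2"],
}
print(normalize_ordered(src))
# {'group-1': ['a.b', 'c.d', 'e.f'], 'group-2': ['e.f', 'c.d'], 'group-3': ['a.b', 'c.d', 'e.f']}
```

The membership test in a dict is a hash lookup, so this also avoids the linear `item not in new_v` scan for long member lists.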
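Answer 2 (score: 1)
The following approach may be more efficient. However, because it uses set, the order is lost. If order matters, ordered-set implementations do exist.
d = {"group-1": ["a.b", "c.d", "group-2"],
     "group-2": ["e.f", "c.d"],
     "group-3": ["group-1", "group-2"]}

for key, value in d.items():
    value_copy = list(value)
    for v in value:
        try:
            # if v is a group name, replace it with that group's members
            value_copy.extend(d[v])
            value_copy.remove(v)
        except KeyError:
            # v is a plain item, not a group: keep it as it is
            pass
    d[key] = list(set(value_copy))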
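I encourage you to test the different approaches with %timeit to determine which is best. On this example, this approach takes:
4.87 µs ± 201 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
But since this example is not very representative of large data, I think you should test it on a larger data set.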
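One way to run such a test outside IPython is sketched below. The generator make_groups and its parameters are invented for illustration and say nothing about the shape of the real data. Every group refers only to lower-numbered groups, so the synthetic input is acyclic. The timed function is a set-based variant of the recursive resolve() from the other answers; swap in whichever approach you want to compare.

```python
import random
import timeit

def make_groups(n_groups=2000, n_items=5000, size=10, seed=0):
    """Build an acyclic test dictionary (hypothetical shape, for timing only)."""
    rng = random.Random(seed)
    data = {}
    for i in range(n_groups):
        # item names contain a ".", group names do not
        members = [f"item.{rng.randrange(n_items)}" for _ in range(size)]
        if i > 0:
            # reference one earlier group, which rules out cycles
            members.append(f"group-{rng.randrange(i)}")
        data[f"group-{i}"] = members
    return data

def resolve(d, key):
    for x in d[key]:
        if x in d:
            yield from resolve(d, x)
        else:
            yield x

def normalize_set(d):
    return {k: set(resolve(d, k)) for k in d}

big = make_groups()
runs = 3
seconds = timeit.timeit(lambda: normalize_set(big), number=runs) / runs
print(f"{seconds:.4f} s per run for {len(big)} groups")
```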
Answer 3 (score: 1)
If you want to preserve order, check that each element is not already present before adding it:
from plumbum.cmd import split, seq, rev, dd
import plumbum
import unittest.mock as mock

# HACK: disable quoting of every argument in shquote
# otherwise we'd get --filter="dd 'of=$FILE'"
# which would create a file named $FILE anyway
with mock.patch('plumbum.commands.base.shquote', lambda x: x):
    my_filter = str(rev | dd['of=$FILE'])

funnychars_new = plumbum.commands.base._funnychars.replace('$', '')
# HACK: don't treat dollar sign as an escapeable character
with mock.patch('plumbum.commands.base._funnychars', funnychars_new):
    cmd = seq['1', '10'] | split['--filter', my_filter]
    print(cmd)
    cmd & plumbum.FG
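The order-preserving check this answer describes (only add an element if it is not already present) could look roughly like the following. This is a sketch under assumptions, not the answer's own code: members_in_order is a hypothetical name, and it uses an explicit stack instead of recursion, so deep nesting does not hit the recursion limit and a circular reference is skipped rather than followed forever.

```python
def members_in_order(d, key):
    """List the items reachable from key, in order of first appearance."""
    result = []
    seen = set()      # items already added and groups already expanded
    stack = [key]
    while stack:
        name = stack.pop()
        if name in seen:
            continue  # already present: do not add or expand it again
        seen.add(name)
        if name in d:
            # push members reversed so they are popped in their original order
            stack.extend(reversed(d[name]))
        else:
            result.append(name)
    return result

d = {
    "group-1": ["a.b", "c.d", "group-2"],
    "group-2": ["e.f", "c.d"],
    "group-3": ["group-1", "group-2"],
}
print({k: members_in_order(d, k) for k in d})
# {'group-1': ['a.b', 'c.d', 'e.f'], 'group-2': ['e.f', 'c.d'], 'group-3': ['a.b', 'c.d', 'e.f']}
```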