Grouping a dictionary into the smallest possible number of key-value pairs

Time: 2018-06-15 22:02:20

Tags: python dictionary grouping

I pull data from an API and build a dictionary similar to this one.

my_dict = {'server_name1':
               ['utah', 'california', 'idaho', 'texas'],
           'server_name2':
               ['new york'],
           'server_name3':
               ['idaho', 'new york', 'texas'],
           'server_name4':
               ['florida'],
           'server_name5':
               ['utah', 'california']}

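I am trying to create maintenance groups so that we only have to notify a customer about maintenance once. That requires knowing every server the customer touches, which in turn requires knowing every server used by the other customers on those servers. So I want to combine these groups as much as possible, and I do that by grouping together keys that share at least one value with another key. My dictionary would go from the one above to: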

new_dict = {'server_name1, server_name2, server_name3, server_name5':
                ['utah', 'california', 'idaho', 'texas', 'new york'],
            'server_name4':
                ['florida']}

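I have some code that does this, but it needs several grouping passes, which is not great unless you know exactly how many passes it takes to reach the smallest possible number of groups.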

Here is my working code.

new_dict = {}
# Single pass over every ordered pair of distinct servers
for server, states in my_dict.items():
    for server2, states2 in my_dict.items():
        if len(states) > 0 and len(states2) > 0:
            if server != server2:
                # Merge only when every state of server2 is also on server
                if all(x in states for x in states2):
                    servers = server + ", " + server2
                    merged_states = states + list(states2)
                    group = {servers: merged_states}
                    new_dict.update(group)

2 answers:

Answer 0 (score: 2)

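This does not look like a "grouping" operation per se. Rather, it looks like a graph closure or clustering task. I suggest changing it to while logic: keep looping over the existing data set for as long as you can merge two clusters. You merge any two clusters whose intersection is non-empty.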

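One way to handle the inner iteration is to use a for loop over all the dict entries and merge whenever a match is found. The outer loop repeats this until no match is found.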
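A minimal sketch of this first approach, under some assumptions: the function name `merge_until_stable`, the `(servers, states)` pair used to represent a cluster, and the `sample` data are illustrative choices, not part of the question's code.

```python
def merge_until_stable(server_to_states):
    """Merge clusters that share a state until a full pass finds no match."""
    # Each cluster is a (set of servers, set of states) pair.
    clusters = [({server}, set(states))
                for server, states in server_to_states.items()]

    merged = True
    while merged:                                  # outer loop: repeat until no match
        merged = False
        for i in range(len(clusters)):             # inner loops: scan every pair
            for j in range(i + 1, len(clusters)):
                if clusters[i][1] & clusters[j][1]:    # non-empty intersection
                    servers_j, states_j = clusters.pop(j)
                    clusters[i][0].update(servers_j)
                    clusters[i][1].update(states_j)
                    merged = True
                    break                          # the list changed, so rescan
            if merged:
                break
    return clusters


# Hypothetical sample data modelled on the question's dictionary.
sample = {
    'server_name1': ['utah', 'california', 'idaho', 'texas'],
    'server_name2': ['new york'],
    'server_name3': ['idaho', 'new york', 'texas'],
    'server_name4': ['florida'],
    'server_name5': ['utah', 'california'],
}

clusters = merge_until_stable(sample)
```

Restarting the scan after every merge keeps the logic simple at the cost of repeated comparisons, which is probably acceptable for a small number of servers.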

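Another approach is to focus only on the first entry: find another entry with a non-empty overlap and merge the two. Once you can no longer merge the first entry with any other entry, you "retire" it: remove it from the "working" dictionary and add it to the "results" dictionary. Repeat until the "working" dictionary is empty.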

Does this get you moving?

working = dict(my_dict)  # shallow copy, so popping entries leaves my_dict intact
results = {}

while len(working) > 0:
    # Remove the first entry from the working dictionary; hold locally
    next_key = list(working.keys())[0]
    next_val = set(working.pop(next_key))

    # Now, go through the remaining entries in "working"
    # Each time you find one with an element in common with "next_val",
    #   pop that from "working" and merge into "next_val" and "next_key"

    # When there are no more such merges to make ...
    results[next_key] = next_val

    # ... and return to the top of the outer loop
    #   to get the next independent entry.
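The merge step that the comments above leave open could be filled in along the following lines. This is only one possible sketch: the function name `retire_clusters`, the comma-joined result keys (borrowed from the format requested in the question), and the `sample` data are assumptions made for illustration.

```python
def retire_clusters(server_to_states):
    """Grow the first entry by merging overlapping entries, then retire it."""
    # Copy into sets so the caller's dictionary is left untouched.
    working = {server: set(states)
               for server, states in server_to_states.items()}
    results = {}

    while working:
        # Take the first remaining entry out of the working dictionary.
        next_key = next(iter(working))
        next_val = working.pop(next_key)
        servers = [next_key]

        # Rescan after every successful pass: the merged set has grown,
        # so entries skipped earlier may overlap with it now.
        merged = True
        while merged:
            merged = False
            for other_key in list(working):        # snapshot; we pop while looping
                if next_val & working[other_key]:
                    next_val |= working.pop(other_key)
                    servers.append(other_key)
                    merged = True

        # No more merges possible: retire the entry.
        results[", ".join(sorted(servers))] = next_val
    return results


# Hypothetical sample data modelled on the question's dictionary.
sample = {
    'server_name1': ['utah', 'california', 'idaho', 'texas'],
    'server_name2': ['new york'],
    'server_name3': ['idaho', 'new york', 'texas'],
    'server_name4': ['florida'],
    'server_name5': ['utah', 'california'],
}

results = retire_clusters(sample)
```

With this data, `server_name2` is only picked up on the second scan, after `server_name3` has brought 'new york' into the merged set, which is why a single pass over the remaining entries is not enough.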

Answer 1 (score: 2)

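The abstraction behind what you want to achieve is finding the connected components in a graph of servers and states. We can implement a solution that converts your dict into a graph, finds the connected components, and converts the result back into the desired format.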

First, let's define helper functions that let us treat my_dict as a graph.

def get_cluster(x_to_y, y_to_x, x):
    # Traverse the graph to recover all servers connected to x
    # (popping from the end of the list makes this a depth-first search)
    stack = [x]
    cluster = set()
    while stack:
        current = stack.pop()
        if current not in cluster:
            # Push every server that shares at least one state with current
            stack.extend({i for y in x_to_y[current] for i in y_to_x[y]})
            cluster.add(current)
    return cluster


def get_connected_parts(x_to_y):
    # We were provided a server -> state representation of the graph
    # For efficiency, we will generate a state -> server dict of edges
    y_to_x = {}

    for server, states in x_to_y.items():
        for state in states:
            if state in y_to_x:
                y_to_x[state].add(server)
            else:
                y_to_x[state] = {server}

    # We now iterate over our servers and recover their clusters
    seen = set()
    clusters = []

    for x in x_to_y:
        if x not in seen:
            cluster = get_cluster(x_to_y, y_to_x, x)
            seen |= cluster
            clusters.append(cluster)

    return clusters

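With most of the work done, the function get_connected_parts can be used to retrieve the sets of connected servers. All that remains is formatting the data. But first, let's look at its output.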

my_dict = {
 'server_name1': ['utah', 'california', 'idaho', 'texas'],
 'server_name2': ['new york'],
 'server_name3': ['idaho', 'new york', 'texas'],
 'server_name4': ['florida'],
 'server_name5': ['utah', 'california']}

groups = get_connected_parts(my_dict)

print(groups)

Output:

[{'server_name2', 'server_name1', 'server_name3', 'server_name5'}, {'server_name4'}]

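Note that keys that look like 'server1, server2, server3, server5' do not make much sense, since they would require the user to know which servers are connected before looking up a key. Instead, we will output a new_dict whose keys are the servers and whose values are all the states they are connected to, directly or indirectly.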

new_dict = {}

for group in groups:
    states = list({state for server in group for state in my_dict[server]})
    for server in group:
        new_dict[server] = states

We can use pprint to check that the output is correct.

from pprint import pprint

pprint(new_dict)

Output:

{'server_name1': ['california', 'texas', 'idaho', 'utah', 'new york'],
 'server_name2': ['california', 'texas', 'idaho', 'utah', 'new york'],
 'server_name3': ['california', 'texas', 'idaho', 'utah', 'new york'],
 'server_name4': ['florida'],
 'server_name5': ['california', 'texas', 'idaho', 'utah', 'new york']}
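If the comma-joined keys from the question are wanted anyway, the groups can be converted with a few lines. The sketch below is an illustration only: the helper name `join_groups` and the `example_dict` and `example_groups` literals (copied from the data and output shown above) are assumptions, and sorting is added purely to make the result deterministic.

```python
def join_groups(groups, server_to_states):
    """Build {'serverA, serverB': [states, ...]} from sets of connected servers."""
    joined = {}
    for group in groups:
        # Sets are unordered, so sort to get a stable key and value.
        key = ", ".join(sorted(group))
        states = sorted({state
                         for server in group
                         for state in server_to_states[server]})
        joined[key] = states
    return joined


# Literals copied from the example above, so the block runs on its own.
example_dict = {
    'server_name1': ['utah', 'california', 'idaho', 'texas'],
    'server_name2': ['new york'],
    'server_name3': ['idaho', 'new york', 'texas'],
    'server_name4': ['florida'],
    'server_name5': ['utah', 'california'],
}
example_groups = [
    {'server_name2', 'server_name1', 'server_name3', 'server_name5'},
    {'server_name4'},
]

joined = join_groups(example_groups, example_dict)
```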