Question

I think it is best to start with the input and the output:

list_of_items = [
    {"A": "abc", "B": "dre", "C": "ccp"},
    {"A": "qwe", "B": "dre", "C": "ccp"},
    {"A": "abc", "B": "dre", "C": "ccp"},
]

result = {'A-abc-->B': {'dre': 2},
          'A-abc-->C': {'ccp': 2},
          'A-qwe-->B': {'dre': 1},
          'A-qwe-->C': {'ccp': 1},
          'B-dre-->A': {'abc': 2, 'qwe': 1},
          'B-dre-->C': {'ccp': 3},
          'C-ccp-->A': {'abc': 2, 'qwe': 1},
          'C-ccp-->B': {'dre': 3}}

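My input is a stream of items. Each item is basically a dictionary of keys and values. My goal is, for every specific key and value, to count the values of all the other keys that arrive with it, so that I can see which value occurs most often.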

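So if, out of 100 items in which key "A" has the value "1", 90 items have the value "2" for key "B" and 10 items have the value "1111" for key "B", I want to see a list that shows me those numbers: B=2 occurs 90 times and B=1111 occurs 10 times.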
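As a minimal sketch of the example above (the `items` list is made-up data that only mirrors the numbers in the description, and `b_counts` is an illustrative name), a single `Counter` already gives the tally for one key and value:

```python
from collections import Counter

# Hypothetical data matching the description: 100 items that all have A == "1".
items = [{"A": "1", "B": "2"}] * 90 + [{"A": "1", "B": "1111"}] * 10

# Count the values of key "B" among the items where key "A" has the value "1".
b_counts = Counter(item["B"] for item in items if item["A"] == "1")

print(b_counts)  # Counter({'2': 90, '1111': 10})
```

The result I am after is this kind of tally for every key and value, against every other key.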

My code works. However, my real-life scenario contains around 100000 different values for about 20 keys. Also, my end goal is to run this as a job on Flink.

So I am looking for help with Counter / the Python stream API.

from pprint import pprint

# Turn every item into a list of (key, value) tuples.
all_tuple_list_items = []
for dict_item in list_of_items:
    list_of_tuples = list(dict_item.items())
    all_tuple_list_items.append(list_of_tuples)

result_dict = {}
for list_of_tuples in all_tuple_list_items:
    for id_tuple in list_of_tuples:
        # All the other (key, value) tuples of the same item.
        all_other_tuples = list_of_tuples.copy()
        all_other_tuples.remove(id_tuple)

        for corresponding_other_tu in all_other_tuples:
            # Connection id, for example "A-abc-->B".
            ids_connection_id = id_tuple[0] + "-" + str(id_tuple[1]) + "-->" + corresponding_other_tu[0]
            corresponding_id = str(corresponding_other_tu[1])

            # Increment the count, creating the entries the first time they are seen.
            if result_dict.get(ids_connection_id) is None:
                result_dict[ids_connection_id] = {corresponding_id: 1}
            else:
                if result_dict[ids_connection_id].get(corresponding_id) is None:
                    result_dict[ids_connection_id][corresponding_id] = 1
                else:
                    result_dict[ids_connection_id][corresponding_id] += 1

pprint(result_dict)

Answer 1

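You can use the function permutations() to generate all ordered pairs of the items in each dictionary, and Counter to count them. Finally, you can use defaultdict() to group the items from the Counter: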

from collections import Counter, defaultdict
from itertools import permutations
from pprint import pprint

list_of_items = [
    {"A": "abc", "B": "dre", "C": "ccp"},
    {"A": "qwe", "B": "dre", "C": "ccp"},
    {"A": "abc", "B": "dre", "C": "ccp"},
]

# Count every ordered pair of (key, value) entries within each item.
c = Counter(p for i in list_of_items
              for p in permutations(i.items(), 2))

# Group the counts by connection id, for example "A-abc-->B".
d = defaultdict(dict)
for ((i, j), (k, l)), num in c.items():
    d[f'{i}-{j}-->{k}'][l] = num

pprint(d)

Output:

defaultdict(<class 'dict'>,
            {'A-abc-->B': {'dre': 2},
             'A-abc-->C': {'ccp': 2},
             'A-qwe-->B': {'dre': 1},
             'A-qwe-->C': {'ccp': 1},
             'B-dre-->A': {'abc': 2, 'qwe': 1},
             'B-dre-->C': {'ccp': 3},
             'C-ccp-->A': {'abc': 2, 'qwe': 1},
             'C-ccp-->B': {'dre': 3}})
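
If only the most frequent value per connection is needed, one possible variation is to group straight into `Counter` objects and ask each one for its top entry. This is only a sketch that reuses the sample input from the question; the names `grouped` and `top` are illustrative, not part of the code above:

```python
from collections import Counter, defaultdict
from itertools import permutations

list_of_items = [
    {"A": "abc", "B": "dre", "C": "ccp"},
    {"A": "qwe", "B": "dre", "C": "ccp"},
    {"A": "abc", "B": "dre", "C": "ccp"},
]

# One Counter per connection id instead of a plain dict.
grouped = defaultdict(Counter)
for item in list_of_items:
    for (k1, v1), (k2, v2) in permutations(item.items(), 2):
        grouped[f"{k1}-{v1}-->{k2}"][v2] += 1

# most_common(1) returns a one-element list holding the (value, count)
# pair with the highest count.
top = {conn: counts.most_common(1)[0] for conn, counts in grouped.items()}

print(top["B-dre-->A"])  # ('abc', 2)
```

A `Counter` is a `dict` subclass, so `grouped` holds the same counts as `d` and can be printed the same way.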

Answer 2

I got it working. However, I would still like a more efficient approach that uses Counter and streams. Is that possible?
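
One way this might look, as a rough sketch rather than a Flink job: keep a single running structure and fold each item into it as it arrives, so the full list of items never has to be held in memory. The function names `update_counts` and `item_stream` are hypothetical, and the generator only stands in for the real stream source; the wiring into Flink is not shown.

```python
from collections import Counter, defaultdict
from itertools import permutations


def update_counts(counts, item):
    """Fold one incoming item into the running counts, in place."""
    for (k1, v1), (k2, v2) in permutations(item.items(), 2):
        # A tuple key avoids building the "A-abc-->B" string for every pair.
        counts[(k1, v1, k2)][v2] += 1


def item_stream():
    """Hypothetical stand-in for the real stream source."""
    yield {"A": "abc", "B": "dre", "C": "ccp"}
    yield {"A": "qwe", "B": "dre", "C": "ccp"}
    yield {"A": "abc", "B": "dre", "C": "ccp"}


counts = defaultdict(Counter)
for item in item_stream():
    update_counts(counts, item)

print(counts[("B", "dre", "A")])  # Counter({'abc': 2, 'qwe': 1})
```

Each item with n keys still produces n * (n - 1) updates (380 for 20 keys), so the gain here is in memory and in dropping the intermediate lists, not in the number of pairs counted.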
