Averaging a list of nested dict lists in Python

Date: 2020-06-24 08:13:50

Tags: python python-3.x

I have a list with the following structure:

data = [[
        {
            "id": 713,
            "prediction": 4.8,
            "confidence": [
                {"percentile": "75", "lower": 4.8, "upper": 5.7}
            ],
        },
        {
            "id": 714,
            "prediction": 4.93,
            "confidence": [
                {"percentile": "75", "lower": 4.9, "upper": 5.7}
            ],
        },
    ],
    [
        {
            "id": 713,
            "prediction": 5.8,
            "confidence": [
                {"percentile": "75", "lower": 4.2, "upper": 6.7}
            ],
        },
        {
            "id": 714,
            "prediction": 2.93,
            "confidence": [
                {"percentile": "75", "lower": 1.9, "upper": 3.7}
            ],
        },
    
    ]]

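So here we have a list of two lists, but there could be more. Each inner list holds predictions, each with an id and a prediction interval stored in a further list that contains a single dict.

I need to merge these lists so that there is one entry per id, with every number averaged.

I tried searching but couldn't find an answer that matches this nested structure.

The expected output is: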

merged_data = [
            {
                "id": 713,
                "prediction": 5.3,
                "confidence": [
                    {"percentile": "75", "lower": 4.5, "upper": 6.2}
                ],
            },
            {
                "id": 714,
                "prediction": 3.93,
                "confidence": [
                    {"percentile": "75", "lower": 3.4, "upper": 4.7}
                ],
            },
        ]

4 answers:

Answer 0 (score: 2)

def merge_items(items):
    result = {}
    if len(items):
        result['id'] = items[0]['id']
        result['prediction'] = round(sum([item['prediction'] for item in items]) / len(items), 2)
        result['confidence'] = []
        result['confidence'].append({
            'percentile': items[0]['confidence'][0]['percentile'],
            'lower': round(sum(item['confidence'][0]['lower'] for item in items) / len(items), 2),
            'upper': round(sum(item['confidence'][0]['upper'] for item in items) / len(items), 2),
        })

    return result


result = []
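# dict.fromkeys de-duplicates while keeping first-seen order (a set does not)
ids = list(dict.fromkeys(el['id'] for item in data for el in item))
for entry_id in ids:  # avoid shadowing the built-in `id`
    to_merge = [sub_item for item in data for sub_item in item if sub_item['id'] == entry_id]
    result.append(merge_items(to_merge))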

print(result)
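
The loop above rescans the whole of `data` once per id. As a minimal sketch of the same idea in a single pass (the name `merge_by_id` is made up for illustration, and `statistics.fmean` assumes Python 3.8+), one could group first and average afterwards:

```python
from collections import defaultdict
from statistics import fmean


def merge_by_id(data):
    """Group entries by id in one pass, then average each group."""
    groups = defaultdict(list)
    for inner in data:
        for entry in inner:
            groups[entry["id"]].append(entry)

    merged = []
    for entry_id, entries in groups.items():
        merged.append({
            "id": entry_id,
            "prediction": round(fmean(e["prediction"] for e in entries), 2),
            "confidence": [{
                # Assumption: all entries for an id share the same percentile,
                # so the first one is reused (as in the answer above).
                "percentile": entries[0]["confidence"][0]["percentile"],
                "lower": round(fmean(e["confidence"][0]["lower"] for e in entries), 2),
                "upper": round(fmean(e["confidence"][0]["upper"] for e in entries), 2),
            }],
        })
    return merged
```

Calling `merge_by_id(data)` on the list from the question should give one dict per id, in the order the ids first appear.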

Answer 1 (score: 1)

dicc = {}

for e in data:  # `data` is the nested list from the question
    for d in e:
        if d["id"] not in dicc:
            dicc[d["id"]] = {"prediction": [], "lower": [], "upper": []}

        dicc[d["id"]]["prediction"].append(d["prediction"])
        dicc[d["id"]]["lower"].append(d["confidence"][0]["lower"])
        dicc[d["id"]]["upper"].append(d["confidence"][0]["upper"])
        
        
for k in dicc:
    dicc[k]["average_prediction"] = sum(dicc[k]["prediction"])/len(dicc[k]["prediction"])
    dicc[k]["average_lower"] = sum(dicc[k]["lower"])/len(dicc[k]["lower"])
    dicc[k]["average_upper"] = sum(dicc[k]["upper"])/len(dicc[k]["upper"])

print(dicc)

{713: {'prediction': [4.8, 5.8], 'lower': [4.8, 4.2], 'upper': [5.7, 6.7], 'average_prediction': 5.3, 'average_lower': 4.5, 'average_upper': 6.2},
 714: {'prediction': [4.936893921359024, 2.936893921359024], 'lower': [4.9, 1.9], 'upper': [5.7, 3.7], 'average_prediction': 3.936893921359024, 'average_lower': 3.4000000000000004, 'average_upper': 4.7}}
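
This result is keyed by id and keeps the raw value lists, so it is not yet in the list layout the question asks for. A possible conversion, as a sketch only: `to_merged_list` is a hypothetical helper, and because the percentile is not tracked in `dicc`, it is passed in as an assumed constant.

```python
def to_merged_list(dicc, percentile="75", ndigits=2):
    """Turn {id: {..., 'average_*': value}} into the list layout from the question."""
    return [
        {
            "id": entry_id,
            "prediction": round(values["average_prediction"], ndigits),
            "confidence": [{
                "percentile": percentile,  # assumed; not stored in `dicc`
                "lower": round(values["average_lower"], ndigits),
                "upper": round(values["average_upper"], ndigits),
            }],
        }
        for entry_id, values in dicc.items()
    ]
```

Rounding here also hides float artifacts such as `3.4000000000000004`.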

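Answer 2 (score: 1)

You really have three questions here.

  1. How do you flatten the lists and group by id in preparation for some kind of aggregation? You have plenty of options, but a very classic one is to build a lookup table and append each new value: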
groups = {}

# `data` is the outer list in your nested structure
for d in (d for L in data for d in L):
    L = groups.get(d['id'], [])
    L.append(d)
    groups[d['id']] = L
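  2. How do you aggregate these dicts so that you get the mean of every numeric value? There are many approaches with differing numerical stability. I'll start with a simple one that recursively walks the partial result and a new entry.

Note that this assumes a very consistent object structure (as you've shown). If you sometimes have missing keys, mismatched lengths, or other discrepancies, you'll have to think hard about exactly what should happen when merging these structures; there is no one-size-fits-all solution.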

def walk(avgs, new, n):
    """
    Most of this algorithm is just walking the object structure.
    We keep any keys, lists, etc the same and only average the
    numeric elements.
    """
    if isinstance(avgs, dict):
        return {k:walk(avgs[k], new[k], n) for k in avgs}
    if isinstance(avgs, list):
        return [walk(x, y, n) for x,y in zip(avgs, new)]
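    if isinstance(avgs, float):
        # Only floats are averaged. Ints such as the `id` do not satisfy
        # this check and fall through to `return avgs` unchanged.
        #
        # This is the only place that averaging actually happens.
        # `avgs` is the mean of the first n items, so avgs*n is their
        # total; adding `new` and dividing by n+1 gives the updated mean,
        # at the risk of some accumulated rounding error.
        return (avgs*n+new)/(n+1.)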
    return avgs

def merge(L):
    if not L:
        # never happens using the above grouping code
        return None
    d = L[0]
    for n, new in enumerate(L[1:], 1):
        d = walk(d, new, n)
    return d

averaged = {k:merge(v) for k,v in groups.items()}
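
On the remark about numerical stability: the following is a sketch of two common alternatives for averaging a plain sequence of floats, not something taken from the answer itself. The function names are invented for illustration. The first updates the mean incrementally without forming a large running total; the second uses `math.fsum`, which tracks partial sums to reduce rounding error.

```python
import math


def running_mean(values):
    """Incremental mean: adjust the current mean by the scaled difference."""
    mean = 0.0
    for count, value in enumerate(values, 1):
        mean += (value - mean) / count
    return mean  # 0.0 for an empty input


def fsum_mean(values):
    """Mean via math.fsum; raises ZeroDivisionError on an empty input."""
    values = list(values)
    return math.fsum(values) / len(values)
```

Either function could replace the `(avgs*n+new)/(n+1.)` arithmetic if the values per id were collected into a list first; which one is preferable depends on the magnitude and count of the values.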

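You may only want to average certain keys (e.g. the prediction). You can filter the grouped objects beforehand or filter the result afterwards (filtering beforehand is probably more efficient):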

# before
groups = {
    # any transformation you'd like to apply to the dictionaries
    k:[{s:d[s] for s in ('prediction', 'confidence')} for d in L] for k,L in groups.items()
}

# after
averaged = {
    # basically the same code, except there's only one object per key
    k:{s:d[s] for s in ('prediction', 'confidence')} for k,d in averaged.items()
}

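A note on efficiency: I created a bunch of intermediate lists that aren't strictly necessary. Instead of grouping and then aggregating, you can apply the rolling update directly and save some memory.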

averaged = {}

# `data` is the outer list in your nested structure
for d in (d for L in data for d in L):
    key = d['id']
    d = {s:d[s] for s in ('prediction', 'confidence')}  # any desired transforms

    if key not in averaged:
        averaged[key] = (d, 1)
    else:
        agg, n = averaged[key]  # running (mean, count) pair for this id
        averaged[key] = (walk(agg, d, n), n+1)

averaged = {k:v[0] for k,v in averaged.items()}
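  3. We still haven't formatted the output the way you wanted (we have a dict, and you want a list with the key included in each object). But that's an easy problem to solve: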
def inline_key(d, key):
    # not a pure function, but we're lazy, and the original
    # values are never used
    d['id'] = key
    return d

final_result = [inline_key(d, k) for k,d in averaged.items()]

Answer 3 (score: 1)

Try this:

from copy import deepcopy

input = [[
    {
        "id": 713,
        "prediction": 4.8,
        "confidence": [
            {"percentile": "75", "lower": 4.8, "upper": 5.7}
        ],
    },
    {
        "id": 714,
        "prediction": 4.936893921359024,
        "confidence": [
            {"percentile": "75", "lower": 4.9, "upper": 5.7}
        ],
    },
],
[
    {
        "id": 713,
        "prediction": 5.8,
        "confidence": [
            {"percentile": "75", "lower": 4.2, "upper": 6.7}
        ],
    },
    {
        "id": 714,
        "prediction": 2.936893921359024,
        "confidence": [
            {"percentile": "75", "lower": 1.9, "upper": 3.7}
        ],
    },

]]

final_dict_list = []

processed_id = []

for item in input:
    for dict_ele in item:
        if dict_ele["id"] in processed_id:
            for final_item in final_dict_list:
                if final_item['id'] == dict_ele["id"]:
                    final_item["prediction"] += dict_ele["prediction"]
                    final_item["confidence"][0]["lower"] += dict_ele["confidence"][0]["lower"]
                    final_item["confidence"][0]["upper"] += dict_ele["confidence"][0]["upper"]
        else:
            final_dict = deepcopy(dict_ele)
            final_dict_list.append(final_dict)
            processed_id.append(dict_ele["id"])


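# Assumes every id appears exactly once in each inner list, so the number
# of inner lists equals the number of values summed per id.
number_of_items = len(input)
for item in final_dict_list:
    item["prediction"] /= number_of_items
    item["confidence"][0]["lower"] /= number_of_items
    item["confidence"][0]["upper"] /= number_of_items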

print(final_dict_list)

Output:

[
{'confidence': [{'upper': 6.2, 'lower': 4.5, 'percentile': '75'}], 'id': 713, 'prediction': 5.3},
{'confidence': [{'upper': 4.7, 'lower': 3.4000000000000004, 'percentile': '75'}], 'id': 714, 'prediction': 3.936893921359024}]

One more thing: this could be a lot easier if the data structure were created slightly differently.
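
To illustrate what "slightly differently" might mean, here is a sketch that assumes a hypothetical flat layout, with one row per prediction and `lower`/`upper` stored at the top level. The layout, the sample `rows`, and the name `average_rows` are illustrative assumptions, not part of the question.

```python
from collections import defaultdict
from statistics import fmean

# Hypothetical flat layout: no nested lists, no "confidence" wrapper.
rows = [
    {"id": 713, "prediction": 4.8, "lower": 4.8, "upper": 5.7},
    {"id": 714, "prediction": 4.93, "lower": 4.9, "upper": 5.7},
    {"id": 713, "prediction": 5.8, "lower": 4.2, "upper": 6.7},
    {"id": 714, "prediction": 2.93, "lower": 1.9, "upper": 3.7},
]


def average_rows(rows, fields=("prediction", "lower", "upper")):
    """Collect each field's values per id, then average them."""
    columns = defaultdict(lambda: defaultdict(list))
    for row in rows:
        for field in fields:
            columns[row["id"]][field].append(row[field])
    return {
        entry_id: {field: fmean(vals) for field, vals in by_field.items()}
        for entry_id, by_field in columns.items()
    }


averaged = average_rows(rows)
```

With rows shaped like this, the averaging reduces to a plain group-by, and an id that is missing from some batch is still divided by the number of values it actually has.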