Question

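How can we merge two DataFrames that have a column containing nested dictionaries? I want to update df1 with df2 in the "actions" column. Is there a way to do this with the available methods such as concat, append, and merge?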

import pandas as pd

df1 = pd.DataFrame([
    {
        "id": "87c4b5a0db9f49c49f766436c9582297",
        "actions": {
            "sample": [
                {
                    "tagvalue": "test",
                    "status": "created"
                },
                {
                    "tagvalue": "test2",
                    "status": "created"
                }
            ]
        }
    },
    {
        "id": "87c4b5a0db9f49c49f766436c9582298",
        "actions": {
            "sample": [
                {
                    "tagvalue": "test2",
                    "status": "created"
                }
            ]
        }
    }
])


df2 = pd.DataFrame([
    {
        "id": "87c4b5a0db9f49c49f766436c9582297",
        "actions": {
            "sample": [
                {
                    "tagvalue": "test",
                    "status": "updated"
                }
            ]
        }
    }
])

df1.set_index('id', inplace=True)
df2.set_index('id', inplace=True)


# Need to merge the data based on id
# TODO : Right way to merge to get the following output

finalOutputExpectaion = [
    {
        "id": "87c4b5a0db9f49c49f766436c9582297",
        "actions": {
            "sample": [
                {
                    "tagvalue": "test",
                    "status": "updated"
                },
                {
                    "tagvalue": "test2",
                    "status": "created"
                }
            ]
        }
    },
    {
        "id": "87c4b5a0db9f49c49f766436c9582298",
        "actions": {
            "sample": [
                {
                    "tagvalue": "test2",
                    "status": "created"
                }
            ]
        }
    }
]

Note: finalOutputExpectaion is the updated DataFrame converted to a list of dicts (we will get it by using to_dict(orient='records')). Python version: 3.7, pandas version: 1.1.0

Answer 1

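First join df1 with df2 on id, then zip the actions columns of the left and right DataFrames inside a list comprehension and use a custom merge function to update the dictionaries: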

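def merge(d1, d2):
    # If either side is missing (NaN after the left join), keep df1's value.
    if pd.isna(d1) or pd.isna(d2):
        return d1

    # Tag values that df2 provides an update for.
    tags = set(d['tagvalue'] for d in d2['sample'])
    # Append df1's entries that df2 does not override (this mutates d2 in place).
    d2['sample'] += [d for d in d1['sample'] if d['tagvalue'] not in tags]
    return d2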

m = df1.join(df2, lsuffix='', rsuffix='_r')
df1['actions'] = [merge(*v) for v in zip(m['actions'], m['actions_r'])]

Result:

                                  actions
id                                                                                                                                   
87c4b5a0db9f49c49f766436c9582297  {'sample': [{'tagvalue': 'test', 'status': 'updated'}, {'tagvalue': 'test2', 'status': 'created'}]}
87c4b5a0db9f49c49f766436c9582298                                             {'sample': [{'tagvalue': 'test2', 'status': 'created'}]}
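
For reference, here is a self-contained sketch of the same join-and-zip idea that could be run end to end. It differs from the function above in two ways that are my own assumptions rather than part of the original answer: the helper (hypothetically named merge_actions) builds a new dictionary instead of modifying the one stored in df2, and it keeps the entries in df1's order. It also assumes every actions value is either a dict with a 'sample' list or missing (NaN).

```python
import pandas as pd


def merge_actions(old, new):
    """Return a new actions dict where entries of `new` override `old` by tagvalue."""
    # After a left join, ids absent from df2 hold NaN (a float), not a dict.
    if not isinstance(new, dict):
        return old
    if not isinstance(old, dict):
        return new

    updates = {item["tagvalue"]: item for item in new["sample"]}
    old_tags = {item["tagvalue"] for item in old["sample"]}

    # Keep df1's order, swapping in the updated entry where df2 has one.
    merged = [updates.get(item["tagvalue"], item) for item in old["sample"]]
    # Add entries whose tagvalue exists only in df2.
    merged += [item for tag, item in updates.items() if tag not in old_tags]
    return {"sample": merged}


ID_A = "87c4b5a0db9f49c49f766436c9582297"
ID_B = "87c4b5a0db9f49c49f766436c9582298"

df1 = pd.DataFrame([
    {"id": ID_A, "actions": {"sample": [
        {"tagvalue": "test", "status": "created"},
        {"tagvalue": "test2", "status": "created"},
    ]}},
    {"id": ID_B, "actions": {"sample": [
        {"tagvalue": "test2", "status": "created"},
    ]}},
]).set_index("id")

df2 = pd.DataFrame([
    {"id": ID_A, "actions": {"sample": [
        {"tagvalue": "test", "status": "updated"},
    ]}},
]).set_index("id")

# Left join on the index; df2's column is exposed as 'actions_r'.
m = df1.join(df2, rsuffix="_r")
df1["actions"] = [merge_actions(a, b) for a, b in zip(m["actions"], m["actions_r"])]

# Back to a list of records, as requested in the question.
records = df1.reset_index().to_dict(orient="records")
```

Building a new dictionary costs a little extra allocation, but df2 can then be reused afterwards without carrying the merged entries.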
