Question

I have the following structure:

[
    {
        "Name": "a-1",
        "Tags": [
            {
                "Value": "a", 
                "Key": "Type"
            }
        ], 
        "CreationDate": "2018-02-25T17:33:19.000Z"
    },
    {
        "Name": "a-2",
        "Tags": [
            {
                "Value": "a", 
                "Key": "Type"
            }
        ], 
        "CreationDate": "2018-02-26T17:33:19.000Z"
    },
    {
        "Name": "b-1",
        "Tags": [
            {
                "Value": "b", 
                "Key": "Type"
            }
        ], 
        "CreationDate": "2018-01-21T17:33:19.000Z"
    },
    {
        "Name": "b-2",
        "Tags": [
            {
                "Value": "b", 
                "Key": "Type"
            }
        ], 
        "CreationDate": "2018-01-22T17:33:19.000Z"
    },
    {
        "Name": "c-1",
        "Tags": [
            {
                "Value": "c", 
                "Key": "Type"
            }
        ], 
        "CreationDate": "2018-08-29T17:33:19.000Z"
    }
]

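For each tag Value, I want to print the oldest Name whenever the group has more than one member. This should be configurable, for example: the x oldest items from groups with more than y members. In this case there are two a's, two b's and one c, so the expected result would be: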

 a-1
 b-1

Here is my Python code so far:

data = ec2.describe_images(Owners=['11111'])
images = data['Images']
grouper = groupby(map(itemgetter('Tags'), images))
groups = (list(vals) for _, vals in grouper)
res = list(chain.from_iterable(filter(None, groups)))

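Currently res only contains a list of Key and Value pairs, and nothing is grouped. Can anyone show me how to continue from this code to reach the expected result?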

Answer 1

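Here is a solution using pandas that takes a JSON string (json_string) as input.

Pandas is often overkill, but here I think it is a good fit, because you essentially want to group by Value and then drop some of the groups based on a criterion such as how many members they have.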

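from io import StringIO

import pandas as pd

# load the dataframe from the json string
# (wrapped in StringIO because recent pandas versions deprecate passing a literal JSON string)
df = pd.read_json(StringIO(json_string))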
df['CreationDate'] = pd.to_datetime(df['CreationDate'])

# create a value column from the nested tags column
df['Value'] = df['Tags'].apply(lambda x: x[0]['Value'])

# groupby value and iterate through groups
groups = df.groupby('Value')
output = []
for name, group in groups:
    # skip groups with fewer than 2 members
    if group.shape[0] < 2:
        continue

    # sort rows by creation date, oldest first
    group = group.sort_values('CreationDate')

    # save the name of the row with the oldest date
    oldest_from_group = group.iloc[0]
    output.append(oldest_from_group['Name'])

print(output)
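
The question also asks for the thresholds to be configurable, which the loop above hard-codes. As a rough sketch of one way to do that without pandas, the function below groups with a plain dictionary. The function name oldest_names and its parameters count, min_members and tag_key are made up for illustration and are not part of the original answer. It also assumes every CreationDate uses the same ISO 8601 format and time zone, so the strings can be compared directly.

```python
from collections import defaultdict


def oldest_names(images, count=1, min_members=2, tag_key="Type"):
    """Return the `count` oldest Names from every tag group that has
    at least `min_members` members (hypothetical helper)."""
    groups = defaultdict(list)
    for image in images:
        # Look the tag up by its Key instead of assuming it is the first tag.
        tags = {tag["Key"]: tag["Value"] for tag in image.get("Tags", [])}
        if tag_key in tags:
            groups[tags[tag_key]].append(image)

    result = []
    for value in sorted(groups):
        members = groups[value]
        if len(members) < min_members:
            continue  # group is too small, skip it
        # Assumption: same-format ISO 8601 strings sort chronologically.
        members.sort(key=lambda image: image["CreationDate"])
        result.extend(image["Name"] for image in members[:count])
    return result


# Sample data mirroring the structure in the question.
images = [
    {"Name": "a-2", "Tags": [{"Key": "Type", "Value": "a"}], "CreationDate": "2018-02-26T17:33:19.000Z"},
    {"Name": "a-1", "Tags": [{"Key": "Type", "Value": "a"}], "CreationDate": "2018-02-25T17:33:19.000Z"},
    {"Name": "b-1", "Tags": [{"Key": "Type", "Value": "b"}], "CreationDate": "2018-01-21T17:33:19.000Z"},
    {"Name": "b-2", "Tags": [{"Key": "Type", "Value": "b"}], "CreationDate": "2018-01-22T17:33:19.000Z"},
    {"Name": "c-1", "Tags": [{"Key": "Type", "Value": "c"}], "CreationDate": "2018-08-29T17:33:19.000Z"},
]

print(oldest_names(images))  # ['a-1', 'b-1']
```

With the boto3 response from the question, the call would presumably be oldest_names(data['Images']), and something like count=2, min_members=3 would return the two oldest names from every group that has at least three members.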
