Aggregate JSON by an element of a nested JSON object

Time: 2018-09-04 15:41:55

Tags: python json

I have the following structure:

[
    {
        "Name": "a-1",
        "Tags": [
            {
                "Value": "a", 
                "Key": "Type"
            }
        ], 
        "CreationDate": "2018-02-25T17:33:19.000Z"
    },
    {
        "Name": "a-2",
        "Tags": [
            {
                "Value": "a", 
                "Key": "Type"
            }
        ], 
        "CreationDate": "2018-02-26T17:33:19.000Z"
    },
    {
        "Name": "b-1",
        "Tags": [
            {
                "Value": "b", 
                "Key": "Type"
            }
        ], 
        "CreationDate": "2018-01-21T17:33:19.000Z"
    },
    {
        "Name": "b-2",
        "Tags": [
            {
                "Value": "b", 
                "Key": "Type"
            }
        ], 
        "CreationDate": "2018-01-22T17:33:19.000Z"
    },
    {
        "Name": "c-1",
        "Tags": [
            {
                "Value": "c", 
                "Key": "Type"
            }
        ], 
        "CreationDate": "2018-08-29T17:33:19.000Z"
    }
]

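I want to print the Name of the oldest item for each tag Value, but only when the group has more than one member. (This should be configurable, for example: the x oldest items from groups with at least y members.) Here there are two a's, two b's and one c, so the expected result is: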

 a-1
 b-1

My Python code so far is:

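from itertools import chain, groupby
from operator import itemgetter

# ec2 is presumably a boto3 EC2 client, e.g. boto3.client('ec2')
data = ec2.describe_images(Owners=['11111'])
images = data['Images']
grouper = groupby(map(itemgetter('Tags'), images))
groups = (list(vals) for _, vals in grouper)
res = list(chain.from_iterable(filter(None, groups)))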

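Currently res only contains a list of the Key/Value entries, and nothing is grouped. Can anyone show me how to continue this code to get the expected result?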

1 answer:

Answer 0 (score: 0)

Here is a solution using pandas that takes the JSON string as input (json_string).

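Pandas is often overkill, but I think it fits well here, since you basically want to group by a value and then drop some groups based on a criterion (such as how many members they have).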

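from io import StringIO

import pandas as pd

# load the dataframe from the json string
# (wrapped in StringIO, since newer pandas versions deprecate passing a literal JSON string)
df = pd.read_json(StringIO(json_string))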
df['CreationDate'] = pd.to_datetime(df['CreationDate'])

# create a value column from the nested tags column
df['Value'] = df['Tags'].apply(lambda x: x[0]['Value'])

# groupby value and iterate through groups
groups = df.groupby('Value')
output = []
for name, group in groups:
    # skip groups with fewer than 2 members
    if group.shape[0] < 2:
        continue

    # sort rows by creation date
    group = group.sort_values('CreationDate')

    # save the row with the oldest date (first row after the ascending sort)
    oldest_from_group = group.iloc[0]
    output.append(oldest_from_group['Name'])

print(output)
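
For comparison, here is a minimal standard-library sketch that stays closer to the groupby attempt in the question and makes both thresholds configurable. The question's code groups the Tags lists themselves, drops the group key, and then flattens the groups again, which is why res ends up as a flat list. The names tag_value, oldest_names, n_oldest and min_members are made up for this illustration. It assumes every CreationDate uses the same ISO 8601 format as the sample, so plain string comparison orders them chronologically, and that the group is defined by the tag whose Key is "Type".

```python
from itertools import groupby

# Sample data trimmed to the fields used below (same values as in the question).
images = [
    {"Name": "a-1", "Tags": [{"Value": "a", "Key": "Type"}], "CreationDate": "2018-02-25T17:33:19.000Z"},
    {"Name": "a-2", "Tags": [{"Value": "a", "Key": "Type"}], "CreationDate": "2018-02-26T17:33:19.000Z"},
    {"Name": "b-1", "Tags": [{"Value": "b", "Key": "Type"}], "CreationDate": "2018-01-21T17:33:19.000Z"},
    {"Name": "b-2", "Tags": [{"Value": "b", "Key": "Type"}], "CreationDate": "2018-01-22T17:33:19.000Z"},
    {"Name": "c-1", "Tags": [{"Value": "c", "Key": "Type"}], "CreationDate": "2018-08-29T17:33:19.000Z"},
]


def tag_value(image, key="Type"):
    """Return the Value of the tag with the given Key, or None if it is missing."""
    for tag in image.get("Tags", []):
        if tag.get("Key") == key:
            return tag.get("Value")
    return None


def oldest_names(images, n_oldest=1, min_members=2, key="Type"):
    """Names of the n_oldest images in every group that has at least min_members."""
    # Ignore images that do not carry the grouping tag at all.
    tagged = [img for img in images if tag_value(img, key) is not None]
    # groupby only merges adjacent items, so sort by the group key first,
    # then by date so each group is iterated oldest-first.
    # Assumption: same-format ISO 8601 strings sort chronologically.
    tagged.sort(key=lambda img: (tag_value(img, key), img["CreationDate"]))

    result = []
    for _, members in groupby(tagged, key=lambda img: tag_value(img, key)):
        members = list(members)
        if len(members) >= min_members:
            result.extend(img["Name"] for img in members[:n_oldest])
    return result


print(oldest_names(images))  # ['a-1', 'b-1']
```

Calling oldest_names(images, n_oldest=2) would return the two oldest names from each qualifying group, and min_members=1 would also include single-member groups such as c.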