Finding the minimum, maximum, and average of two fields in a dictionary

Time: 2019-01-28 07:39:55

Tags: python google-cloud-dataflow apache-beam

After processing my data, I have a batch of rows in the following format:

(u'378491520468_sale', {'price': 2100000, 'built': 3815})

(u'378491119.1537520468_sale', {'price': 2100000, 'built': 3815})

(u'1306084076.1535728358_rent', {'price': 1400, 'built': 1109})

(u'1303342766.1548320090_sale', {'price': 550, 'built': 1200})

(u'1890530682.1515660872_sale', {'price': 130000, 'built': 759})

(u'8212134.1548317851_rent', {'price': 2900, 'built': 1220})

(u'1170655463.1513653914_sale', {'price': 430000, 'built': 1142})

(u'58676746.1548308550_sale', {'price': 1700000, 'built': 3000})

(u'1162578480.1474216313_sale', {'price': 10000000, 'built': 3})

(u'1860145003.1546594155_rent', {'price': 4200, 'built': 839})

(u'1640943061.1489124089_sale', {'price': 710000, 'built': 1600})

(u'1008351255.1547539066_rent', {'price': 15000, 'built': 8400})

(u'903442891.1547795833_sale', {'price': 148000, 'built': 786})

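Here the first element of each tuple is the unique ID.

I understand the basic CombineFn class, which can group (key, value) pairs and compute the minimum, maximum, and average within a fixed window. With a dictionary as the value, however, I need some guidance on computing them in the following format: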

("the_unique_id", {
            "price":{
                "min": 0,
                "max": 0,
                "average": 0
            },
            "built": {
                "min": 0,
                "max": 0,
                "average": 0
            }
        }), ...

1 answer:

Answer 0 (score: 0)

If you can get your data into the tabular form below, here is one way to compute the aggregates:

import pandas as pd

# Sample rows from the question, laid out as one column per field.
data = {'ID': [u'378491520468_sale', u'378491119.1537520468_sale', u'1306084076.1535728358_rent'],
        'price': [2100000, 2100000, 1400],
        'built': [3815, 3815, 1109]}

df = pd.DataFrame(data)

# Statistics to compute for each field.
aggregates = {
    'price': ['min', 'max', 'mean'],
    'built': ['min', 'max', 'mean'],
}

# Group by ID; the result has two-level columns: (field, statistic).
stats = df.groupby('ID').agg(aggregates)

# Convert each aggregated row back into the (id, nested dict) shape
# requested in the question.
res = []

for unique_id, row in stats.iterrows():
    res.append((unique_id,
                {'price': {'min': row['price']['min'],
                           'max': row['price']['max'],
                           'average': row['price']['mean']},
                 'built': {'min': row['built']['min'],
                           'max': row['built']['max'],
                           'average': row['built']['mean']}}))

print(res)
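
Since the question asks about CombineFn specifically, here is a minimal sketch of the accumulator logic without pandas. It is plain Python that only mirrors the four method names of Beam's CombineFn (create_accumulator, add_input, merge_accumulators, extract_output); the class name DictStatsCombiner and the FIELDS tuple are made up for illustration, and it is run locally on two rows from the question rather than in a real pipeline.

```python
class DictStatsCombiner(object):
    """Hypothetical combiner; mirrors the method names of Beam's CombineFn."""

    # Assumption: every value dict contains exactly these numeric fields.
    FIELDS = ('price', 'built')

    def create_accumulator(self):
        # Per field: [min so far, max so far, running sum, count].
        return {f: [float('inf'), float('-inf'), 0, 0] for f in self.FIELDS}

    def add_input(self, acc, element):
        # Fold one value dict into the accumulator.
        for f in self.FIELDS:
            lo, hi, total, count = acc[f]
            v = element[f]
            acc[f] = [min(lo, v), max(hi, v), total + v, count + 1]
        return acc

    def merge_accumulators(self, accumulators):
        # Combine partial accumulators produced by different workers.
        merged = self.create_accumulator()
        for acc in accumulators:
            for f in self.FIELDS:
                lo, hi, total, count = merged[f]
                a_lo, a_hi, a_total, a_count = acc[f]
                merged[f] = [min(lo, a_lo), max(hi, a_hi),
                             total + a_total, count + a_count]
        return merged

    def extract_output(self, acc):
        # Turn [min, max, sum, count] into the nested dict from the question.
        out = {}
        for f in self.FIELDS:
            lo, hi, total, count = acc[f]
            if count == 0:
                out[f] = {'min': None, 'max': None, 'average': None}
            else:
                out[f] = {'min': lo, 'max': hi,
                          'average': total / float(count)}
        return out


# Local check using the price/built values of two rows from the question.
combiner = DictStatsCombiner()
acc = combiner.create_accumulator()
for value in [{'price': 1400, 'built': 1109}, {'price': 2900, 'built': 1220}]:
    acc = combiner.add_input(acc, value)
result = combiner.extract_output(acc)
print(result)
```

In an actual pipeline you would presumably subclass apache_beam.CombineFn instead of object and apply it with beam.CombinePerKey(DictStatsCombiner()) after windowing; I have not run that part here. Note that if every key really is unique, each group holds a single row and min, max, and average will all be equal.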