Using .agg in pandas returns "'list' object has no attribute 'agg'"

Date: 2017-11-17 12:12:52

Tags: python json pandas pandas-groupby

I have data points from multiple IoT devices. I now want to aggregate the data per device and output it in JSON format.

The raw data also arrives as JSON; here is a sample of the format.

[{'EventProcessedUtcTime': '2017-11-14T13:52:56.5578743Z',
'IoTHub': {'ConnectionDeviceGenerationId': '636446013395056929',
'ConnectionDeviceId': 'TestDevice3',
'CorrelationId': 'correlation_0',
'EnqueuedTime': '2017-11-14T13:52:54.0670000Z',
'MessageId': 'message_0',
'StreamId': None},
'PartitionId': 0,
'created': '14:52:53.871053',
'data': {'ack': 'true',
'bandwidth': 125,
'created': '2017-10-10T11:50:44.3865120',
'device': {},
'device_id': 5,
'frame_counter': 14,
'inserted_at': '2017-10-10T11:50:44.3865120',
'location': {'coordinates': [5.491940453992788, 59.763636703175095],
'type': 'Point'},
'mic_pass': 'true',
'organization': {},
'organization_id': -1,
'parsed': {'Battery': 82,
'Pitch': 68383,
'Roll': 65172,
'Status': 1,
'Temperature': 3,
'altitude': 226,
'app': 1,
'gps': {'latitude': 59.76385800791303,
 'longitude': 5.491594108659354,
 'valid': 'true'}},
'parsed_packet': {'ack': 'false',
'adr': 'true',
'adrackreq': 'false',
'dev_addr_hex': '0000009b',
'dir': 'up',
'fcnt': 14,
'fopts_len': 0,
'mac_cmds': [],
'major': 0,
'mic_pass': 'true',
'mtype': 'confirmed_data_up',
'pending': 'false',
'port': 2},
'payload': '0409611354FF6503E7C1FFFF004EFFC4',
'payload_encrypted': 'false',
'port': 2,
'raw_payload': '809B000000800E00022176D281146DA7FEB86A6A7BF6D077F8CE3EC050',
'server_data': {'codr': '4/5',
'datr': 'SF12BW125',
'dev_addr_hex': '0000009b',
'fopts': '',
'freq': 868.5,
'gwrx': [{'ant': 0,
  'gweui': '7276FFFFFE0108AB',
  'lsnr': 7.5,
  'rssi': -103,
  'srv_rcv_time': 1507636243275017,
  'time': '2017-10-10T11:50:43.2755920Z',
  'tmst': 3644695844}],
'mac_cmds': [],
'mic_pass': 'true',
'modu': 'LORA',
'mtype': 'confirmed_data_up',
'raw': 'gJsAAACADgACIXbSgRRtp/64amp79tB3+M4+wFA=',
'size': 29},
'spreading_factor': 12,
 'uid': 'c82b7259-271a-43af-937a-30d703f91461',
 'updated_at': '2017-10-10T11:50:44.9509490'},
 'eventenqueuedutctime': '2017-11-14T13:52:55.5330000Z',
'parsed': {'Battery': 76,
'Pitch': 83356,
'Roll': 84511,
'Status': 10,
'Temperature': 12,
'altitude': 984,
'app': 1,
'gps': {'latitude': 59.763366373379675,
 'longitude': 5.491763030931904,
 'valid': 'true'}},
'slug': 'c82b7259-271a-43af-937a-30d703f91461',
'type': 'up_packets'}]

Sorry for the wall of text, but I wanted to provide the full context. To work with pandas, I flattened the data using

data_json = pd.DataFrame.to_json(data)

The values in the data that I am interested in are:

'parsed': {'Battery': 76,
'Pitch': 83356,
'Roll': 84511,
'Status': 10,
'Temperature': 12,
'altitude': 984,
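Nested records like the one above can be flattened into dot-separated columns with pandas' json_normalize, which is what makes these nested values reachable for groupby/agg later. A minimal sketch with a trimmed, made-up record mirroring the structure:

```python
import pandas as pd

# Trimmed, hypothetical record mirroring the structure shown above
record = [{'data': {'device_id': 5},
           'parsed': {'Battery': 76, 'altitude': 984,
                      'gps': {'latitude': 59.7634, 'longitude': 5.4918}}}]

df = pd.json_normalize(record)  # pd.io.json.json_normalize in older pandas
print(sorted(df.columns))
# ['data.device_id', 'parsed.Battery', 'parsed.altitude',
#  'parsed.gps.latitude', 'parsed.gps.longitude']
```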

I want to aggregate these values across all the devices I have, grouped by individual device. The code I have so far is:

k = (df.groupby(['deviceid','battery','temperature','altitude','roll'].agg('min'), as_index=False)
             .apply(lambda x: x[['EventProcessedUtcTime','deviceid']].to_dict('r'))
             .reset_index()
             .rename(columns={0:'Device Timestamp'})
             .to_json(orient='records'))

If I remove .agg, I get a JSON file containing every event from the nested JSON file, ordered by device ID. The error I get when I try to use the .agg function is:

AttributeError: 'list' object has no attribute 'agg'

The same thing happens if I use a tuple or a dict instead. Does anyone know how to solve this?
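The immediate cause is visible in the snippet itself: the closing parenthesis of groupby is misplaced, so .agg('min') is called on the plain Python list of column names rather than on the GroupBy object. A minimal reproduction with made-up data:

```python
import pandas as pd

df = pd.DataFrame({'deviceid': [1, 1, 2], 'battery': [80, 70, 90]})

# Misplaced parenthesis: .agg runs on the list, not on the GroupBy object
try:
    df.groupby(['deviceid', 'battery'].agg('min'))
except AttributeError as e:
    print(e)  # 'list' object has no attribute 'agg'

# Closing groupby(...) first makes .agg apply to the grouped frame
out = df.groupby('deviceid', as_index=False).agg({'battery': 'min'})
print(out)  # deviceid 1 -> battery 70, deviceid 2 -> battery 90
```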

Thanks for your help.

1 Answer:

Answer 0: (score: 0)

OK, I figured it out. The immediate AttributeError came from a misplaced parenthesis: .agg was being called on the list of column names passed to groupby, not on the GroupBy object. Beyond that, aggregating with several functions adds a MultiIndex to the columns, so the resulting dataframe is no longer "flat". The working solution is below (I also added a distance calculation):
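The MultiIndex behaviour can be seen in isolation: aggregating one column with several functions produces two-level column labels, which the flattening step in the solution joins into single strings. A small sketch with made-up data (string aggregator names are used here; with NumPy functions, the pandas of the time produced labels like 'amin'/'amax', which is where the 'batteryamin'-style names in the output further down come from):

```python
import pandas as pd

df = pd.DataFrame({'deviceid': [1, 1, 2], 'battery': [80, 70, 90]})

# A list of aggregators yields MultiIndex (two-level) column labels
agg = df.groupby('deviceid').agg({'battery': ['min', 'max', 'mean']})
print(agg.columns.tolist())
# [('battery', 'min'), ('battery', 'max'), ('battery', 'mean')]

# Join the two levels into flat column names, then restore deviceid as a column
agg.columns = pd.Index([a + b for a, b in agg.columns])
agg = agg.reset_index()
print(agg.columns.tolist())
# ['deviceid', 'batterymin', 'batterymax', 'batterymean']
```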

import json
import pandas as pd
import numpy as np
import datetime

timestamp = datetime.datetime.now()

data = open('tester.json').read()
data = json.loads(data)
df = pd.io.json.json_normalize(data)
func = (np.min, np.max, np.mean)
npsum = np.sum

def haversine_np(lon1, lat1, lon2, lat2):
    """
    Calculate the great circle distance between two points
    on the earth (specified in decimal degrees).

    All args must be of equal length.
    """
    lon1, lat1, lon2, lat2 = map(np.radians, [lon1, lat1, lon2, lat2])

    dlon = lon2 - lon1
    dlat = lat2 - lat1

    a = np.sin(dlat/2.0)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon/2.0)**2

    c = 2 * np.arcsin(np.sqrt(a))
    meters = 6367 * c * 1000
    return meters

df = df.rename(columns={'data.device_id': 'deviceid',
                        'parsed.Battery': 'battery',
                        'parsed.altitude': 'altitude',
                        'parsed.Pitch': 'pitch',
                        'parsed.Roll': 'roll',
                        'parsed.Temperature': 'temperature',
                        'parsed.gps.latitude': 'latitude',
                        'parsed.gps.longitude': 'longitude'})

df['distance'] = \
haversine_np(df.longitude.shift(), df.latitude.shift(),
             df.loc[1:,'longitude'], df.loc[1:,'latitude'])

jsonoutput = (df.groupby('deviceid').agg({'battery':func,'temperature':func,'altitude':func,'roll':func,'pitch':func,'distance':npsum}))
jsonoutput = jsonoutput.reset_index()

# The statement above is what results in a multi-indexed dataframe.

# Flatten the MultiIndex columns by joining the two label levels
mi = jsonoutput.columns
ind = pd.Index([e[0] + e[1] for e in mi.tolist()])

jsonoutput.columns = ind

outputfile = (jsonoutput.groupby(['batteryamin','batteryamax','batterymean',
                       'temperatureamin','temperatureamax','temperaturemean',
                      'altitudeamin','altitudeamax','altitudemean',
                      'rollamin','rollamax','rollmean',
                      'pitchamin','pitchamax','pitchmean','distancesum'], as_index=False)
             .apply(lambda x: x[['deviceid']].to_dict('r'))
             .reset_index()
             .rename(columns={0:'Device Aggregates for timestamp {}'.format(timestamp)})
             .to_json(orient='records'))


print(json.dumps(json.loads(outputfile), indent=2, sort_keys=True))

[
{
"Device Aggregates for timestamp 2017-11-20 10:35:43.886339": [
  {
    "deviceid": 1
  }
],
"altitudeamax": 989,
"altitudeamin": 18,
"altitudemean": 469.4142857143,
"batteryamax": 94,
"batteryamin": 1,
"batterymean": 42.2142857143,
"distancesum": 3198.4394034326,
"pitchamax": 99484,
"pitchamin": 434,
"pitchmean": 51750.5428571429,
"rollamax": 96710,
"rollamin": 2235,
"rollmean": 45421.0,
"temperatureamax": 25,
"temperatureamin": -15,
"temperaturemean": 4.2714285714
}]
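As a sanity check on the distance figures, the haversine helper can be tested against a known baseline: one degree of latitude is very close to 111 km, and with the 6367 km earth radius used above the function returns about 111.1 km.

```python
import numpy as np

def haversine_np(lon1, lat1, lon2, lat2):
    # Same formula as in the solution: great-circle distance in metres
    lon1, lat1, lon2, lat2 = map(np.radians, [lon1, lat1, lon2, lat2])
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = np.sin(dlat/2.0)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon/2.0)**2
    c = 2 * np.arcsin(np.sqrt(a))
    return 6367 * c * 1000

d = haversine_np(5.0, 59.0, 5.0, 60.0)  # one degree of latitude, same longitude
print(round(d))  # 111125 metres
```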

Hope this helps someone trying to do something similar.