I'm trying to figure out how to visualize some sensor data. I'm collecting data for several devices at 5-minute intervals, and the data is stored in a JSON structure that looks like this (note that I have no control over the data structure):
[
  {
    "group": { "id": "01234" },
    "measures": {
      "measures": {
        "...device 1 uuid...": {
          "metric.name.here": {
            "mean": [
              ["2019-04-17T14:30:00+00:00", 300, 1],
              ["2019-04-17T14:35:00+00:00", 300, 2],
              ...
            ]
          }
        },
        "...device 2 uuid...": {
          "metric.name.here": {
            "mean": [
              ["2019-04-17T14:30:00+00:00", 300, 0],
              ["2019-04-17T14:35:00+00:00", 300, 1],
              ...
            ]
          }
        }
      }
    }
  }
]
Each tuple of the form ["2019-04-17T14:30:00+00:00", 300, 0] is [timestamp, granularity, value]. The devices are grouped by project id. Within any given group, I want to take the data for multiple devices and add them together. For example, with the sample data above, I want the final series to look like this:
["2019-04-17T14:30:00+00:00", 300, 1],
["2019-04-17T14:35:00+00:00", 300, 3],
The series are not necessarily the same length.
Finally, I want to aggregate these measurements into 1-hour samples.
I can get an individual series like this:
import pandas as pd

# `metric` and `aggregate` are set elsewhere; with the sample data
# above they would be 'metric.name.here' and 'mean'.
with open('data.json') as fd:
    data = pd.read_json(fd)

for i, group in enumerate(data.group):
    project = group['project_id']
    instances = data.measures[i]['measures']
    series_for_group = []
    for instance in instances.keys():
        measures = instances[instance][metric][aggregate]

        # build an index from the timestamps
        index = pd.DatetimeIndex(measure[0] for measure in measures)

        # extract values from the data and link it to the index
        series = pd.Series((measure[2] for measure in measures),
                           index=index)

        series_for_group.append(series)
At the bottom of the outer for loop, I have a list of pandas.core.series.Series objects representing the different sets of measurements associated with the current group. I was hoping I could simply add them together with total = sum(series_for_group), but that produces invalid data.
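As far as I can tell from a toy reproduction (the series s1 and s2 below are made up), the problem is that pandas aligns the series on their indexes, so any timestamp missing from one of the series becomes NaN in the sum:

import pandas as pd

s1 = pd.Series([1, 2], index=pd.DatetimeIndex(
    ['2019-04-17T14:30:00', '2019-04-17T14:35:00']))
s2 = pd.Series([5], index=pd.DatetimeIndex(['2019-04-17T14:30:00']))

# Addition aligns on the index; the timestamp missing from s2
# comes out as NaN instead of being treated as zero.
print(s1 + s2)
# 2019-04-17 14:30:00    6.0
# 2019-04-17 14:35:00    NaN
# dtype: float64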
Am I even reading this data in correctly? This is the first time I've worked with pandas, and I'm not sure whether (a) creating an index and then (b) populating the data is the right procedure here.
How do I successfully add these series together?
How do I resample the data to 1-hour intervals? Looking at this question, the .groupby and .agg methods seem to be of interest, but it isn't clear from that example how to specify the interval size.
Update 1
Maybe I can use concat and groupby? For example:
final = pd.concat(all_series).groupby(level=0).sum()
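A quick check with toy series (s1 through s3 below are made up, and s3 has a different length) suggests this handles the alignment problem:

import pandas as pd

s1 = pd.Series([1, 2], index=pd.DatetimeIndex(
    ['2019-04-17T14:30:00', '2019-04-17T14:35:00']))
s2 = pd.Series([0, 1], index=pd.DatetimeIndex(
    ['2019-04-17T14:30:00', '2019-04-17T14:35:00']))
s3 = pd.Series([4], index=pd.DatetimeIndex(['2019-04-17T14:30:00']))

# Stack every observation into one long series, then group the rows
# that share a timestamp (index level 0) and sum them.
print(pd.concat([s1, s2, s3]).groupby(level=0).sum())
# 2019-04-17 14:30:00    5
# 2019-04-17 14:35:00    3
# dtype: int64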
Answer 0 (score: 1)
I suggested doing the following in the comments:
result = pd.DataFrame({}, columns=['timestamp', 'granularity', 'value',
                                   'project', 'uuid', 'metric', 'agg'])

for i, group in enumerate(data.group):
    project = group['id']
    instances = data.measures[i]['measures']
    for device, measures in instances.items():
        for metric, aggs in measures.items():
            for agg, lst in aggs.items():
                sub_df = pd.DataFrame(lst, columns=['timestamp', 'granularity', 'value'])
                sub_df['project'] = project
                sub_df['uuid'] = device
                sub_df['metric'] = metric
                sub_df['agg'] = agg
                result = pd.concat((result, sub_df), sort=True)

# parse dates:
result['timestamp'] = pd.to_datetime(result['timestamp'])
The resulting data looks like this:
agg granularity metric project timestamp uuid value
0 mean 300 metric.name.here 01234 2019-04-17 14:30:00 ...device 1 uuid... 1
1 mean 300 metric.name.here 01234 2019-04-17 14:35:00 ...device 1 uuid... 2
0 mean 300 metric.name.here 01234 2019-04-17 14:30:00 ...device 2 uuid... 0
1 mean 300 metric.name.here 01234 2019-04-17 14:35:00 ...device 2 uuid... 1
Then you can do the overall aggregation:
result.resample('H', on='timestamp').value.sum()
which gives:
timestamp
2019-04-17 14:00:00 4
Freq: H, Name: value, dtype: int64
Or the per-device aggregation:
result.groupby('uuid').resample('H', on='timestamp').value.sum()
which gives:
uuid timestamp
...device 1 uuid... 2019-04-17 14:00:00 3
...device 2 uuid... 2019-04-17 14:00:00 1
Name: value, dtype: int64
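If you want the per-project totals across devices instead (as described in the question), the same pattern with a groupby on project should work:

result.groupby('project').resample('H', on='timestamp').value.sum()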
Answer 1 (score: 0)
To construct a dataframe (df) from series of different lengths (e.g. s1, s2, s3), you can try:
df=pd.concat([s1,s2,s3], ignore_index=True, axis=1).fillna('')
Once the dataframe is built:
Make sure all dates are stored as timestamp objects:
df['Date'] = pd.to_datetime(df['Date'])
Then add another column to extract the hour from the date column:
df['Hour']=df['Date'].dt.hour
Then group by hour and aggregate the values:
df.groupby('Hour').sum()
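A rough end-to-end sketch of those steps on a toy frame (the Date and Value columns here are hypothetical):

import pandas as pd

df = pd.DataFrame({'Date': ['2019-04-17 14:30', '2019-04-17 14:35',
                            '2019-04-17 15:00'],
                   'Value': [1, 3, 2]})

df['Date'] = pd.to_datetime(df['Date'])  # make sure dates are timestamps
df['Hour'] = df['Date'].dt.hour          # 14, 14, 15

# Note that .dt.hour keeps only the hour of day, so this merges the
# same clock hour across different days.
print(df.groupby('Hour')['Value'].sum())
# Hour
# 14    4
# 15    2
# Name: Value, dtype: int64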
Answer 2 (score: 0)
Starting from the code in the question, I eventually arrived at a working solution. On my system this takes about 6 seconds to process approximately 85MB of input data; by comparison, I cancelled Quang's code after 5 minutes.
I don't know if this is the correct way of processing this data, but it produces apparently correct results. I noticed that building a list of series, as in this solution, and then making a single pd.concat call is more efficient than putting pd.concat inside the loop.
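A minimal illustration of the two patterns (the toy pieces below are made up):

import pandas as pd

pieces = [pd.Series([i], index=pd.DatetimeIndex(['2019-04-17T14:30:00']))
          for i in range(1000)]

# Faster: collect the pieces in a list and concatenate once.
chunks = []
for s in pieces:
    chunks.append(s)
total = pd.concat(chunks).groupby(level=0).sum()

# Slower: re-concatenating inside the loop copies the accumulated
# data on every iteration, which is quadratic overall.
acc = pd.Series(dtype='int64')
for s in pieces:
    acc = pd.concat([acc, s])
total_slow = acc.groupby(level=0).sum()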
#!/usr/bin/python3

import click
import matplotlib.pyplot as plt
import pandas as pd


@click.command()
@click.option('-a', '--aggregate', default='mean')
@click.option('-p', '--projects')
@click.option('-r', '--resample')
@click.option('-o', '--output')
@click.argument('metric')
@click.argument('datafile', type=click.File(mode='rb'))
def plot_metric(aggregate, projects, output, resample, metric, datafile):
    # Read in a list of project id -> project name mappings, then
    # convert it to a dictionary.
    if projects:
        _projects = pd.read_json(projects)
        projects = {_projects.ID[n]: _projects.Name[n].lstrip('_')
                    for n in range(len(_projects))}
    else:
        projects = {}

    data = pd.read_json(datafile)
    df = pd.DataFrame()

    for i, group in enumerate(data.group):
        project = group['project_id']
        project = projects.get(project, project)

        devices = data.measures[i]['measures']
        all_series = []
        for device, measures in devices.items():
            samples = measures[metric][aggregate]
            index = pd.DatetimeIndex(sample[0] for sample in samples)
            series = pd.Series((sample[2] for sample in samples),
                               index=index)
            all_series.append(series)

        # concatenate all the measurements for this project, then
        # group them using the timestamp and sum the values.
        final = pd.concat(all_series).groupby(level=0).sum()

        # resample the data if requested
        if resample:
            final = final.resample(resample).sum()

        # add series to dataframe
        df[project] = final

    fig, ax = plt.subplots()
    df.plot(ax=ax, figsize=(11, 8.5))
    ax.legend(frameon=False, loc='upper right', ncol=3)

    if output:
        plt.savefig(output)
        plt.close()
    else:
        plt.show()


if __name__ == '__main__':
    plot_metric()
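For reference, an invocation might look something like this (assuming the script is saved as plot_metric.py; the metric name matches the sample data, and 1H is just an example resample rule):

./plot_metric.py --resample 1H --output out.png metric.name.here data.json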