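I have a time series dataset like the one below. I want to split it into 20-second bins, get the minimum and maximum timestamp in each bin, and add a flag to each bin based on whether it contains at least one successful result (success: result = 0; failure: result = 1).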
data = [{"product": "abc", "test_tstamp": 1530693399, "result": 1},
{"product": "abc", "test_tstamp": 1530693405, "result": 0},
{"product": "abc", "test_tstamp": 1530693410, "result": 1},
{"product": "abc", "test_tstamp": 1530693411, "result": 0},
{"product": "abc", "test_tstamp": 1530693415, "result": 0},
{"product": "abc", "test_tstamp": 1530693420, "result": 1},
{"product": "abc", "test_tstamp": 1530693430, "result": 1},
{"product": "abc", "test_tstamp": 1530693431, "result": 1}]
I can use pandas.cut() to split the data into 20-second intervals and get the minimum and maximum timestamp for each bin:
import numpy as np
import pandas as pd
arange = np.arange(1530693398, 1530693440, 20)
data = [{"product": "abc", "test_tstamp": 1530693399, "result": 1},
{"product": "abc", "test_tstamp": 1530693405, "result": 0},
{"product": "abc", "test_tstamp": 1530693410, "result": 1},
{"product": "abc", "test_tstamp": 1530693411, "result": 0},
{"product": "abc", "test_tstamp": 1530693415, "result": 0},
{"product": "abc", "test_tstamp": 1530693420, "result": 1},
{"product": "abc", "test_tstamp": 1530693430, "result": 1},
{"product": "abc", "test_tstamp": 1530693431, "result": 1}]
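df = pd.DataFrame(data)
# Label each row with the 20-second interval its timestamp falls into
df['bins'] = pd.cut(df['test_tstamp'], arange)
# Note: this nested-dict renaming form was deprecated in pandas 0.20 and removed in 1.0
output_1 = df.groupby(["bins"]).agg({'result': np.ma.count, 'test_tstamp': {'mindate': np.min, 'maxdate': np.max}})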
                          test_tstamp             result
                              maxdate     mindate  count
bins
(1530693398, 1530693418]  1530693415  1530693399       5
(1530693418, 1530693438]  1530693431  1530693420       3
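The nested-dict form of agg used above raises an error on pandas 1.0 and later. A minimal sketch of an equivalent that should produce the same two-level columns on current pandas is shown here; the name `edges` and the explicit `observed=False` are choices made for this illustration, not part of the original code.

```python
import numpy as np
import pandas as pd

data = [{"product": "abc", "test_tstamp": 1530693399, "result": 1},
        {"product": "abc", "test_tstamp": 1530693405, "result": 0},
        {"product": "abc", "test_tstamp": 1530693410, "result": 1},
        {"product": "abc", "test_tstamp": 1530693411, "result": 0},
        {"product": "abc", "test_tstamp": 1530693415, "result": 0},
        {"product": "abc", "test_tstamp": 1530693420, "result": 1},
        {"product": "abc", "test_tstamp": 1530693430, "result": 1},
        {"product": "abc", "test_tstamp": 1530693431, "result": 1}]
df = pd.DataFrame(data)

# Bin edges every 20 seconds: 1530693398, 1530693418, 1530693438
edges = np.arange(1530693398, 1530693440, 20)
df["bins"] = pd.cut(df["test_tstamp"], edges)

# A list of aggregations per column yields two-level columns;
# the inner labels are then renamed to match the original names.
output_1 = (
    df.groupby("bins", observed=False)
      .agg({"test_tstamp": ["min", "max"], "result": ["count"]})
      .rename(columns={"min": "mindate", "max": "maxdate"}, level=1)
)
print(output_1)
```

Because the columns are still two-level, `output_1.test_tstamp` continues to select the mindate/maxdate pair.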
I was also able to find the result success and result failed counts using groupby():
output_2 = df.groupby(["bins", "result"]).result.count()
                                 result
bins                     result
(1530693398, 1530693418] 0            3
                         1            2
(1530693418, 1530693438] 1            3
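I am not sure how to combine output_1 and output_2 so that, instead of the result count column above, I have result success and result failed columns, plus a flag column associated with each bin.
Expected output: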
                          test_tstamp             result            flag
                              maxdate     mindate success  failed
bins
(1530693398, 1530693418]  1530693415  1530693399        3       2   True
(1530693418, 1530693438]  1530693431  1530693420        0       3  False
Any pointers would be helpful! Thanks!
Answer 0 (score: 1)
Unstack output_2, then combine the two outputs with pd.concat:
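# Pivot the inner "result" index level into columns; missing combinations become 0
output_2 = (
    output_2
    .unstack(fill_value=0)
    .rename(columns={0 : 'success', 1 : 'failed'}))
# Place both frames side by side under top-level keys, then flag bins with any success
df = (pd.concat([output_1.test_tstamp, output_2], axis=1, keys=['test_tstamp', 'result'])
    .assign(flag=output_2.success.gt(0)))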
                          test_tstamp             result            flag
result                        mindate     maxdate success  failed
bins
(1530693398, 1530693418]  1530693399  1530693415        3       2   True
(1530693418, 1530693438]  1530693420  1530693431        0       3  False
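As an alternative to building two intermediate outputs and concatenating them, the same numbers can probably be obtained in one groupby pass with named aggregation (available from pandas 0.25). This is only a sketch under a few assumptions: the helper columns `is_success` and `is_failed` and the result name `summary` are made up for illustration, and the result has flat columns rather than the two-level layout shown above.

```python
import numpy as np
import pandas as pd

data = [{"product": "abc", "test_tstamp": 1530693399, "result": 1},
        {"product": "abc", "test_tstamp": 1530693405, "result": 0},
        {"product": "abc", "test_tstamp": 1530693410, "result": 1},
        {"product": "abc", "test_tstamp": 1530693411, "result": 0},
        {"product": "abc", "test_tstamp": 1530693415, "result": 0},
        {"product": "abc", "test_tstamp": 1530693420, "result": 1},
        {"product": "abc", "test_tstamp": 1530693430, "result": 1},
        {"product": "abc", "test_tstamp": 1530693431, "result": 1}]
df = pd.DataFrame(data)

# Same 20-second bin edges as in the question
edges = np.arange(1530693398, 1530693440, 20)
df["bins"] = pd.cut(df["test_tstamp"], edges)

# Hypothetical helper columns: 1 where the row is a success (result == 0)
# or a failure (result == 1), so that a plain sum counts them per bin.
df["is_success"] = df["result"].eq(0).astype(int)
df["is_failed"] = df["result"].eq(1).astype(int)

summary = df.groupby("bins", observed=False).agg(
    mindate=("test_tstamp", "min"),
    maxdate=("test_tstamp", "max"),
    success=("is_success", "sum"),
    failed=("is_failed", "sum"),
)

# A bin is flagged when it contains at least one success
summary["flag"] = summary["success"].gt(0)
print(summary)
```

If the two-level column layout matters, the concat approach from the answer keeps it; the flat version is mainly easier to filter, for example with `summary[summary["flag"]]`.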