I have a CSV file containing data for a range of dates (2018-02-11 to 2018-03-14).
,date,location,device,provider,cpu,mem,load,drops,id,latency,gw_latency,upload,download,sap_drops,sap_latency,alert_id
0,2018-02-12 11:52:59.342269+00:00,WEO,10.11.100.1,POP,6.0,23.0,11.75,0.0,,,,,,,,
1,2018-02-13 11:53:04.006971+00:00,COO,10.11.100.1,BOP,6.0,23.0,4.58,0.0,,,,,,,,
2,2018-02-14 11:52:59.342269+00:00,,,COO,,,10.45,,,,,,,,,
3,2018-02-15 09:52:59.342269+00:00,,,DOP,,,12.45,,,,,,,,,
4,2018-02-16 04:52:59.342269+00:00,,,RRE,,,9.45,,,,,,,,,
5,2018-02-17 05:52:59.342269+00:00,,,WEQ,,,5.45,,,,,,,,,
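Now I want to compute the mean, min, max and std for each of two consecutive dates, calculate the percentage difference between them, and check it against a threshold. For any column value whose percentage difference is 20% or more, I will write that value to a CSV file.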
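To make the check concrete, here is a minimal sketch of the percentage-difference calculation for one metric on two consecutive days. The numbers and the names `prev`, `nxt` and `exceeded` are hypothetical, chosen only for illustration; they are not taken from the file.

```python
import pandas as pd

# Hypothetical statistics of one metric on two consecutive days
prev = pd.Series({"mean": 6.0, "min": 5.0, "max": 8.0})
nxt = pd.Series({"mean": 7.5, "min": 5.5, "max": 8.0})

threshold = 20  # percent

# Absolute change relative to the previous day's value, in percent
pct_diff = (nxt - prev).abs() / prev.abs() * 100

# Keep only the statistics that reach the threshold
exceeded = pct_diff[pct_diff >= threshold]
print(exceeded)
```

Here only `mean` changes by 20% or more (from 6.0 to 7.5), so it is the only statistic that would be reported.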
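I have done this for the two consecutive dates 2018-02-12 and 2018-02-13: I computed the statistics for each date and the percentage difference between them. This is my code: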
import pandas as pd

df = pd.read_csv("metrics.csv", parse_dates=["date"])
df.set_index("date", inplace=True)
# get the stats for the date 2018-02-12
# (columns are selected with a list; tuple-style selection is not accepted by recent pandas)
df_prev = df.loc['2018-02-12'].resample('D')[['cpu', 'mem', 'load', 'drops', 'latency',
                                              'gw_latency', 'upload', 'download', 'sap_drops',
                                              'sap_latency']].agg(['mean', 'min', 'max', 'std']).fillna(0)
# get the stats for the date 2018-02-13
df_next = df.loc['2018-02-13'].resample('D')[['cpu', 'mem', 'load', 'drops', 'latency',
                                              'gw_latency', 'upload', 'download', 'sap_drops',
                                              'sap_latency']].agg(['mean', 'min', 'max', 'std']).fillna(0)
# calculate the percentage difference
df_diff_pt = abs(df_next - df_prev.values)/(df_prev.values) * 100
df_diff_pt.to_csv("percentage_diff.csv", index=False)
I get the following output:
cpu cpu cpu cpu mem mem mem mem load load load load drops drops drops drops latency latency latency latency gw_latency gw_latency gw_latency gw_latency upload upload upload upload download download download download sap_drops sap_drops sap_drops sap_drops sap_latency sap_latency sap_latency sap_latency
mean min max std mean min max std mean min max std mean min max std mean min max std mean min max std mean min max std mean min max std mean min max std mean min max std
20.25266967 9.375 5.406603424 0.5193349753 0 0.5944589255 20.31451491 3.544110148 2.184989728 190.2821256 0 76.67007734 3.85929503 19.89528796 17.31689683 2.697415388 1.680556319 0 19.34731935 4.084268605 14.86356963 23.19968083 10.35004075 24.58650424 7.780228594 9.740543925 4.47444575 0 0.4689312965 0.2667648736 0 29.78723404 14.15288291
As you can see, cpu mean has crossed the threshold, and so have some of the other metrics.
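Now I want to do this for every pair of consecutive dates ([2018-02-11, 2018-02-12], [2018-02-12, 2018-02-13], ...). Whenever any statistic of any metric exceeds the threshold (20%), I append it to the CSV file and continue.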
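But with my current approach I can only enter two dates by hand, write the result to a CSV file, and then check it for threshold violations. That means I would create one .csv per pair of dates. I want to do this on the fly and end up with a single final .csv containing the expected results. How can I do that?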
One approach is to iterate over the DataFrame, pick the dates, and compare them:
for i in df.index:
    for j in pd.to_timedelta(i, unit='D'):
        df_prev = df.loc[i].resample('D')[['cpu', 'mem', 'load', 'drops', 'latency',
                                           'gw_latency', 'upload', 'download', 'sap_drops',
                                           'sap_latency']].agg(['mean', 'min', 'max', 'std']).fillna(0)
        df_next = df.loc[j].resample('D')[['cpu', 'mem', 'load', 'drops', 'latency',
                                           'gw_latency', 'upload', 'download', 'sap_drops',
                                           'sap_latency']].agg(['mean', 'min', 'max', 'std']).fillna(0)
        df_diff_pt = abs(df_next - df_prev.values) / (df_prev.values) * 100
        break
    # further operations
But I get the following error:
ValueError: Invalid type for timedelta scalar: <class 'pandas._libs.tslib.Timestamp'>
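The error arises because `pd.to_timedelta` expects a duration (a number or a string such as `"1 day"`), while each element of the index is a `Timestamp`, which is a point in time. A minimal sketch of getting the following day from a timestamp instead, using one value from the sample data; the variable names are only illustrative:

```python
import pandas as pd

# One index value from the sample data (timezone-aware)
day = pd.Timestamp("2018-02-12 11:52:59.342269+00:00")

# Add a one-day duration to the timestamp; normalize() resets the time to midnight
next_day = day.normalize() + pd.Timedelta(days=1)

# Date-only strings can be used for partial-string selection with df.loc[...]
prev_label = day.strftime("%Y-%m-%d")
next_label = next_day.strftime("%Y-%m-%d")
print(prev_label, next_label)
```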
Answer 0 (score: 2)
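I think it is better to create a second DataFrame shifted by 1 day with shift, remove the last row because it would be compared with a non-existent next value, and then filter with any against tresh to check the condition on each row's values: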
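# daily statistics for every date in the file
df1 = df.resample('D')[['cpu', 'mem', 'load', 'drops', 'latency',
                        'gw_latency', 'upload', 'download', 'sap_drops',
                        'sap_latency']].agg(['mean', 'min', 'max', 'std']).fillna(0)
tresh = 50  # threshold in percent (the question uses 20)
# shift the index by one day so each date lines up with the previous day's values
df11 = df1.shift(freq='D')
df2 = df1.sub(df11).abs().div(df11, fill_value=1).mul(100).iloc[:-1]
# keep only the dates where at least one value exceeds the threshold
df2 = df2[(df2 > tresh).any(axis=1)]
df2.to_csv("percentage_diff.csv", index=False)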
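If only the individual values that cross the threshold should end up in the file, rather than whole rows, the result can be reshaped into one line per (metric, stat, date). The following is a rough sketch on hypothetical toy data, not the real metrics.csv; the column names `metric`, `stat` and `pct_diff` are my own choice, and it uses a row-wise `shift(1)` on the daily table instead of `shift(freq=...)`.

```python
import pandas as pd

# Hypothetical toy data: two metrics, two samples per day, three days
df = pd.DataFrame(
    {"cpu": [6.0, 8.0, 9.0, 9.5, 9.0, 10.0],
     "load": [10.0, 12.0, 10.0, 12.0, 11.0, 12.0]},
    index=pd.to_datetime(
        ["2018-02-12 01:00", "2018-02-12 13:00",
         "2018-02-13 01:00", "2018-02-13 13:00",
         "2018-02-14 01:00", "2018-02-14 13:00"]
    ),
)
df.index.name = "date"

threshold = 20  # percent

# One row per calendar day, one column per (metric, stat)
daily = df[["cpu", "load"]].resample("D").agg(["mean", "min", "max"])

# Percentage change of each day relative to the previous row (the previous day)
prev = daily.shift(1)
pct = (daily - prev).abs() / prev.abs() * 100

# Blank out values at or below the threshold, then reshape to long format
flagged = (
    pct.where(pct > threshold)
       .unstack()            # Series indexed by (metric, stat, date)
       .dropna()
       .rename_axis(["metric", "stat", "date"])
       .reset_index(name="pct_diff")
)
print(flagged)
```

In this toy data only the `cpu` mean and min on 2018-02-13 change by more than 20%, so `flagged` has two rows. Calling `flagged.to_csv("percentage_diff.csv", index=False)` would then give a single file for all date pairs.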
Your loop-based solution should be:
import numpy as np

dfs = []
for i in np.unique(df.index.strftime('%Y-%m-%d'))[:-1]:
    # label of the following calendar day
    j = (pd.Timestamp(i) + pd.Timedelta(1, unit='d')).strftime('%Y-%m-%d')
    df_prev = df.loc[i].resample('D')[['cpu', 'mem', 'load', 'drops', 'latency',
                                       'gw_latency', 'upload', 'download', 'sap_drops',
                                       'sap_latency']].agg(['mean', 'min', 'max', 'std']).fillna(0)
    df_next = df.loc[j].resample('D')[['cpu', 'mem', 'load', 'drops', 'latency',
                                       'gw_latency', 'upload', 'download', 'sap_drops',
                                       'sap_latency']].agg(['mean', 'min', 'max', 'std']).fillna(0)
    df_diff_pt = abs(df_next - df_prev.values) / (df_prev.values) * 100
    # keep the pair only if at least one value exceeds the threshold
    df_diff_pt = df_diff_pt[(df_diff_pt > tresh).any(axis=1)]
    if not df_diff_pt.empty:
        dfs.append(df_diff_pt)
# write all flagged rows to a single file (pd.concat raises on an empty list)
if dfs:
    pd.concat(dfs).to_csv("percentage_diff2.csv", index=False)
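If you would rather write each result as soon as it is found instead of collecting everything in `dfs`, appending to one file also works. A minimal sketch, assuming a hypothetical helper `append_rows` and made-up rows; it writes to a temporary directory so it can be run safely:

```python
import os
import tempfile

import pandas as pd


def append_rows(frame: pd.DataFrame, path: str) -> None:
    """Append frame to path, writing the header only if the file is new or empty."""
    write_header = not os.path.exists(path) or os.path.getsize(path) == 0
    frame.to_csv(path, mode="a", header=write_header, index=False)


# Hypothetical flagged rows from two different date pairs
path = os.path.join(tempfile.mkdtemp(), "percentage_diff.csv")
append_rows(pd.DataFrame({"metric": ["cpu"], "stat": ["mean"], "pct_diff": [32.1]}), path)
append_rows(pd.DataFrame({"metric": ["load"], "stat": ["max"], "pct_diff": [25.0]}), path)

result = pd.read_csv(path)
print(result)
```

Inside the loop, the `dfs.append(df_diff_pt)` step could call such a helper instead. Collecting and writing once is simpler when the data fits in memory; appending is mainly useful for long-running jobs.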