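Calculate the percentage difference between the statistics of two consecutive dates in a DataFrame, based on a condition

Time: 2018-04-18 07:36:59

Tags: python-3.x pandas

I have a CSV file that contains data for a range of dates (2018-02-11 to 2018-03-14).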

,date,location,device,provider,cpu,mem,load,drops,id,latency,gw_latency,upload,download,sap_drops,sap_latency,alert_id
0,2018-02-12 11:52:59.342269+00:00,WEO,10.11.100.1,POP,6.0,23.0,11.75,0.0,,,,,,,,
1,2018-02-13 11:53:04.006971+00:00,COO,10.11.100.1,BOP,6.0,23.0,4.58,0.0,,,,,,,,
2,2018-02-14 11:52:59.342269+00:00,,,COO,,,10.45,,,,,,,,,
3,2018-02-15 09:52:59.342269+00:00,,,DOP,,,12.45,,,,,,,,,
4,2018-02-16 04:52:59.342269+00:00,,,RRE,,,9.45,,,,,,,,,
5,2018-02-17 05:52:59.342269+00:00,,,WEQ,,,5.45,,,,,,,,,

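Now, for two consecutive dates, I want to find the mean, min, max and std, calculate the percentage difference between them, and check it against a threshold. If the percentage difference for any column value is 20% or more, I want to write that column value to a CSV file.

I have done this for the two consecutive dates 2018-02-12 and 2018-02-13: I computed the stats for each date and calculated the percentage difference. Here is my code: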

import pandas as pd

df = pd.read_csv("metrics.csv", parse_dates=["date"])

df.set_index("date", inplace=True)

# get the stats for the date 2018-02-12
# (columns are selected with a list; recent pandas versions reject a bare tuple of names)
df_prev = df.loc['2018-02-12'].resample('D')[['cpu', 'mem', 'load', 'drops', 'latency',
                                              'gw_latency', 'upload', 'download', 'sap_drops',
                                              'sap_latency']].agg(['mean', 'min', 'max', 'std']).fillna(0)

# get the stats for the date 2018-02-13
df_next = df.loc['2018-02-13'].resample('D')[['cpu', 'mem', 'load', 'drops', 'latency',
                                              'gw_latency', 'upload', 'download', 'sap_drops',
                                              'sap_latency']].agg(['mean', 'min', 'max', 'std']).fillna(0)

# calculate the percentage difference
df_diff_pt = abs(df_next - df_prev.values)/(df_prev.values) * 100
df_diff_pt.to_csv("percentage_diff.csv", index=False)

I get the following output:

cpu cpu cpu cpu mem mem mem mem load    load    load    load    drops   drops   drops   drops   latency latency latency latency gw_latency  gw_latency  gw_latency  gw_latency  upload  upload  upload  upload  download    download    download    download    sap_drops   sap_drops   sap_drops   sap_drops   sap_latency sap_latency sap_latency sap_latency
mean    min max std mean    min max std mean    min max std mean    min max std mean    min max std mean    min max std mean    min max std mean    min max std mean    min max std mean    min max std
20.25266967     9.375   5.406603424 0.5193349753        0   0.5944589255    20.31451491     3.544110148 2.184989728 190.2821256     0   76.67007734 3.85929503  19.89528796 17.31689683 2.697415388 1.680556319 0   19.34731935 4.084268605 14.86356963     23.19968083 10.35004075 24.58650424     7.780228594 9.740543925 4.47444575      0   0.4689312965    0.2667648736    0   29.78723404 14.15288291

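As you can see, cpu mean has crossed the threshold, and so have some of the other metrics.

Now I want to do this for every pair of consecutive dates ([2018-02-11, 2018-02-12], [2018-02-12, 2018-02-13], ...). Whenever a stat value for any metric exceeds the threshold (20%), I want to append it to a CSV file and carry on.

With my current approach, however, I can only enter two dates by hand, write the result to a CSV file, and then check it for threshold violations. That means I would create one .csv per pair of dates. I want to do this on the fly and end up with a single final .csv file containing the expected results. How can I do that?

One approach is to iterate over the DataFrame, pick the dates, and compare them: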

for i in df.index:
    for j in pd.to_timedelta(i, unit='D'):
        df_prev = df.loc[i].resample('D')['cpu', 'mem', 'load', 'drops', 'latency',
                                             'gw_latency', 'upload', 'download', 'sap_drops',
                                             'sap_latency'].agg(['mean', 'min', 'max', 'std']).fillna(0)

        df_next = df.loc[j].resample('D')['cpu', 'mem', 'load', 'drops', 'latency',
                                             'gw_latency', 'upload', 'download', 'sap_drops',
                                             'sap_latency'].agg(['mean', 'min', 'max', 'std']).fillna(0)

        df_diff_pt = abs(df_next - df_prev.values) / (df_prev.values) * 100
        break

    #further operations

But I get the following error:

ValueError: Invalid type for timedelta scalar: <class 'pandas._libs.tslib.Timestamp'>

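1 answer:

Answer 0 (score: 2)

I think it is better to create a second DataFrame shifted by 1 day with shift, remove the last row (it would be compared with a next value that does not exist), and then filter by tresh, using any to check the condition across the values of each row: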

df1 = df.resample('D')[['cpu', 'mem', 'load', 'drops', 'latency',
                        'gw_latency', 'upload', 'download', 'sap_drops',
                        'sap_latency']].agg(['mean', 'min', 'max', 'std']).fillna(0)

tresh = 50  # threshold in percent (the question asks for 20)
df11 = df1.shift(freq='D')  # each day's stats moved onto the following day
df2 = df1.sub(df11).abs().div(df11, fill_value=1).mul(100).iloc[:-1]
df2 = df2[(df2 > tresh).any(axis=1)]  # keep days where any stat exceeds the threshold
df2.to_csv("percentage_diff.csv", index=False)
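
To make the shift-based comparison easier to follow, here is a minimal, self-contained sketch on made-up data. The frame sample, its single cpu column, and the 20% threshold are assumptions for illustration only, not the asker's file. It omits fillna(0) and fill_value so that the arithmetic stays easy to check by hand.

```python
import pandas as pd

# Hypothetical sample: two readings per day over three days (not the asker's data).
idx = pd.to_datetime([
    "2018-02-12 01:00", "2018-02-12 13:00",
    "2018-02-13 01:00", "2018-02-13 13:00",
    "2018-02-14 01:00", "2018-02-14 13:00",
])
sample = pd.DataFrame({"cpu": [10.0, 20.0, 10.0, 20.0, 30.0, 50.0]}, index=idx)

# One row per day, with (metric, stat) column pairs.
daily = sample.resample("D")[["cpu"]].agg(["mean", "max"])

# Move every row onto the following day, so that prev.loc[d] holds the stats of day d - 1.
prev = daily.shift(freq="D")

# Absolute percentage change versus the previous day. The last row belongs to a day
# that exists only in the shifted frame, so it is dropped.
pct = daily.sub(prev).abs().div(prev).mul(100).iloc[:-1]

thresh = 20  # percent, as in the question
flagged = pct[(pct > thresh).any(axis=1)]
print(flagged)
```

In this sketch only 2018-02-14 is flagged: the daily mean moves from 15 to 40 and the daily max from 20 to 50. The first day has no predecessor, so its row is NaN and never passes the filter.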

Your loop solution should be:

import numpy as np

dfs = []
# iterate over the unique calendar days, skipping the last one (it has no next day)
for i in np.unique(df.index.strftime('%Y-%m-%d'))[:-1]:
    j = (pd.Timestamp(i) + pd.Timedelta(1, unit='d')).strftime('%Y-%m-%d')
    df_prev = df.loc[i].resample('D')[['cpu', 'mem', 'load', 'drops', 'latency',
                                       'gw_latency', 'upload', 'download', 'sap_drops',
                                       'sap_latency']].agg(['mean', 'min', 'max', 'std']).fillna(0)

    df_next = df.loc[j].resample('D')[['cpu', 'mem', 'load', 'drops', 'latency',
                                       'gw_latency', 'upload', 'download', 'sap_drops',
                                       'sap_latency']].agg(['mean', 'min', 'max', 'std']).fillna(0)

    df_diff_pt = abs(df_next - df_prev.values) / (df_prev.values) * 100
    df_diff_pt = df_diff_pt[(df_diff_pt > tresh).any(axis=1)]

    if not df_diff_pt.empty:
        dfs.append(df_diff_pt)
# to_csv returns None when given a path, so the result is not assigned
pd.concat(dfs).to_csv("percentage_diff2.csv", index=False)
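
As background on the ValueError in the question: as far as I can tell, pd.to_timedelta expects a duration (a number, string, or timedelta), so passing it a Timestamp fails, and a single value would not be iterable in a for loop anyway. A small sketch of the difference follows; the variable names day, raised, and next_key are illustrative only, and the exact exception type and message may differ between pandas versions.

```python
import pandas as pd

day = pd.Timestamp("2018-02-12")

# A Timestamp is a point in time, not a duration, so this conversion is rejected.
try:
    pd.to_timedelta(day, unit="D")
    raised = False
except (ValueError, TypeError) as exc:
    raised = True
    print("conversion failed:", exc)

# Adding a Timedelta to the Timestamp gives the following day, which can then be
# formatted as a string key for df.loc, as the loop in this answer does.
next_key = (day + pd.Timedelta(days=1)).strftime("%Y-%m-%d")
print(next_key)
```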