如何按列分组并计算到日期列的滚动平均值

时间:2019-12-31 04:21:12

标签: python pandas-groupby data-science data-analysis rolling-computation

样本数据:

import pandas as pd
df = pd.DataFrame({'A_confidence':[100,100,100,100,100,100,100,100,85,85,85,0,100,95,100,80,100,100,100,100,85,100,100,100,100,100,100,100,0,100,100,100,100,100,100,100,100,0,0,0,0,100,100,100,100,100,0,100,100],'B_confidence':[100,100,100,99.972,100,100,100,100,100,100,100,0,100,100,100,70,100,100,0,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,100,98,100,0,0,0,0,100,0,0,100,0,85,0,0],'C_confidence':[40,40,0,91.3,93,93,93,40,0,0,0,0,99,0,0,0,97.6,96.3,0,96.3,0,0,0,98.8,0,99.2,99.2,99.2,0,0,97.8,97.8,40,0,0,97.5,97.2,0,0,0,0,0,0,0,99,0,0,0,0],'D_confidence':[30,30,30,0,30,30,30,97.5,0,0,0,0,98.5,99.6,0,30,98.7,30,0,30,30,30,30,99.2,30,30,30,30,30,91.5,30,30,0,30,30,93.8,99,0,0,0,30,97.8,0,0,99.3,0,0,0,0],'A_name':['EEEEE','EEEEE','EEEEE','EEEEE','EEEEE','EEEEE','EEEEE','DDDDD','DDDDD','DDDDD','DDDDD','NULL','CCCCC','DDDDD','CCCCC','EEEEE','CCCCC','EEEEE','EEEEE','EEEEE','BBBBB','CCCCC','CCCCC','CCCCC','CCCCC','DDDDD','DDDDD','DDDDD','NULL','EEEEE','AAAAA','AAAAA','EEEEE','CCCCC','CCCCC','CCCCC','CCCCC','NULL','NULL','NULL','NULL','CCCCC','EEEEE','EEEEE','AAAAA','DDDDD','NULL','BBBBB','BBBBB'],'B_name':['zzzzz','zzzzz','xxxxx','yyyyy','mmmmm','mmmmm','mmmmm','nnnnn','qqqqq','qqqqq','qqqqq','NULL','ppppp','sssss','qqqqq','qqqqq','eeeee','jjjjj','NULL','jjjjj','qqqqq','qqqqq','qqqqq','ppppp','qqqqq','kkkkk','kkkkk','kkkkk','qqqqq','jjjjj','iiiii','iiiii','adfgh','qqqqq','qqqqq','adfgh','ppppp','NULL','NULL','NULL','NULL','ppppp','NULL','NULL','adfgh','NULL','adfgh','NULL','NULL'],'Time':['39:14.3','39:14.3','44:14.2','53:23.3','48:09.2','48:09.2','48:09.2','59:33.1','45:56.9','45:56.9','45:56.9','10:18.9','30:47.0','52:10.7','02:06.4','27:05.7','34:05.0','43:25.7','44:10.1','43:25.7','02:55.0','08:54.1','08:54.1','04:24.3','08:54.1','26:53.7','26:53.7','26:53.7','54:46.6','58:07.0','58:52.2','58:52.2','36:40.9','43:55.4','43:55.4','07:48.8','25:37.3','10:01.3','21:11.9','21:11.9','42:52.3','56:04.8','59:47.3','59:47.3','00:58.5','01:44.1','12:28.4','30:08.8','30:08.8']})

我一直在搜索论坛,但没有找到一种方法来确定执行滚动平均值以获取平均值(7天)的方法。 预期:

1. group unique combinations of A_name and B_name(wrt to date)
2. Calculate the rolling average for 7 days based on time(uploaded time) for
a. "A_confidence",
b. "B_confidence",
c. "C_confidence",
d. "D_confidence",
e. "E_confidence",
f. "F_confidence"

时间:是需要格式化的日期列。 在线示例未显示groupby和多列滚动平均值的组合。 提前致谢 时间栏确实存在,请在投票之前仔细查看图像。

Same dataframe loaded on my local

> A_confidence  B_confidence    C_confidence    D_confidence    A_name  B_name  Time
> 0 100 100.000 40.0    30.0    EEEEE   zzzzz   39:14.3
> 1 100 100.000 40.0    30.0    EEEEE   zzzzz   39:14.3
> 2 100 100.000 0.0 30.0    EEEEE   xxxxx   44:14.2
> 3 100 99.972  91.3    0.0 EEEEE   yyyyy   53:23.3
> 4 100 100.000 93.0    30.0    EEEEE   mmmmm   48:09.2
> 5 100 100.000 93.0    30.0    EEEEE   mmmmm   48:09.2
> 6 100 100.000 93.0    30.0    EEEEE   mmmmm   48:09.2

2 个答案:

答案 0 :(得分:0)

从这个问题还不清楚,call_type_id列的作用是什么。在您澄清之前,我将不予理

没有窗口功能:

这是一个简单的变体,根本不使用窗口函数。

确保在(call_dt_key,aes_raw)上有索引。

CTE_Dates返回表中所有日期的列表,并计算每天的平均值。第一天需要这个average_current_day。服务器将以任何方式扫描整个索引,因此计算这种平均值很便宜。

然后,对于每个不同的日子,我都使用自加入来计算前40天的平均值。这将在第一天返回NULL,在主查询中将其替换为average_current_day。

您不必在这里使用CTE,它只是使查询更易于阅读。

CTE_Dates

AS

from PyQt5.QtWidgets import *
import sys

databases = []

# Randomly generated database names
for i in range(10):
    databases.append('db' + str(i))

db_count = len(databases)
row_num = 0

class App(QDialog):

    def __init__(self):
        super().__init__()
        self.title = 'MySQL Timing Discrepancies'
        self.left = 10
        self.top = 10
        self.width = 320
        self.height = 100
        self.initUI()

    def initUI(self):
        self.setWindowTitle(self.title)
        self.setGeometry(self.left, self.top, self.width, self.height)

        self.createGridLayout()

        windowLayout = QVBoxLayout()
        windowLayout.addWidget(self.horizontalGroupBox)
        self.setLayout(windowLayout)

        self.show()

    def createGridLayout(self):
        self.horizontalGroupBox = QGroupBox("Databases")
        layout = QGridLayout()
        layout.setColumnStretch(0, 0)
        layout.setColumnStretch(1, 0)
        layout.setColumnStretch(2, 0)

        # Generating checkboxes here
        for i in range(db_count):
            col_num = i % 3
            row_num = i if col_num == 0 else row_num
            cb = QCheckBox(databases[i])
            cb.setChecked(False)
            layout.addWidget(cb, row_num, col_num)

            # Does nothing because they are defaulted to False
            if cb.isChecked():
                print(cb.text())

        self.horizontalGroupBox.setLayout(layout)


if __name__ == '__main__':
    app = QApplication(sys.argv)
    ex = App()
    sys.exit(app.exec_())

选择

SELECT

    call_dt_key

    ,call_dt_key - INTERVAL '41 day' AS dt_from

    ,call_dt_key - INTERVAL '1 day' AS dt_to

    ,AVG(test_aes.aes_raw) AS average_current_day

FROM test_aes

GROUP BY call_dt_key

CTE_Dates.call_dt_key

,COALESCE(prev40.average_40, CTE_Dates.average_current_day) AS average_40

ORDER BY call_dt_key;

结果:

| call_dt_key | average_40 |

| -------------------------------- | ----------------- --- |

| 2016年1月1日00:00:00 | 15 |

| 2016年1月5日00:00:00 | 15 |

| 2016年1月10日00:00:00 | 15.333333333333334 |

| 2016年1月15日00:00:00 | 15.294117647058824 |

| 2016年1月16日00:00:00 | 15.5 |

| 2016年1月20日00:00:00 | 15.652173913043478 |

| 2016年1月21日00:00:00 | 15.6 |

| 2016年1月31日00:00:00 | 15.555555555555555 |

| 2016年2月1日00:00:00 | 15.517241379310345 |

| 2016年2月10日00:00:00 | 15.483870967741936 |

| 2016年2月15日00:00:00 | 15.652173913043478 |

| 2016年2月26日00:00:00 | 15.333333333333334 |

| 2016年3月4日00:00:00 | 15 |

| 2016年3月18日00:00:00 | 15 |

答案 1 :(得分:0)

很多问题尚不清楚,示例数据中没有日期列。但是,假设您要执行多列分组,然后在具有指定窗口的几列上执行均值滚动,那么以下示例希望它会很有用:

df.groupby(['A_name','B_name'])['A_confidence','B_confidence','C_confidence'].rolling(window=3).mean()

这将给出以下结果,该结果按A_name和B_name分组,然后对

执行滚动平均

这是结果。

enter image description here