Question

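I am in the process of migrating an MSSQL database to MySQL, and I have decided to move some stored procedures to Python rather than rewrite them in MySQL. I am using Pandas 0.23 on Python 3.5.4.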

The old MSSQL database uses a lot of window functions. So far I have successfully converted them to Pandas using pandas.DataFrame.rolling, as follows:

MSSQL

AVG([Close]) OVER (ORDER BY DateValue ROWS 13 PRECEDING) AS MA14

Python

df['MA14'] = df.Close.rolling(14).mean()
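One difference between the two versions is worth checking. A SQL frame such as ROWS 13 PRECEDING averages however many rows are available at the start of the data, while rolling(14) returns NaN until 14 rows exist. The sketch below uses a window of 3 and made-up Close values (both are assumptions for illustration) to show how min_periods=1 reproduces the SQL behaviour:

```python
import pandas as pd

# Hypothetical sample data; only the column name Close comes from the question.
df = pd.DataFrame({'Close': [10.0, 11.0, 12.0, 13.0, 14.0]})

# Default min_periods equals the window size: NaN until 3 rows are available.
df['MA3_strict'] = df.Close.rolling(3).mean()

# min_periods=1 averages the rows seen so far, like ROWS 2 PRECEDING in SQL.
df['MA3_sql'] = df.Close.rolling(3, min_periods=1).mean()

print(df)
```

If the leading NaN values are acceptable, the original rolling(14) call can stay as it is.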

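I have been struggling to find a solution for the PARTITION BY part of the MSSQL window functions in Python. Since posting, and based on the feedback, I have been working on a solution using pandas groupby...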

https://pandas.pydata.org/pandas-docs/version/0.23.0/groupby.html

For example, suppose the MSSQL is:

AVG([Close]) OVER (PARTITION BY myCol ORDER BY DateValue ROWS 13 PRECEDING) AS MA14

What I have done so far:

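Col1 contains the categorical data that I want to group by and apply a function to on a rolling basis. There is also a date column, so Col1 and the date column together identify a unique record in the df.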

1. This gives the mean for each Col1 group, but the result is aggregated

grouped = df.groupby(['Col1']).mean()
print(grouped.tail(20))

2. This appears to apply the rolling mean to each categorical group of Col1, which is what I am after

grouped = df.groupby(['Col1']).Close.rolling(14).mean()
print(grouped.tail(20))

3. Assign the result to the df as a new column RM

df['RM'] = df.groupby(['Col1']).Close.rolling(14).mean()
print(df.tail(20))

Pandas does not like this step, and I get the error...

TypeError: incompatible index of inserted column with frame index

I have written a simple example that might help:

How do I get the results of #2 into the df from #1, or something similar?

import numpy as np
import pandas as pd

dta = {'Colour': ['Red','Red','Blue','Blue','Red','Red','Blue','Red','Blue','Blue','Blue','Red'],
         'Year': [2014,2015,2014,2015,2016,2017,2018,2018,2016,2017,2013,2013],
         'Val':[87,78,863,673,74,81,756,78,694,701,804,69]}
df = pd.DataFrame(dta)
df = df.sort_values(by=['Colour','Year'], ascending=True)
print(df)

#1 Add a calculated column to the df. This averages over all of column Val
df['ValMA3'] = df.Val.rolling(3).mean().round(0)
print(df)


#2 Group by Colour. This calculates the average by group correctly.
# Where are the other columns from my original dataframe?
# What if I have multiple calculated columns to add?

gf = df.groupby(['Colour'])
gf = gf.Val.rolling(3).mean().round(0)
print(gf)

Answer 1

I am fairly sure the transform function can help here.

df.groupby('Col1')['Val'].transform(lambda x: x.rolling(3, 2).mean())

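Here 3 is the size of the rolling window and 2 is the minimum number of periods (min_periods) required to produce a value.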
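As a minimal sketch, this is how the transform approach might look on the sample data from the question, using its Colour and Val columns in place of Col1. Because transform returns a result with the same index as the original frame, it can be assigned straight back as a new column, and further calculated columns can be added the same way:

```python
import pandas as pd

dta = {'Colour': ['Red', 'Red', 'Blue', 'Blue', 'Red', 'Red',
                  'Blue', 'Red', 'Blue', 'Blue', 'Blue', 'Red'],
       'Year': [2014, 2015, 2014, 2015, 2016, 2017,
                2018, 2018, 2016, 2017, 2013, 2013],
       'Val': [87, 78, 863, 673, 74, 81, 756, 78, 694, 701, 804, 69]}

# Sort first so each group's rows are in Year order before rolling.
df = pd.DataFrame(dta).sort_values(by=['Colour', 'Year'])

# Rolling mean per Colour: window of 3 rows, at least 2 rows required.
df['ValMA3'] = (
    df.groupby('Colour')['Val']
      .transform(lambda x: x.rolling(3, min_periods=2).mean())
)

print(df)
```

The first row of each Colour group is NaN because only one row is available there, which is fewer than min_periods.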

(Don't forget to sort the DataFrame before applying the running calculation.)
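The TypeError in the question most likely arises because groupby(...).rolling(...) returns a Series indexed by both the group key and the original row label, which does not match the frame's index. One possible alternative to transform is to drop the group level before assigning. The sketch below uses made-up data, with column names Col1, DateValue and Close borrowed from the question and a window of 2 chosen only for illustration:

```python
import pandas as pd

# Hypothetical data; the values are invented for this example.
df = pd.DataFrame({
    'Col1': ['a', 'b', 'a', 'b', 'a', 'b'],
    'DateValue': pd.to_datetime(['2018-01-03', '2018-01-01', '2018-01-01',
                                 '2018-01-02', '2018-01-02', '2018-01-03']),
    'Close': [3.0, 10.0, 1.0, 20.0, 2.0, 30.0],
})

# Equivalent of ORDER BY DateValue within each partition.
df = df.sort_values(['Col1', 'DateValue'])

# The result has a two-level index: (Col1, original row label).
rolled = df.groupby('Col1')['Close'].rolling(2, min_periods=1).mean()

# Drop the Col1 level so the index lines up with df again, then assign.
df['RM'] = rolled.reset_index(level=0, drop=True)

print(df)
```

This relies on the frame's row labels being unique, since the assignment aligns on the index.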
