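Sequential rolling data processing with pandas

Asked: 2018-01-29 11:07:38

Tags: pandas python-3.6 sequential

I am using the pandas rolling function to generate sequential data. My main window size is 51, and within this initial window I need to compute various metrics over different sub-windows. For example, with this dummy data: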

df = pd.DataFrame(np.random.randint(0,800,size=(1000, 3)), columns=list('ABC'))

My function:

def test(data):
     meanMov = np.zeros((51,3))
     mean = np.mean(data[0:31,:],axis=0)
     for i in range(0,16):
         meanMov[i] = mean
     mean = np.mean(data[20:50,:], axis=0)
     for i in range(35,51):
         meanMov[i] = mean
     for i in range(16,35):
         meanMov[i] = np.mean(data[(i-15):(i+15+1)], axis=0)
     return meanMov.mean()

Running the function:

r = df.rolling(51)
entr = (r.apply(test)).dropna(axis=0, how='all')

When I run the function, I get the following error:

>>> entr =  (r.apply(test)).dropna(axis=0, how='all')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Users\AppData\Local\Programs\Python\Python36\lib\site-packages\pandas\core\window.py", line 1207, in apply
    return super(Rolling, self).apply(func, args=args, kwargs=kwargs)
  File "C:\Users\AppData\Local\Programs\Python\Python36\lib\site-packages\pandas\core\window.py", line 856, in apply
    center=False)
  File "C:\Users\AppData\Local\Programs\Python\Python36\lib\site-packages\pandas\core\window.py", line 799, in _apply
    result = np.apply_along_axis(calc, self.axis, values)
  File "C:\Users\AppData\Local\Programs\Python\Python36\lib\site-packages\numpy\lib\shape_base.py", line 116, in apply_along_axis
    res = asanyarray(func1d(inarr_view[ind0], *args, **kwargs))
  File "C:\Users\AppData\Local\Programs\Python\Python36\lib\site-packages\pandas\core\window.py", line 795, in calc
    closed=self.closed)
  File "C:\Users\AppData\Local\Programs\Python\Python36\lib\site-packages\pandas\core\window.py", line 853, in f
    offset, func, args, kwargs)
  File "pandas\_libs\window.pyx", line 1450, in pandas._libs.window.roll_generic (pandas\_libs\window.c:36061)
  File "<stdin>", line 3, in test
IndexError: too many indices for array

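How can I compute the different means for all columns and save them for further processing?

Thanks a lot!

1 Answer:

Answer 0: (score: 0)

This may be the solution you are looking for: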

import pandas as pd
import numpy as np

# Create dummy data
df = pd.DataFrame(np.random.randint(0,800,size=(1000, 3)), columns=list('ABC'))

# To include this data into the dataframe with rolling means, start by creating a copy
df_complete = df.copy()

# Use the set of considered window sizes in this loop
for ws in [51, 45, 55]:
    r = df.rolling(window=ws, center=False).mean()

    # Give the following names to the columns with rolling windows: X_S, 
    # where X - name of data column and S - current window size
    r.columns = ["%s_%d" % (c, ws) for c in r.columns]

    # Add new columns to the aggregate dataframe (align using index)
    df_complete = pd.concat([df_complete, r], axis=1)

print(df_complete.sample(5))

Sample output:

       A    B    C        A_51        B_51        C_51        A_45  \
584  169  624  332  407.372549  475.333333  355.784314  405.200000   
863  477  726  218  444.980392  429.431373  458.901961  469.311111   
994  162  161  301  407.843137  415.431373  396.117647  417.155556   
873  600   82  413  445.137255  402.411765  471.490196  433.955556   
6    381  274  681         NaN         NaN         NaN         NaN   

           B_45        C_45        A_55        B_55        C_55  
584  467.622222  350.755556  409.890909  462.800000  354.490909  
863  448.777778  481.400000  449.418182  416.309091  448.563636  
994  401.555556  400.688889  405.036364  406.309091  383.454545  
873  392.822222  469.577778  454.945455  415.872727  474.327273  
6           NaN         NaN         NaN         NaN         NaN

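Keep in mind that NaN appears at the start of each rolling-mean column, in the rows whose row number is smaller than the corresponding window size (the mean cannot be computed there). These NaN values can be dealt with after df_complete has been created, for example with df_complete.dropna().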
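The NaN behaviour can be checked on a smaller frame. The sketch below is illustrative only: the 100-row frame, the seed, and the variable names are assumptions that do not come from the original post. It shows dropna() next to the min_periods argument of rolling(), which allows partial windows at the start instead of producing NaN.

```python
import numpy as np
import pandas as pd

# Small dummy frame (hypothetical size and seed, for illustration only)
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.integers(0, 800, size=(100, 3)), columns=list("ABC"))

window = 51
rolled = df.rolling(window=window).mean()

# The first (window - 1) rows have no complete window yet, so they are NaN
n_nan_rows = int(rolled["A"].isna().sum())

# Option 1: drop the rows for which no full window exists
trimmed = rolled.dropna()

# Option 2: accept partial windows at the start instead of NaN
partial = df.rolling(window=window, min_periods=1).mean()

print(n_nan_rows, len(trimmed), int(partial.isna().sum().sum()))
```

Which option fits depends on whether means over partial windows are acceptable for the later processing steps.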

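Regarding your code (specifically the test function), I would like to point out that according to https://pandas.pydata.org/pandas-docs/stable/generated/pandas.core.window.Rolling.apply.html, the specified function needs to "produce a single value from an ndarray input", whereas you are trying to return the means of several columns at once. In my opinion, there is also no need to write a custom function for something as common as mean().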
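For reference, rolling(...).apply calls the function once per column, and each call receives a one-dimensional window. That is why the two-index slice data[0:31,:] in test raises "IndexError: too many indices for array". If the custom metric is still wanted, a one-column version could look like the sketch below. The name smoothed_mean is hypothetical, the slice bounds are copied unchanged from the question, and raw=True assumes a pandas version that supports that argument (0.23 or later).

```python
import numpy as np
import pandas as pd

def smoothed_mean(window_values):
    """Return a single number for one 1-D window of length 51."""
    data = np.asarray(window_values, dtype=float)
    mean_mov = np.zeros(51)
    # Left edge: one mean reused for positions 0..15
    mean_mov[:16] = data[0:31].mean()
    # Right edge: one mean reused for positions 35..50 (bounds as in the question)
    mean_mov[35:] = data[20:50].mean()
    # Middle: centred 31-point mean for positions 16..34
    for i in range(16, 35):
        mean_mov[i] = data[i - 15:i + 16].mean()
    return mean_mov.mean()

df = pd.DataFrame(np.random.randint(0, 800, size=(1000, 3)), columns=list("ABC"))

# apply() runs per column, so every column gets its own rolling result
entr = df.rolling(51).apply(smoothed_mean, raw=True).dropna(axis=0, how="all")
print(entr.shape)
```

The result keeps the columns A, B and C, so it can be concatenated with the original frame in the same way as the rolling means above.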

I also tried the rolling_mean() function suggested in the comments:

r = pd.rolling_mean(df, window=51, center=False)

But this produces a warning that recommends the line used in the solution above:

pd.rolling_mean is deprecated for DataFrame and will be removed in a future version, replace with 
    DataFrame.rolling(window=51,center=False).mean()
  """Entry point for launching an IPython kernel."

I hope you find the code and comments useful.