I'd like to know whether pandas has an equivalent of cumsum() or cummax() for the median, for example cummedian().
For example, given this DataFrame:
   a
1  5
2  7
3  6
4  4
What I want is:
df['a'].cummedian()
which should output:
5
6
6
5.5
Answer 0 (score: 4)
You can use expanding.median -
df.a.expanding().median()
1 5.0
2 6.0
3 6.0
4 5.5
Name: a, dtype: float64
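For reference, each output value is simply the median of everything seen up to that row. A minimal plain-Python sketch of that definition follows; the helper name `prefix_medians` is made up for illustration, and this is not how pandas computes the result internally.

```python
from statistics import median

def prefix_medians(values):
    """Median of values[:1], values[:2], ... (hypothetical helper, illustration only)."""
    return [float(median(values[:i])) for i in range(1, len(values) + 1)]

print(prefix_medians([5, 7, 6, 4]))  # [5.0, 6.0, 6.0, 5.5]
```

This recomputes each median from scratch, so it is only meant to make the semantics explicit, not to be fast.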
Timings
df = pd.DataFrame({'a' : np.arange(1000000)})
%timeit df['a'].apply(cummedian())
1 loop, best of 3: 1.69 s per loop
%timeit df.a.expanding().median()
1 loop, best of 3: 838 ms per loop
The winner is expanding.median. Divakar's method is memory-intensive and suffers a memory blow-up at this input size.
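To put the memory remark into rough numbers, here is a back-of-the-envelope estimate for the n-by-n window matrix that Divakar's answer (below) builds. It assumes float64 elements and that the full matrix is materialized at some point during the median computation; both are assumptions for illustration, not measurements, and `dense_window_bytes` is a hypothetical helper.

```python
def dense_window_bytes(n, itemsize=8):
    """Bytes for a dense n x n array (hypothetical helper; float64 assumed by default)."""
    return n * n * itemsize

n = 1_000_000
print(dense_window_bytes(n) / 1e12, "TB")  # 8.0 TB
```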
Answer 1 (score: 2)
We can create NaN-padded subarrays as rows with a strides-based function, like so -
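import numpy as np

def nan_concat_sliding_windows(x):
    n = len(x)
    # Left-pad with n-1 NaNs so that every row of the view has length n
    add_arr = np.full(n-1, np.nan)
    x_ext = np.concatenate((add_arr, x))
    strided = np.lib.stride_tricks.as_strided
    nrows = len(x_ext)-n+1
    s = x_ext.strides[0]
    # Row i is a view of x_ext[i:i+n]; no data is copied at this point
    return strided(x_ext, shape=(nrows,n), strides=(s,s))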
Sample run -
In [56]: x
Out[56]: array([5, 6, 7, 4])
In [57]: nan_concat_sliding_windows(x)
Out[57]:
array([[ nan, nan, nan, 5.],
[ nan, nan, 5., 6.],
[ nan, 5., 6., 7.],
[ 5., 6., 7., 4.]])
Thus, to get the sliding (cumulative) median for an array x, we would have a vectorized solution, like so -
np.nanmedian(nan_concat_sliding_windows(x), axis=1)
Hence, the final solution would be -
In [54]: df
Out[54]:
a
1 5
2 7
3 6
4 4
In [55]: pd.Series(np.nanmedian(nan_concat_sliding_windows(df.a.values), axis=1))
Out[55]:
0 5.0
1 6.0
2 6.0
3 5.5
dtype: float64
Answer 2 (score: 0)
A faster solution for the specific case of the cumulative median
In [1]: import timeit
In [2]: setup = """import bisect
   ...: import pandas as pd
   ...: def cummedian():
   ...:     l = []
   ...:     info = [0, True]
   ...:     def inner(n):
   ...:         bisect.insort(l, n)
   ...:         info[0] += 1
   ...:         info[1] = not info[1]
   ...:         median = info[0] // 2
   ...:         if info[1]:
   ...:             return (l[median] + l[median - 1]) / 2
   ...:         else:
   ...:             return l[median]
   ...:     return inner
   ...: df = pd.DataFrame({'a': range(20)})"""
In [3]: timeit.timeit("df['cummedian'] = df['a'].apply(cummedian())",setup=setup,number=100000)
Out[3]: 27.11604686321956
In [4]: timeit.timeit("df['expanding'] = df['a'].expanding().median()",setup=setup,number=100000)
Out[4]: 48.457676260100335
In [5]: 48.4576/27.116
Out[5]: 1.7870482372031273
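The same sorted-list idea can be written as a standalone function without pandas, which makes it easier to check. This is a sketch only; `cumulative_medians` is a made-up name, not a pandas API. Note that each `bisect.insort` call has to shift list elements, so the cost per insertion grows with the number of values already seen, and the timing above was taken on only 20 rows.

```python
import bisect
from statistics import median

def cumulative_medians(values):
    """Running median via a sorted list (hypothetical helper, not a pandas API)."""
    seen = []   # kept sorted at all times
    out = []
    for v in values:
        bisect.insort(seen, v)      # binary search, then an O(len(seen)) list insert
        mid = len(seen) // 2
        if len(seen) % 2:           # odd count: the single middle element
            out.append(float(seen[mid]))
        else:                       # even count: mean of the two middle elements
            out.append((seen[mid - 1] + seen[mid]) / 2)
    return out

data = [5, 7, 6, 4]
result = cumulative_medians(data)
# Cross-check against the definition: the median of each prefix
assert result == [median(data[:i]) for i in range(1, len(data) + 1)]
print(result)  # [5.0, 6.0, 6.0, 5.5]
```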