NumPy: compute cumulative median

Time: 2017-03-13 14:04:23

Tags: python numpy statistics vectorization

I have a sample of size n.

I want to compute, in NumPy, the median of sample[:i] for each i with 1 <= i <= n. For example, this is how I computed the mean for each i:

cummean = np.cumsum(sample) / np.arange(1, n + 1)

Can I do something similar for the median, without loops or comprehensions?

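5 answers:

Answer 0: (score: 2)

Use statistics.median with a cumulative list comprehension (note that the odd indices hold the medians of even-length lists, where the median is the average of the two middle elements, so the result is usually a float rather than an integer):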

>>> from statistics import median
>>> arr = [1, 3, 4, 2, 5, 3, 6]
>>> cum_median = [median(arr[:i+1]) for i in range(len(arr))]
>>> cum_median
[1, 2.0, 3, 2.5, 3, 3.0, 3]

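Answer 1: (score: 2)

Here's an approach that replicates the elements along rows to give us a 2D array. We then fill the upper triangular region with a large number, so that when we later sort the array along each row, we are effectively sorting all elements up to the diagonal element, which simulates the cumulative windows. Then, following the definition of the median, which picks the middle element or the mean of the two middle ones (for an even number of elements), we take the element at the first position, (0,0), then for the second row the mean of (1,0) & (1,1), for the third row (2,1), for the fourth row the mean of (3,1) & (3,2), and so on. Extracting those elements from the sorted array gives us the median values.

Thus, the implementation would be -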

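import numpy as np

def cummedian_sorted(a):
    n = a.size
    maxn = a.max()+1  # sentinel larger than every element
    a_tiled_sorted = np.tile(a,n).reshape(-1,n)  # row i is a copy of a
    mask = np.triu(np.ones((n,n),dtype=bool),1)  # strictly above the diagonal

    a_tiled_sorted[mask] = maxn  # keep only a[:i+1] in row i
    a_tiled_sorted.sort(1)  # sentinels move to the end of each row

    # Lower middle element of every window
    all_rows = a_tiled_sorted[np.arange(n), np.arange(n)//2].astype(float)
    # Even-length windows (odd row indices) also need the upper middle element
    idx = np.arange(1,n,2)
    even_rows = a_tiled_sorted[idx, np.arange(1,1+(n//2))]
    all_rows[idx] += even_rows
    all_rows[1::2] /= 2.0
    return all_rows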
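
The tile-and-sort idea described above can be traced on a small input. The sketch below uses an illustrative five-element array (not taken from the answer) and picks the two middle columns with `rows // 2` and `(rows + 1) // 2`, which is a slightly different indexing from the function above but targets the same positions.

```python
import numpy as np

a = np.array([1, 3, 4, 2, 5])  # illustrative input, an assumption for this sketch
n = a.size

# Row i is a copy of a and will stand for the window a[:i+1].
tiled = np.tile(a, n).reshape(-1, n)

# Overwrite everything right of the diagonal with a value above the maximum.
tiled[np.triu(np.ones((n, n), dtype=bool), 1)] = a.max() + 1

# After sorting, the first i+1 entries of row i are sorted(a[:i+1]).
tiled.sort(axis=1)

rows = np.arange(n)
lo = tiled[rows, rows // 2]        # lower middle element of each window
hi = tiled[rows, (rows + 1) // 2]  # upper middle element (same as lo for odd lengths)
cum_median = (lo + hi) / 2.0
print(cum_median)  # cum_median is [1.0, 2.0, 3.0, 2.5, 3.0]
```

The last sorted row is [1, 2, 3, 4, 5] and the fourth is [1, 2, 3, 4, 6], so the sentinel 6 sits past the window and never reaches the middle columns.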

Runtime test

Approaches -

from statistics import median

# Loopy solution from @Uriel's soln
def cummedian_loopy(arr):
    return [median(arr[:i]) for i in range(1,len(arr)+1)]

# Nan-fill based solution from @Nickil Maveli's soln   
def cummedian_nanfill(arr):
    a = np.tril(arr).astype(float)
    a[np.triu_indices(a.shape[0], k=1)] = np.nan
    return np.nanmedian(a, axis=1)

Timings -

Setup #1:

In [43]: a = np.random.randint(0,100,(100))

In [44]: print np.allclose(cummedian_loopy(a), cummedian_sorted(a))
    ...: print np.allclose(cummedian_loopy(a), cummedian_nanfill(a))
    ...: 
True
True

In [45]: %timeit cummedian_loopy(a)
    ...: %timeit cummedian_nanfill(a)
    ...: %timeit cummedian_sorted(a)
    ...: 
1000 loops, best of 3: 856 µs per loop
1000 loops, best of 3: 778 µs per loop
10000 loops, best of 3: 200 µs per loop

Setup #2:

In [46]: a = np.random.randint(0,100,(1000))

In [47]: print np.allclose(cummedian_loopy(a), cummedian_sorted(a))
    ...: print np.allclose(cummedian_loopy(a), cummedian_nanfill(a))
    ...: 
True
True

In [48]: %timeit cummedian_loopy(a)
    ...: %timeit cummedian_nanfill(a)
    ...: %timeit cummedian_sorted(a)
    ...: 
10 loops, best of 3: 118 ms per loop
10 loops, best of 3: 47.6 ms per loop
100 loops, best of 3: 18.8 ms per loop

Setup #3:

In [49]: a = np.random.randint(0,100,(5000))

In [50]: print np.allclose(cummedian_loopy(a), cummedian_sorted(a))
    ...: print np.allclose(cummedian_loopy(a), cummedian_nanfill(a))

True
True

In [54]: %timeit cummedian_loopy(a)
    ...: %timeit cummedian_nanfill(a)
    ...: %timeit cummedian_sorted(a)
    ...: 
1 loops, best of 3: 3.36 s per loop
1 loops, best of 3: 583 ms per loop
1 loops, best of 3: 521 ms per loop

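Answer 2: (score: 2)

Knowing that Python has a heapq module that lets you keep a running 'minimum' for an iterable, I searched for heapq and median, and found various items on streaming medians. This one: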

http://www.ardendertat.com/2011/11/03/programming-interview-questions-13-median-of-integer-stream/

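has a class streamMedian that maintains two heapqs, one holding the lower half of the values, the other the upper half. The median is either the 'top' of one heap or the mean of the tops of both. The class has an insert method and a getMedian method. Most of the work is in insert.

I copied that into an IPython session, and defined: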

def cummedian_stream(b):
    S=streamMedian()
    ret = []
    for item in b:
        S.insert(item)
        ret.append(S.getMedian())
    return np.array(ret)

Testing:

In [155]: a = np.random.randint(0,100,(5000))
In [156]: amed = cummedian_stream(a)
In [157]: np.allclose(cummedian_sorted(a), amed)
Out[157]: True
In [158]: timeit cummedian_sorted(a)
1 loop, best of 3: 781 ms per loop
In [159]: timeit cummedian_stream(a)
10 loops, best of 3: 39.6 ms per loop

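The heapq streaming approach is much faster.

The list comprehension that @Uriel gave is relatively slow. But if I use np.median in place of statistics.median, it is faster than @Divakar's sorted solution: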

def fastloop(a):
    return np.array([np.median(a[:i+1]) for i in range(len(a))])

In [161]: timeit fastloop(a)
1 loop, best of 3: 360 ms per loop

@Paul Panzer's partition approach is also good, but still slow compared to the streaming class.

In [165]: timeit cummedian_partition(a)
1 loop, best of 3: 391 ms per loop

(I can copy the streamMedian class into this answer if needed.)
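
For readers without access to the linked post, here is a minimal sketch of the two-heap idea described above. It is not the streamMedian class from that post: the names `RunningMedian`, `get_median`, and `cummedian_heap` are hypothetical, and the details (negated values to emulate a max-heap, the lower half holding the extra element) are assumptions of this sketch.

```python
import heapq
import numpy as np


class RunningMedian:
    """Hypothetical two-heap running median; not the class from the linked post."""

    def __init__(self):
        self._lo = []  # lower half as a max-heap, stored as negated values
        self._hi = []  # upper half as a min-heap

    def insert(self, x):
        # Push onto the lower half, then hand its largest value to the upper half.
        heapq.heappush(self._lo, -x)
        heapq.heappush(self._hi, -heapq.heappop(self._lo))
        # Rebalance so the lower half holds the extra element for odd counts.
        if len(self._hi) > len(self._lo):
            heapq.heappush(self._lo, -heapq.heappop(self._hi))

    def get_median(self):
        if len(self._lo) > len(self._hi):
            return float(-self._lo[0])
        return (-self._lo[0] + self._hi[0]) / 2.0


def cummedian_heap(values):
    rm = RunningMedian()
    out = []
    # tolist() yields plain Python scalars, so negation is safe for any dtype.
    for v in np.asarray(values).tolist():
        rm.insert(v)
        out.append(rm.get_median())
    return np.array(out)


print(cummedian_heap([1, 3, 4, 2]))  # medians 1.0, 2.0, 3.0, 2.5
```

Each insert costs O(log n) heap operations, so the whole cumulative median takes O(n log n) time with linear memory, which is consistent with the streaming version beating the quadratic array-based approaches in the timings above.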

Answer 3: (score: 1)

Is there room for a late entry?

def cummedian_partition(a):
    n = len(a)
    assert n%4 == 0 # for simplicity
    mn = a.min() - 1
    mx = a.max() + 1
    h = n//2
    N = n + h//2
    work = np.empty((h, N), a.dtype)
    work[:, :n] = a
    work[:, n] = 2*mn - a[0]
    i, j = np.tril_indices(h, -1)
    work[i, n-1-j] = (2*mn - a[1:h+1])[j]
    k, l = np.ogrid[:h, :h//2 - 1]
    work[:, n+1:] = np.where(k > 2*l+1, mx, 2 * mn - mx)
    out = np.partition(work, (N-n//2-1, N-n//2, h//2-1, h//2), axis=-1)
    out = np.r_[2*mn-out[:, h//2: h//2-2:-1], out[::-1, N-n//2-1:N-n//2+1]]
    out[::2, 0] = out[::2, 1]
    return np.mean(out, axis=-1)

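The algorithm uses partition, which has linear complexity. Some gymnastics are needed because np.partition does not support per-row split points. The overall complexity and memory required are quadratic.

Timings compared to the current best: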
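
As a building block, the median of a single window can be read off after an np.partition call without a full sort. The sketch below shows only that one-window case; the helper name `median_by_partition` is hypothetical, and it does not reproduce the padding trick the function above uses to handle all windows in one 2D call.

```python
import numpy as np


def median_by_partition(window):
    """Median of a non-empty 1-D array using np.partition instead of a full sort."""
    window = np.asarray(window)
    n = window.size
    lo, hi = (n - 1) // 2, n // 2  # the two middle ranks (equal when n is odd)
    kth = [lo] if lo == hi else [lo, hi]
    # After partitioning, positions lo and hi hold the values they would
    # hold in the fully sorted array.
    part = np.partition(window, kth)
    return (part[lo] + part[hi]) / 2.0


print(median_by_partition([1, 3, 4, 2]))     # 2.5
print(median_by_partition([1, 3, 4, 2, 5]))  # 3.0
```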

for j in (100, 1000, 5000):
    a = np.random.randint(0, 100, (j,))
    print('size', j)
    print('correct', np.allclose(cummedian_partition(a), cummedian_sorted(a)))
    print('Divakar', timeit(lambda: cummedian_sorted(a), number=10))
    print('PP', timeit(lambda: cummedian_partition(a), number=10))

#  size 100
#  correct True
#  Divakar 0.0022412699763663113
#  PP 0.002393342030700296
#  size 1000
#  correct True
#  Divakar 0.20881508802995086
#  PP 0.10222102201078087
#  size 5000
#  correct True
#  Divakar 6.158387024013791
#  PP 3.437395485001616

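Answer 4: (score: 1)

There is an approximate solution. Treat the list of values arr as a probability mass function. You can then use np.cumsum(arr) to get the cumulative distribution function. By definition, the median is the point that accounts for half of the probability. This gives you an approximate solution: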

arr[np.searchsorted(np.cumsum(arr), np.cumsum(arr)/2)]
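
To make the one-liner easier to follow, the sketch below splits it into named steps on an illustrative array. It assumes the values are non-negative, since np.searchsorted expects its first argument to be sorted and the running sum is only non-decreasing in that case. How close the result comes to the exact cumulative median depends on the data and is not checked here.

```python
import numpy as np

arr = np.array([1, 3, 4, 2, 5, 3, 6])  # illustrative non-negative data

cdf = np.cumsum(arr)               # running totals, non-decreasing because arr >= 0
half = cdf / 2                     # half of each running total
idx = np.searchsorted(cdf, half)   # first position whose running total reaches that half
approx = arr[idx]                  # value found at that position
print(idx, approx)
```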