Question

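I borrowed some code to try to implement a function that computes the running median over a large amount of data. The current version is too slow for me (the tricky part is that I need to exclude all zeros from the running window). Here is the code: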

import numpy as np
from itertools import islice
from collections import deque
from bisect import bisect_left,insort

def median(s):
    # Median of the nonzero entries of the window; 0 if the window is all zeros
    sp = [nz for nz in s if nz!=0]
    Mnow = len(sp)
    if Mnow == 0:
        return 0
    else:
        return np.median(sp)

def RunningMedian(seq, M):
    seq = iter(seq)
    s = []

    # Set up list s (to be sorted) and load deque with first window of seq
    s = [item for item in islice(seq,M)]
    d = deque(s)

    # Sort it in increasing order and extract the median ("center" of the sorted window)
    s.sort()
    medians = [median(s)]
    for item in seq:
        old = d.popleft()          # pop oldest from left
        d.append(item)             # push newest in from right
        del s[bisect_left(s, old)] # locate insertion point and then remove old 
        insort(s, item)            # insert newest such that new sort is not required        
        medians.append(median(s))
    return medians

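It works well; the only drawback is that it is too slow. Can anyone help me make the code more efficient? Thanks.

After exploring all the possibilities, I found that the following simple code computes it relatively efficiently: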

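import numpy as np

def RunningMedian(x,N):
    # x must be a NumPy array; idx[i] holds the indices of the i-th window
    idx = np.arange(N) + np.arange(len(x)-N+1)[:,None]
    # Keep only positive entries of each window (use row != 0 if negatives can occur)
    b = [row[row>0] for row in x[idx]]
    # np.array(map(np.median, b)) only yields the medians on Python 2
    return np.array([np.median(c) for c in b])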

Thanks to @Divakar.

Answer 1

Here is one approach:

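import numpy as np

def RunningMedian(x,N):
    # x must be a NumPy array; idx[i] holds the indices of the i-th window
    idx = np.arange(N) + np.arange(len(x)-N+1)[:,None]
    # Keep only positive entries of each window (use row != 0 if negatives can occur)
    b = [row[row>0] for row in x[idx]]
    # np.array(map(np.median, b)) only yields the medians on Python 2
    return np.array([np.median(c) for c in b])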
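
To make the index trick concrete, here is a small worked example on made-up data (the array `x` and window size `N = 3` are chosen purely for illustration). Adding a row vector of offsets to a column vector of window starts broadcasts into a 2-D index array with one row per window:

```python
import numpy as np

# Illustrative input only; not data from the question.
x = np.array([3, 0, 5, 1, 0, 4])
N = 3

# Row i of idx is [i, i+1, ..., i+N-1], the positions covered by window i.
idx = np.arange(N) + np.arange(len(x) - N + 1)[:, None]

# Fancy indexing turns the index array into the windows themselves:
# [[3 0 5], [0 5 1], [5 1 0], [1 0 4]]
windows = x[idx]

# Drop the zeros from each window, then take the median of what is left.
nonzero = [row[row > 0] for row in windows]
medians = [float(np.median(row)) for row in nonzero]
print(medians)  # [4.0, 3.0, 3.0, 2.5]
```

Note that `x[idx]` materialises all `len(x) - N + 1` windows at once, so memory use grows with `len(x) * N`, and a window containing only zeros yields `nan` from `np.median` rather than `0`.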

I found a much faster one (tens of thousands of times faster), copied below:

from collections import deque
from bisect import insort, bisect_left
from itertools import islice
def running_median_insort(seq, window_size):
    """Contributed by Peter Otten"""
    seq = iter(seq)
    d = deque()
    s = []
    result = []
    for item in islice(seq, window_size):
        d.append(item)
        insort(s, item)
        result.append(s[len(d)//2])
    m = window_size // 2
    for item in seq:
        old = d.popleft()
        d.append(item)
        del s[bisect_left(s, old)]
        insort(s, item)
        result.append(s[m])
    return result

See the link: running_median
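
As written, `running_median_insort` does not skip zeros, returns the upper middle element for even-sized windows, and also emits values for the initial partial windows. A minimal sketch of how the same deque-plus-sorted-list idea might be adapted to the question's requirement follows. The function name `running_median_nonzero`, the choice to return `0` for an all-zero window, and the choice to emit only full windows are assumptions made here to mirror the question's code, not part of Peter Otten's original:

```python
from bisect import bisect_left, insort
from collections import deque


def running_median_nonzero(seq, window_size):
    """Median of the nonzero values in each full window (0 if there are none).

    Hypothetical adaptation for illustration; not the original recipe.
    """
    window = deque()   # every value in the current window, in arrival order
    nonzero = []       # sorted list of the nonzero values in the window
    result = []
    for item in seq:
        window.append(item)
        if item != 0:
            insort(nonzero, item)          # keep the sorted list up to date
        if len(window) > window_size:
            old = window.popleft()         # oldest value leaves the window
            if old != 0:
                del nonzero[bisect_left(nonzero, old)]
        if len(window) == window_size:     # only report full windows
            k = len(nonzero)
            if k == 0:
                result.append(0)
            elif k % 2:
                result.append(nonzero[k // 2])
            else:
                result.append((nonzero[k // 2 - 1] + nonzero[k // 2]) / 2)
    return result


print(running_median_nonzero([3, 0, 5, 1, 0, 4], 3))  # [4.0, 3.0, 3.0, 2.5]
```

Each step still costs a binary search plus a list insertion or deletion, so the work per element depends on the window size rather than on re-sorting or re-filtering the whole window.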

Answer 2

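One approach is to use two heaps that hold the lower half and the upper half of the data samples as they are processed. The algorithm: for each value, push it onto the appropriate heap and then "rebalance" the heaps so that their sizes differ by at most 1. To get the median, take the top element of the larger heap, or the average of the top elements of both heaps when they are the same size. The time complexity of this solution is O(n log(n)).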

from heapq import heappush, heappop


def add_number(number, lowers, highers):
    if not highers or number > highers[0]:
        heappush(highers, number)
    else:
        heappush(lowers, -number)  # for lowers we need a max heap


def rebalance(lowers, highers):
    if len(lowers) - len(highers) > 1:
        heappush(highers, -heappop(lowers))
    elif len(highers) - len(lowers) > 1:
        heappush(lowers, -heappop(highers))


def get_median(lowers, highers):
    if len(lowers) == len(highers):
        return (-lowers[0] + highers[0])/2
    elif len(lowers) > len(highers):
        return -lowers[0]
    else:
        return highers[0]


lowers, highers = [], []

for i in [12, 4, 5, 3, 8, 7]:
    add_number(i, lowers, highers)
    rebalance(lowers, highers)
    print(get_median(lowers, highers))
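
The loop above keeps every value it has seen, so it produces the median of all data so far rather than of a fixed-size sliding window. Under that same cumulative reading, here is a rough sketch of how the zero-exclusion from the question could be folded in. The generator name `cumulative_medians_nonzero` and the convention of yielding `0` before any nonzero value arrives are assumptions for illustration:

```python
from heapq import heappush, heappop


def cumulative_medians_nonzero(values):
    """Yield the median of all nonzero values seen so far.

    Hypothetical helper for illustration; yields 0 until a nonzero value arrives.
    """
    lowers, highers = [], []  # lowers: max-heap stored as negatives; highers: min-heap
    for v in values:
        if v != 0:
            # Same placement rule as add_number above.
            if not highers or v > highers[0]:
                heappush(highers, v)
            else:
                heappush(lowers, -v)
            # Same rebalancing rule as rebalance above.
            if len(lowers) - len(highers) > 1:
                heappush(highers, -heappop(lowers))
            elif len(highers) - len(lowers) > 1:
                heappush(lowers, -heappop(highers))
        if not lowers and not highers:
            yield 0
        elif len(lowers) == len(highers):
            yield (-lowers[0] + highers[0]) / 2
        elif len(lowers) > len(highers):
            yield -lowers[0]
        else:
            yield highers[0]


print(list(cumulative_medians_nonzero([12, 0, 4, 5, 0, 3, 8, 7])))
# [12, 12, 8.0, 5, 5, 4.5, 5, 6.0]
```

Turning this into a true sliding-window median would additionally require removing expired values from the heaps, which `heapq` does not support directly.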

How to efficiently compute a running median
