C中的滚动中值算法

时间:2009-08-20 22:50:44

标签: c algorithm r statistics median

我目前正致力于在C中实现滚动中值滤波器(类似于滚动均值滤波器)的算法。从我对文献的研究中,似乎有两种合理有效的方法。第一种是对值的初始窗口进行排序,然后执行二进制搜索以插入新值并在每次迭代时删除现有值。

第二个(来自Hardle和Steiger,1995,JRSS-C,算法296)构建了一个双端堆结构,一端是maxheap,另一端是minheap,中间是中间值。这产生线性时间算法而不是O(n log n)。

这是我的问题:实现前者是可行的,但我需要在数百万个时间序列中运行它,因此效率很重要。后者证明非常难以实施。我在R的stats包的代码的Trunmed.c文件中找到了代码,但它是难以理解的。

有没有人知道线性时间滚动中值算法的编写良好的C实现?

编辑:链接到Trunmed.c代码http://google.com/codesearch/p?hl=en&sa=N&cd=1&ct=rc#mYw3h_Lb_e0/R-2.2.0/src/library/stats/src/Trunmed.c

13 个答案:

答案 0 :(得分:25)

我已经看了几次R src/library/stats/src/Trunmed.c,因为我想在一个独立的C ++ class / C子例程中也有类似的东西。请注意,这实际上是两个实现中的一个,请参阅src/library/stats/man/runmed.Rd(帮助文件的来源),其中包含

\details{
  Apart from the end values, the result \code{y = runmed(x, k)} simply has
  \code{y[j] = median(x[(j-k2):(j+k2)])} (k = 2*k2+1), computed very
  efficiently.

  The two algorithms are internally entirely different:
  \describe{
    \item{"Turlach"}{is the Härdle-Steiger
      algorithm (see Ref.) as implemented by Berwin Turlach.
      A tree algorithm is used, ensuring performance \eqn{O(n \log
        k)}{O(n * log(k))} where \code{n <- length(x)} which is
      asymptotically optimal.}
    \item{"Stuetzle"}{is the (older) Stuetzle-Friedman implementation
      which makes use of median \emph{updating} when one observation
      enters and one leaves the smoothing window.  While this performs as
      \eqn{O(n \times k)}{O(n * k)} which is slower asymptotically, it is
      considerably faster for small \eqn{k} or \eqn{n}.}
  }
}

很高兴看到这种方式以更独立的方式重复使用。你是志愿者吗?我可以帮助一些R位。

编辑1 :除了上面的旧版Trunmed.c的链接,这里有当前的SVN副本

编辑2 :Ryan Tibshirani在fast median binning上有一些C和Fortran代码,这可能是窗口方法的合适起点。

答案 1 :(得分:16)

我无法找到带有顺序统计的c ++数据结构的现代实现,因此最终在MAK建议的顶级编码器链接中实现了这两个想法(Match Editorial:向下滚动到FloatingMedian)。

两个多字节

第一个想法将数据划分为两个数据结构(堆,多重集等),每次插入/删除O(ln N)不允许动态更改分位数而无需大量成本。即我们可以有一个滚动中位数,或者滚动75%但不能同时滚动。

细分树

第二种想法使用O(ln N)的分段树进行插入/删除/查询,但更灵活。最重要的是“N”是您的数据范围的大小。因此,如果你的滚动中位数有一百万个项目的窗口,但你的数据从1..65536变化,那么滚动窗口的每次移动只需要16次操作!!

c ++代码类似于Denis上面发布的内容(“这是一个简单的量化数据算法”)

GNU订单统计树

在放弃之前,我发现stdlibc ++包含订单统计树!!!

这些有两个关键操作:

iter = tree.find_by_order(value)
order = tree.order_of_key(value)

请参阅libstdc++ manual policy_based_data_structures_test(搜索“拆分和加入”)。

我已将这个树包装在一个方便标题中,用于支持c ++ 0x / c ++ 11样式的部分typedef的编译器:

#if !defined(GNU_ORDER_STATISTIC_SET_H)
#define GNU_ORDER_STATISTIC_SET_H
#include <ext/pb_ds/assoc_container.hpp>
#include <ext/pb_ds/tree_policy.hpp>

// A red-black tree table storing ints and their order
// statistics. Note that since the tree uses
// tree_order_statistics_node_update as its update policy, then it
// includes its methods by_order and order_of_key.
template <typename T>
using t_order_statistic_set = __gnu_pbds::tree<
                                  T,
                                  __gnu_pbds::null_type,
                                  std::less<T>,
                                  __gnu_pbds::rb_tree_tag,
                                  // This policy updates nodes'  metadata for order statistics.
                                  __gnu_pbds::tree_order_statistics_node_update>;

#endif //GNU_ORDER_STATISTIC_SET_H

答案 2 :(得分:14)

我做了C implementation here。这个问题还有一些细节:Rolling median in C - Turlach implementation

样本用法:

int main(int argc, char* argv[])
{
   int i,v;
   Mediator* m = MediatorNew(15);

   for (i=0;i<30;i++)
   {
      v = rand()&127;
      printf("Inserting %3d \n",v);
      MediatorInsert(m,v);
      v=MediatorMedian(m);
      printf("Median = %3d.\n\n",v);
      ShowTree(m);
   }
}

答案 3 :(得分:8)

这是量化数据的简单算法(几个月后):

""" median1.py: moving median 1d for quantized, e.g. 8-bit data

Method: cache the median, so that wider windows are faster.
    The code is simple -- no heaps, no trees.

Keywords: median filter, moving median, running median, numpy, scipy

See Perreault + Hebert, Median Filtering in Constant Time, 2007,
    http://nomis80.org/ctmf.html: nice 6-page paper and C code,
    mainly for 2d images

Example:
    y = medians( x, window=window, nlevel=nlevel )
    uses:
    med = Median1( nlevel, window, counts=np.bincount( x[0:window] ))
    med.addsub( +, - )  -- see the picture in Perreault
    m = med.median()  -- using cached m, summ

How it works:
    picture nlevel=8, window=3 -- 3 1s in an array of 8 counters:
        counts: . 1 . . 1 . 1 .
        sums:   0 1 1 1 2 2 3 3
                        ^ sums[3] < 2 <= sums[4] <=> median 4
        addsub( 0, 1 )  m, summ stay the same
        addsub( 5, 1 )  slide right
        addsub( 5, 6 )  slide left

Updating `counts` in an `addsub` is trivial, updating `sums` is not.
But we can cache the previous median `m` and the sum to m `summ`.
The less often the median changes, the faster;
so fewer levels or *wider* windows are faster.
(Like any cache, run time varies a lot, depending on the input.)

See also:
    scipy.signal.medfilt -- runtime roughly ~ window size
    http://stackoverflow.com/questions/1309263/rolling-median-algorithm-in-c

"""

from __future__ import division
import numpy as np  # bincount, pad0

__date__ = "2009-10-27 oct"
__author_email__ = "denis-bz-py at t-online dot de"


#...............................................................................
class Median1:
    """ moving median 1d for quantized, e.g. 8-bit data """

    def __init__( s, nlevel, window, counts ):
        s.nlevel = nlevel  # >= len(counts)
        s.window = window  # == sum(counts)
        s.half = (window // 2) + 1  # odd or even
        s.setcounts( counts )

    def median( s ):
        """ step up or down until sum cnt to m-1 < half <= sum to m """
        if s.summ - s.cnt[s.m] < s.half <= s.summ:
            return s.m
        j, sumj = s.m, s.summ
        if sumj <= s.half:
            while j < s.nlevel - 1:
                j += 1
                sumj += s.cnt[j]
                # print "j sumj:", j, sumj
                if sumj - s.cnt[j] < s.half <= sumj:  break
        else:
            while j > 0:
                sumj -= s.cnt[j]
                j -= 1
                # print "j sumj:", j, sumj
                if sumj - s.cnt[j] < s.half <= sumj:  break
        s.m, s.summ = j, sumj
        return s.m

    def addsub( s, add, sub ):
        s.cnt[add] += 1
        s.cnt[sub] -= 1
        assert s.cnt[sub] >= 0, (add, sub)
        if add <= s.m:
            s.summ += 1
        if sub <= s.m:
            s.summ -= 1

    def setcounts( s, counts ):
        assert len(counts) <= s.nlevel, (len(counts), s.nlevel)
        if len(counts) < s.nlevel:
            counts = pad0__( counts, s.nlevel )  # numpy array / list
        sumcounts = sum(counts)
        assert sumcounts == s.window, (sumcounts, s.window)
        s.cnt = counts
        s.slowmedian()

    def slowmedian( s ):
        j, sumj = -1, 0
        while sumj < s.half:
            j += 1
            sumj += s.cnt[j]
        s.m, s.summ = j, sumj

    def __str__( s ):
        return ("median %d: " % s.m) + \
            "".join([ (" ." if c == 0 else "%2d" % c) for c in s.cnt ])

#...............................................................................
def medianfilter( x, window, nlevel=256 ):
    """ moving medians, y[j] = median( x[j:j+window] )
        -> a shorter list, len(y) = len(x) - window + 1
    """
    assert len(x) >= window, (len(x), window)
    # np.clip( x, 0, nlevel-1, out=x )
        # cf http://scipy.org/Cookbook/Rebinning
    cnt = np.bincount( x[0:window] )
    med = Median1( nlevel=nlevel, window=window, counts=cnt )
    y = (len(x) - window + 1) * [0]
    y[0] = med.median()
    for j in xrange( len(x) - window ):
        med.addsub( x[j+window], x[j] )
        y[j+1] = med.median()
    return y  # list
    # return np.array( y )

def pad0__( x, tolen ):
    """ pad x with 0 s, numpy array or list """
    n = tolen - len(x)
    if n > 0:
        try:
            x = np.r_[ x, np.zeros( n, dtype=x[0].dtype )]
        except NameError:
            x += n * [0]
    return x

#...............................................................................
if __name__ == "__main__":
    Len = 10000
    window = 3
    nlevel = 256
    period = 100

    np.set_printoptions( 2, threshold=100, edgeitems=10 )
    # print medians( np.arange(3), 3 )

    sinwave = (np.sin( 2 * np.pi * np.arange(Len) / period )
        + 1) * (nlevel-1) / 2
    x = np.asarray( sinwave, int )
    print "x:", x
    for window in ( 3, 31, 63, 127, 255 ):
        if window > Len:  continue
        print "medianfilter: Len=%d window=%d nlevel=%d:" % (Len, window, nlevel)
            y = medianfilter( x, window=window, nlevel=nlevel )
        print np.array( y )

# end median1.py

答案 4 :(得分:8)

我使用这个增量中位数估算器:

median += eta * sgn(sample - median)

与更常见的平均估计量具有相同的形式:

mean += eta * (sample - mean)

此处 eta 是一个较小的学习率参数(例如0.001),sgn()是返回{-1, 0, 1}之一的signum函数。 (如果数据是非平稳的并且您想跟踪随时间的变化,请使用常量eta;否则,对于固定来源,请使用类似eta = 1 / n的内容进行收敛,其中n为到目前为止看到的样本数量。)

另外,我修改了中值估计量,使其适用于任意分位数。通常,quantile function会告诉您将数据划分为两个分数的值:p1 - p。以下内容逐步估算此值:

quantile += eta * (sgn(sample - quantile) + 2.0 * p - 1.0)

p应在[0, 1]之内。这实质上将sgn()函数的对称输出{-1, 0, 1}向一侧倾斜,将数据样本划分为两个不等大小的二进制位(数据的分数p1 - p为分别小于/大于分位数估计值)。请注意,对于p = 0.5,这会降低到中位数估算值。

答案 5 :(得分:4)

通过维护两个数字分区可以找到滚动中位数。

对于维护分区,请使用Min Heap和Max Heap。

Max Heap将包含小于等于中位数的数字。

Min Heap将包含大于等于中位数的数字。

平衡约束:    如果元素的总数是偶数,那么两个堆应该具有相同的元素。

如果元素的总数是奇数,那么Max Heap将比Min Heap多一个元素。

中位元素:如果两个分区的元素数量相等,那么中位数将是第一个分区的最大元素和第二个分区的最小元素之和的一半。

否则中位数将是第一个分区的最大元素。

Algorithm-
1- Take two Heap(1 Min Heap and 1 Max Heap)
   Max Heap will contain first half number of elements
   Min Heap will contain second half number of elements

2- Compare new number from stream with top of Max Heap, 
   if it is smaller or equal add that number in max heap. 
   Otherwise add number in Min Heap.

3- if min Heap has more elements than Max Heap 
   then remove top element of Min Heap and add in Max Heap.
   if max Heap has more than one element than in Min Heap 
   then remove top element of Max Heap and add in Min Heap.

4- If Both heaps has equal number of elements then
   median will be half of sum of max element from Max Heap and min element from Min Heap.
   Otherwise median will be max element from the first partition.
private int rotate;
private int width;
private int height;

private int color = Color.rgb(226, 255, 243);

private Paint paint = new Paint();

public Player(){
    rotate = 0;
    width = Constants.width / 10;
    height = Constants.height / 50;
    x = Constants.width / 2 - width / 2;
    y = Constants.height / 2 + Constants.height / 8;
}

public void setRotate(int rotate){
    this.rotate = rotate;
}
public int getRotate(){return rotate;}

public int getX(){return x;}
public int getY(){return y;}
public int getWidth(){return width;}
public int getHeight(){return height;}

public void update(){
    if(rotate >= 360){
        rotate = 0;
    }
}

public void draw(Canvas canvas){
    paint.setColor(color);

    canvas.save();
    canvas.rotate(rotate, Constants.width / 2, Constants.height / 2);
    canvas.drawRect(x, y, x + width, y + height, paint);
    canvas.restore();
}

答案 6 :(得分:1)

如果您能够将值作为时间点的函数进行参考,则可以使用替换值对值进行采样,应用bootstrapping以在置信区间内生成自举中值。这可以让您比不断将输入值排序到数据结构中更有效地计算近似中值。

答案 7 :(得分:1)

对于那些需要在Java中运行中位数的人... PriorityQueue是你的朋友。 O(log N)插入,O(1)电流中值和O(N)去除。如果您知道数据的分布,那么您可以做得比这更好。

public class RunningMedian {
  // Two priority queues, one of reversed order.
  PriorityQueue<Integer> lower = new PriorityQueue<Integer>(10,
          new Comparator<Integer>() {
              public int compare(Integer arg0, Integer arg1) {
                  return (arg0 < arg1) ? 1 : arg0 == arg1 ? 0 : -1;
              }
          }), higher = new PriorityQueue<Integer>();

  public void insert(Integer n) {
      if (lower.isEmpty() && higher.isEmpty())
          lower.add(n);
      else {
          if (n <= lower.peek())
              lower.add(n);
          else
              higher.add(n);
          rebalance();
      }
  }

  void rebalance() {
      if (lower.size() < higher.size() - 1)
          lower.add(higher.remove());
      else if (higher.size() < lower.size() - 1)
          higher.add(lower.remove());
  }

  public Integer getMedian() {
      if (lower.isEmpty() && higher.isEmpty())
          return null;
      else if (lower.size() == higher.size())
          return (lower.peek() + higher.peek()) / 2;
      else
          return (lower.size() < higher.size()) ? higher.peek() : lower
                  .peek();
  }

  public void remove(Integer n) {
      if (lower.remove(n) || higher.remove(n))
          rebalance();
  }
}

答案 8 :(得分:1)

可能值得指出的是,有一个特殊情况有一个简单的精确解决方案:当流中的所有值都是(相对)小的定义范围内的整数时。例如,假设它们必须都位于0和1023之间。在这种情况下,只需定义一个包含1024个元素和一个计数的数组,并清除所有这些值。对于流中的每个值,增加相应的bin和计数。在流结束之后找到包含count / 2最高值的bin - 通过从0开始添加连续的bin很容易实现。使用相同的方法可以找到任意排序的值。 (如果检测到纸箱饱和度并且&#34;升级&#34;在运行期间将存储箱的大小设置为更大的类型,则会出现轻微的复杂情况。)

这种特殊情况可能看似人为,但在实践中它很常见。它也可以作为实数的近似值,如果它们位于一个范围内并且足够好&#34;精度水平是已知的。这将适用于一组真实世界中的任何一组测量。对象。例如,一群人的身高或体重。设置不够大?它对于地球上所有(个体)细菌的长度或重量也同样有效 - 假设有人可以提供数据!

看起来我误读了原作 - 看起来它想要一个滑动窗口中位数而不是一个非常长的流的中位数。这种方法仍然有效。加载初始窗口的前N个流值,然后对于第N + 1个流值增加相应的bin,同时递减对应于第0个流值的bin。在这种情况下,有必要保留最后的N个值以允许递减,这可以通过循环寻址大小为N的数组来有效地完成。由于中位数的位置只能改变-2,-1,0,1 ,2,在滑动窗口的每一步中,不需要将每个步骤中的所有二进制数加到中位数,只需调整&#34;中值指针&#34;取决于哪些侧箱被修改。例如,如果新值和被删除的值都低于当前中值,则它不会改变(偏移= 0)。当N变得太大而无法在内存中方便地保存时,该方法会崩溃。

答案 9 :(得分:0)

这是一个可以在精确输出不重要时使用的(用于显示目的等) 你需要totalcount和lastmedian,以及newvalue。

{
totalcount++;
newmedian=lastmedian+(newvalue>lastmedian?1:-1)*(lastmedian==0?newvalue: lastmedian/totalcount*2);
}

为page_display_time等内容生成非常精确的结果。

规则:输入流需要在页面显示时间的顺序上平滑,计数大(> 30等),并且具有非零中位数。

实施例: 页面加载时间,800项,10ms ... 3000ms,平均90ms,实际中位数:11ms

在输入30后,中位数误差通常<= 20%(9ms..12ms),并且越来越少。 输入800后,错误为+ -2%。

另一位有类似解决方案的思想家在这里:Median Filter Super efficient implementation

答案 10 :(得分:-1)

这是java实现

package MedianOfIntegerStream;

import java.util.Comparator;
import java.util.HashSet;
import java.util.Iterator;
import java.util.Set;
import java.util.TreeSet;


public class MedianOfIntegerStream {

    public Set<Integer> rightMinSet;
    public Set<Integer> leftMaxSet;
    public int numOfElements;

    public MedianOfIntegerStream() {
        rightMinSet = new TreeSet<Integer>();
        leftMaxSet = new TreeSet<Integer>(new DescendingComparator());
        numOfElements = 0;
    }

    public void addNumberToStream(Integer num) {
        leftMaxSet.add(num);

        Iterator<Integer> iterMax = leftMaxSet.iterator();
        Iterator<Integer> iterMin = rightMinSet.iterator();
        int maxEl = iterMax.next();
        int minEl = 0;
        if (iterMin.hasNext()) {
            minEl = iterMin.next();
        }

        if (numOfElements % 2 == 0) {
            if (numOfElements == 0) {
                numOfElements++;
                return;
            } else if (maxEl > minEl) {
                iterMax.remove();

                if (minEl != 0) {
                    iterMin.remove();
                }
                leftMaxSet.add(minEl);
                rightMinSet.add(maxEl);
            }
        } else {

            if (maxEl != 0) {
                iterMax.remove();
            }

            rightMinSet.add(maxEl);
        }
        numOfElements++;
    }

    public Double getMedian() {
        if (numOfElements % 2 != 0)
            return new Double(leftMaxSet.iterator().next());
        else
            return (leftMaxSet.iterator().next() + rightMinSet.iterator().next()) / 2.0;
    }

    private class DescendingComparator implements Comparator<Integer> {
        @Override
        public int compare(Integer o1, Integer o2) {
            return o2 - o1;
        }
    }

    public static void main(String[] args) {
        MedianOfIntegerStream streamMedian = new MedianOfIntegerStream();

        streamMedian.addNumberToStream(1);
        System.out.println(streamMedian.getMedian()); // should be 1

        streamMedian.addNumberToStream(5);
        streamMedian.addNumberToStream(10);
        streamMedian.addNumberToStream(12);
        streamMedian.addNumberToStream(2);
        System.out.println(streamMedian.getMedian()); // should be 5

        streamMedian.addNumberToStream(3);
        streamMedian.addNumberToStream(8);
        streamMedian.addNumberToStream(9);
        System.out.println(streamMedian.getMedian()); // should be 6.5
    }
}

答案 11 :(得分:-1)

基于@mathog的思想,这是C#实现,用于在具有已知值范围的字节数组上运行中值。可以扩展为其他整数类型。

  /// <summary>
  /// Median estimation by histogram, avoids multiple sorting operations for a running median
  /// </summary>
  public class MedianEstimator
  {
    private readonly int m_size2;
    private readonly byte[] m_counts;

    /// <summary>
    /// Estimated median, available right after calling <see cref="Init"/> or <see cref="Update"/>.
    /// </summary>
    public byte Median { get; private set; }

    /// <summary>
    /// Ctor
    /// </summary>
    /// <param name="size">Median size in samples</param>
    /// <param name="maxValue">Maximum expected value in input data</param>
    public MedianEstimator(
      int size,
      byte maxValue)
    {
      m_size2 = size / 2;
      m_counts = new byte[maxValue + 1];
    }

    /// <summary>
    /// Initializes the internal histogram with the passed sample values
    /// </summary>
    /// <param name="values">Array of values, usually the start of the array for a running median</param>
    public void Init(byte[] values)
    {
      for (var i = 0; i < values.Length; i++)
        m_counts[values[i]]++;

      UpdateMedian();
    }

    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    private void UpdateMedian()
    {
      // The median is the first value up to which counts add to size / 2
      var sum = 0;
      Median = 0;
      for (var i = 0; i < m_counts.Length; i++)
      {
        sum += m_counts[i];
        Median = (byte) i;
        if (sum > m_size2) break;
      }
    }

    /// <summary>
    /// Updates the median estimation by removing <paramref name="last"/> and adding <paramref name="next"/>. These
    /// values must be updated as the running median is applied. If the median length is <i>N</i>, at the sample
    /// <i>i</i>, <paramref name="last"/> is sample at index <i>i</i>-<i>N</i>/2 and <paramref name="next"/> is sample
    /// at index <i>i</i>+<i>N</i>/2+1.
    /// </summary>
    /// <param name="last">Sample at the start of the moving window that is to be removed</param>
    /// <param name="next">Sample at the end of the moving window + 1 that is to be added</param>
    public void Update(byte last, byte next)
    {
      m_counts[last]--;
      m_counts[next]++;

      // The conditions below do not change median value so there is no need to update it
      if (last == next ||
          last < Median && next < Median || // both below median
          last > Median && next > Median) // both above median
        return;

      UpdateMedian();
    }

针对正在运行的中位数进行计时,并进行计时:

private void TestMedianEstimator()
{
  var r = new Random();
  const int SIZE = 15;
  const byte MAX_VAL = 80;
  var values = new byte[100000];
  for (var i = 0; i < values.Length; i++)
    values[i] = (byte) (MAX_VAL * r.NextDouble());

  var timer = Stopwatch.StartNew();
  // Running median
  var window = new byte[2 * SIZE + 1];
  var medians = new byte[values.Length];
  for (var i = SIZE; i < values.Length - SIZE - 1; i++)
  {
    for (int j = i - SIZE, k = 0; j <= i + SIZE; j++, k++)
      window[k] = values[j];
    Array.Sort(window);
    medians[i] = window[SIZE];
  }
  timer.Stop();
  var elapsed1 = timer.Elapsed;

  timer.Restart();
  var me = new MedianEstimator(2 * SIZE + 1, MAX_VAL);
  me.Init(values.Slice(0, 2 * SIZE + 1));
  var meMedians = new byte[values.Length];
  for (var i = SIZE; i < values.Length - SIZE - 1; i++)
  {
    meMedians[i] = me.Median;
    me.Update(values[i - SIZE], values[i + SIZE + 1]);
  }
  timer.Stop();
  var elapsed2 = timer.Elapsed;

  WriteLineToLog($"{elapsed1.TotalMilliseconds / elapsed2.TotalMilliseconds:0.00}");

  var diff = 0;
  for (var i = 0; i < meMedians.Length; i++)
    diff += Math.Abs(meMedians[i] - medians[i]);
  WriteLineToLog($"Diff: {diff}");
}

答案 12 :(得分:-4)

如果你只需要平滑的平均值,快速/简单的方法是将最新值乘以x,将平均值乘以(1-x)然后加上它们。这将成为新的平均值。

编辑:不是用户要求的,而不是统计上有效的,但足以满足大量用途 我会把它留在这里(尽管有支票),但我会把它留在这里搜索!