Statistical accumulators in Python

Date: 2010-09-22 23:15:56

Tags: python oop statistics accumulator

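A statistical accumulator lets you perform incremental calculations. For instance, to compute the arithmetic mean of a stream of numbers arriving at arbitrary times, you could make an object that keeps track of the number of items given so far, n, and their sum, sum. When the mean is requested, the object simply returns sum/n.

An accumulator like this lets you compute incrementally: when given a new number, you do not need to recompute the entire sum and count.

Similar accumulators can be written for other statistics (see the Boost library for C++ implementations).

How would you implement accumulators in Python? The code I came up with is: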

class Accumulator(object):
    """
    Used to accumulate the arithmetic mean of a stream of
    numbers. This implementation does not allow removing items
    already accumulated, but it could easily be modified to do
    so. Also, other statistics could be accumulated.
    """
    def __init__(self):
        # upon initialization, the number of items currently
        # accumulated (_n) and the total sum of the items accumulated
        # (_sum) are set to zero because nothing has been accumulated
        # yet.
        self._n = 0
        self._sum = 0.0

    def add(self, item):
        # 'add' is used to add an item to this accumulator
        try:
            # try to convert the item to a float. If that
            # succeeds, add the float to the current sum and
            # increase the number of accumulated items
            self._sum += float(item)
            self._n += 1
        except (TypeError, ValueError):
            # if the item cannot be converted to a float (float()
            # raises ValueError for 'NA' and TypeError for None),
            # simply ignore the exception and do nothing
            pass

    @property
    def mean(self):
        # the property 'mean' returns the current mean accumulated in
        # the object
        if self._n > 0:
            # if more than zero items have been accumulated, return
            # their arithmetic average
            return self._sum / self._n
        else:
            # if no items have been accumulated, return None (you could
            # also raise an exception)
            return None

# using the object:

# Create an instance of the class "Accumulator"
my_accumulator = Accumulator()
print(my_accumulator.mean)
# prints None because there are no items accumulated

# add one (a number)
my_accumulator.add(1)
print(my_accumulator.mean)
# prints 1.0

# add two (a string - it will be converted to a float)
my_accumulator.add('2')
print(my_accumulator.mean)
# prints 1.5

# add a 'NA' (will be ignored because it cannot be converted to float)
my_accumulator.add('NA')
print(my_accumulator.mean)
# prints 1.5 (notice that it ignored the 'NA')
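
The same pattern should carry over to other statistics. As a rough sketch only, here is what a running-variance accumulator might look like using Welford's online algorithm. The class name VarianceAccumulator and its attributes are invented for this illustration and are not taken from Boost or any other library.

```python
class VarianceAccumulator(object):
    """Running mean and population variance (Welford's online algorithm)."""

    def __init__(self):
        self._n = 0        # number of items accumulated so far
        self._mean = 0.0   # running mean
        self._m2 = 0.0     # running sum of squared deviations from the mean

    def add(self, item):
        x = float(item)
        self._n += 1
        delta = x - self._mean
        self._mean += delta / self._n
        # uses the deviation from the old mean and from the updated mean
        self._m2 += delta * (x - self._mean)

    @property
    def mean(self):
        return self._mean if self._n > 0 else None

    @property
    def variance(self):
        # population variance; None until at least one item was added
        return self._m2 / self._n if self._n > 0 else None


acc = VarianceAccumulator()
for value in [2, 4, 4, 4, 5, 5, 7, 9]:
    acc.add(value)
print("mean=%s variance=%s" % (acc.mean, acc.variance))
```

As with the mean, each new item costs a constant amount of work and nothing has to be recomputed from the start of the stream.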

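Some interesting design questions arise:

  1. How can the accumulator be made thread-safe?
  2. How can items be removed safely?
  3. How can this be architected so that other statistics can be plugged in easily (a factory for statistics)?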

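2 Answers:

Answer 0 (score: 3)

For a generic, thread-safe higher-level function, you might use something like the following in conjunction with the Queue.Queue class and some other bits: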

from Queue import Empty

def Accumulator(f, q, storage):
    """Yields successive values of `f` over the accumulation of `q`.

    `f` should take a single iterable as its parameter.

    `q` is a Queue.Queue or derivative.

    `storage` is a persistent sequence that provides an `append` method.
    `collections.deque` may be particularly useful, but a `list` is quite acceptable.

    >>> from Queue import Queue
    >>> from collections import deque
    >>> from threading import Thread
    >>> def mean(it):
    ...     vals = tuple(it)
    ...     return sum(vals) / len(vals)
    >>> value_queue = Queue()
    >>> LastThreeAverage = Accumulator(mean, value_queue, deque((), 3))
    >>> def add_to_queue(it, queue):
    ...     for value in it:
    ...         queue.put(value)
    >>> putting_thread = Thread(target=add_to_queue,
    ...                         args=(range(0, 12, 2), value_queue))
    >>> putting_thread.start()
    >>> list(LastThreeAverage)
    [0, 1, 2, 4, 6, 8]
    """
    try:
        while True:
            storage.append(q.get(timeout=0.1))
            q.task_done()
            yield f(storage)
    except Empty:
        pass

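This generator function evades most of its purported responsibility by delegating it to other entities:

  • It relies on Queue.Queue to supply its source elements in a thread-safe manner.
  • A collections.deque object can be passed in as the value of the storage parameter; among other things, this provides a convenient way to use only the last n (in this case 3) values.
  • The function itself (in this case mean) is passed as a parameter. In some cases this results in less-than-optimally efficient code, but it is readily applied to all sorts of situations.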
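
To make the effect of the storage choice concrete, here is a small single-threaded sketch (no queue involved) that feeds the same values into a bounded deque and into a plain list. The variable names are made up for the illustration, and mean divides by a float so the result is the same on Python 2 and 3.

```python
from collections import deque


def mean(values):
    vals = tuple(values)
    return sum(vals) / float(len(vals))


window = deque((), 3)   # keeps only the three most recent values
history = []            # keeps every value ever appended

window_means = []
history_means = []
for value in range(0, 12, 2):
    window.append(value)    # once full, the oldest value is dropped
    history.append(value)
    window_means.append(mean(window))
    history_means.append(mean(history))

print(window_means)    # [0.0, 1.0, 2.0, 4.0, 6.0, 8.0]
print(history_means)   # [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
```

The deque turns the statistic into a moving average over the last three values, while the list gives the mean of the whole stream.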

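Note that the accumulator may time out if your producer thread takes longer than 0.1 seconds per value. This is easily remedied by passing a longer timeout or by removing the timeout parameter entirely. In the latter case the function will block indefinitely at the end of the queue; this usage makes more sense when it runs in a sub-thread (usually a daemon thread). Of course, you can also parameterize the arguments passed to q.get as a fourth argument to Accumulator.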
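
A possible shape for that parameterization is sketched below. The function name accumulate_with_timeout and the timeout keyword are hypothetical, and the import is written so that it works with either module name (queue on Python 3, Queue on Python 2).

```python
try:
    from queue import Queue, Empty   # Python 3 module name
except ImportError:
    from Queue import Queue, Empty   # Python 2 module name


def accumulate_with_timeout(f, q, storage, timeout=0.1):
    """Like Accumulator, but the wait per item is configurable.

    timeout=None makes q.get block until an item arrives, so the
    generator then never finishes on its own.
    """
    try:
        while True:
            storage.append(q.get(timeout=timeout))
            q.task_done()
            yield f(storage)
    except Empty:
        # nothing arrived within `timeout` seconds: stop iterating
        pass


q = Queue()
for value in (1, 2, 3):
    q.put(value)

results = list(accumulate_with_timeout(sum, q, [], timeout=0.5))
print(results)   # [1, 3, 6]
```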

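If you want to communicate the end of the queue, i.e. that no more values will come, from the producer thread (here putting_thread), you can pass and check for a sentinel value or use some other method. There is more information in this thread; I opted to write a subclass of Queue.Queue called CloseableQueue that provides a close method.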
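
The sentinel variant could look roughly like the following. This is only a sketch of the sentinel approach, not of CloseableQueue; the names _DONE, accumulate_until_done and produce are invented for the example.

```python
import threading

try:
    from queue import Queue   # Python 3 module name
except ImportError:
    from Queue import Queue   # Python 2 module name

_DONE = object()   # hypothetical sentinel: "no more values will come"


def accumulate_until_done(f, q, storage):
    """Yields f(storage) for each queued value until the sentinel arrives."""
    while True:
        item = q.get()       # blocks; no timeout is needed here
        q.task_done()
        if item is _DONE:
            return
        storage.append(item)
        yield f(storage)


def produce(values, q):
    for value in values:
        q.put(value)
    q.put(_DONE)   # signal the end of the stream to the consumer


q = Queue()
producer = threading.Thread(target=produce, args=(range(0, 12, 2), q))
producer.start()
results = list(accumulate_until_done(max, q, []))
producer.join()
print(results)   # [0, 2, 4, 6, 8, 10]
```

Because the consumer stops on the sentinel rather than on a timeout, a slow producer no longer ends the accumulation early.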

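You can customize the behaviour of such a function in various other ways, for example by limiting the queue size; this is just one example of usage.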

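Edit

As mentioned above, this loses some efficiency because of the necessary recalculation, and, I think, does not really answer your question.

A generator function can also accept values through its send method. So you can write a mean generator function like this:
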
def meangen():
    """Yields the accumulated mean of sent values.

    >>> g = meangen()
    >>> g.send(None) # Initialize the generator
    >>> g.send(4)
    4.0
    >>> g.send(10)
    7.0
    >>> g.send(-2)
    4.0
    """
    total = yield(None)  # named `total` to avoid shadowing the builtin sum
    count = 1
    while True:
        total += yield(total / float(count))
        count += 1

Here the yield expression both brings the value (the argument to send) into the function and passes the computed value out as the return value of send.
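
The same two-way use of yield works for other running statistics. As a sketch, here is a running-maximum generator; maxgen is a made-up name, and next(g) is used to prime the generator, which has the same effect as g.send(None).

```python
def maxgen():
    """Yields the largest value sent so far."""
    largest = yield              # the first send() delivers the first value
    while True:
        value = yield largest    # hand back the current maximum, wait for more
        if value > largest:
            largest = value


g = maxgen()
next(g)             # prime the generator: run it up to the first yield
print(g.send(3))    # 3
print(g.send(1))    # 3
print(g.send(7))    # 7
```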

You can pass the generator returned by a call to meangen to a more optimizable accumulator generator function, like this:

def EfficientAccumulator(g, q):
    """Similar to Accumulator but sends values to a generator `g`.

    >>> from Queue import Queue
    >>> from threading import Thread
    >>> value_queue = Queue()
    >>> g = meangen()
    >>> g.send(None)
    >>> mean_accumulator = EfficientAccumulator(g, value_queue)
    >>> def add_to_queue(it, queue):
    ...     for value in it:
    ...         queue.put(value)
    >>> putting_thread = Thread(target=add_to_queue,
    ...                         args=(range(0, 12, 2), value_queue))
    >>> putting_thread.start()
    >>> list(mean_accumulator)
    [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
    """
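    try:
        while True:
            value = g.send(q.get(timeout=0.1))
            # mark the item as processed before handing control back,
            # so it is counted even if the consumer stops iterating
            q.task_done()
            yield value
    except Empty:
        pass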

Answer 1 (score: 1)

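If I were doing this in Python, there are two things I would do differently:

  1. Separate out the functionality of each accumulator.
  2. Not use @property in any way.

For the first, I would probably want to come up with an API for performing an accumulation, perhaps something like: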

def add(self, num): pass    # add a number
def compute(self): pass     # compute the value of the accumulator
    

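I would then create an AccumulatorRegistry (the Accumulators class in the code below) to hold these accumulators and to let the user call operations and add to all of them at once. The code might look like: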

class Accumulators(object):
    # maps a registered name to an accumulator class
    _accumulator_library = {}

    def __init__(self):
        # create a fresh instance of every registered accumulator
        self.accumulator_library = {}
        for key, value in Accumulators._accumulator_library.items():
            self.accumulator_library[key] = value()

    @staticmethod
    def register(name, accumulator):
        Accumulators._accumulator_library[name] = accumulator

    def add(self, num):
        # feed the number to every accumulator
        for accumulator in self.accumulator_library.values():
            accumulator.add(num)

    def compute(self, name):
        return self.accumulator_library[name].compute()

    @staticmethod
    def register_decorator(name):
        def _inner(cls):
            Accumulators.register(name, cls)
            return cls
        # return the decorator itself, otherwise the class becomes None
        return _inner


@Accumulators.register_decorator("Mean")
class Mean(object):
    def __init__(self):
        self.total = 0
        self.count = 0

    def add(self, num):
        self.count += 1
        self.total += num

    def compute(self):
        return self.total / float(self.count)


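I should probably speak to your thread-safety question. Python's GIL protects you from many threading issues. There are, though, a few things you can do to protect yourself:

  • If these objects are localized to one thread, use threading.local.
  • If not, you can wrap the operations in a lock, using the with context syntax to handle holding the lock for you.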
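
A minimal sketch of the lock variant follows. Updating total and count takes two separate statements, and the GIL does not make that pair a single atomic step, so the lock keeps a reader from seeing one updated without the other. LockedMean and worker are names invented for this example.

```python
import threading


class LockedMean(object):
    """Mean accumulator whose state is guarded by a lock."""

    def __init__(self):
        self._lock = threading.Lock()
        self.total = 0.0
        self.count = 0

    def add(self, num):
        # `with` acquires the lock and releases it even on an exception
        with self._lock:
            self.total += num
            self.count += 1

    def compute(self):
        with self._lock:
            return self.total / self.count if self.count else None


shared = LockedMean()


def worker():
    for value in range(1000):
        shared.add(value)


threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print("count=%d mean=%s" % (shared.count, shared.compute()))
# count=4000 mean=499.5
```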