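Statistical accumulators allow one to perform incremental calculations. For instance, to compute the arithmetic mean of a stream of numbers given at arbitrary times, one could make an object that keeps track of the current number of items given, n, and their sum, sum. When the mean is requested, the object simply returns sum/n.
An accumulator like this lets you compute incrementally: when given a new number, you don't need to recompute the entire sum and count.
Similar accumulators can be written for other statistics (see the boost library for a C++ implementation).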
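As an illustration of an accumulator for another statistic, here is a minimal sketch that tracks the sample variance incrementally. It uses Welford's online algorithm, which is my choice for the example and not something the boost library reference prescribes; the class name VarianceAccumulator and its attributes are hypothetical.

```python
class VarianceAccumulator(object):
    """Hypothetical accumulator for the mean and sample variance of a stream.

    Uses Welford's online algorithm, so no past items need to be stored.
    """

    def __init__(self):
        self._n = 0        # number of items accumulated so far
        self._mean = 0.0   # running mean
        self._m2 = 0.0     # running sum of squared deviations from the mean

    def add(self, item):
        x = float(item)
        self._n += 1
        delta = x - self._mean
        self._mean += delta / self._n
        # note: the second factor uses the *updated* mean
        self._m2 += delta * (x - self._mean)

    @property
    def mean(self):
        return self._mean if self._n > 0 else None

    @property
    def variance(self):
        # the sample variance is only defined for two or more items
        return self._m2 / (self._n - 1) if self._n > 1 else None


acc = VarianceAccumulator()
for value in (2, 4, 4, 4, 5, 5, 7, 9):
    acc.add(value)
```

Each call to add does a constant amount of work, just like the sum/n version for the mean.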
How would you implement accumulators in Python? The code I came up with is:
class Accumulator(object):
    """
    Used to accumulate the arithmetic mean of a stream of
    numbers. This implementation does not allow removing items
    already accumulated, but it could easily be modified to do
    so. Also, other statistics could be accumulated.
    """
    def __init__(self):
        # upon initialization, the number of items currently
        # accumulated (_n) and the total sum of the items accumulated
        # (_sum) are set to zero because nothing has been accumulated
        # yet.
        self._n = 0
        self._sum = 0.0

    def add(self, item):
        # 'add' is used to add an item to this accumulator
        try:
            # try to convert the item to a float. If you are
            # successful, add the float to the current sum and
            # increase the number of accumulated items
            self._sum += float(item)
            self._n += 1
        except ValueError:
            # if you fail to convert the item to a float, simply
            # ignore the exception (pass on it and do nothing)
            pass

    @property
    def mean(self):
        # the property 'mean' returns the current mean accumulated in
        # the object
        if self._n > 0:
            # if you have more than zero items accumulated, then return
            # their arithmetic average
            return self._sum / self._n
        else:
            # if you have no items accumulated, return None (you could
            # also raise an exception)
            return None

# using the object:
# Create an instance of the object "Accumulator"
my_accumulator = Accumulator()
print(my_accumulator.mean)
# prints None because there are no items accumulated
# add one (a number)
my_accumulator.add(1)
print(my_accumulator.mean)
# prints 1.0
# add two (a string - it will be converted to a float)
my_accumulator.add('2')
print(my_accumulator.mean)
# prints 1.5
# add a 'NA' (will be ignored because it cannot be converted to float)
my_accumulator.add('NA')
print(my_accumulator.mean)
# prints 1.5 (notice that it ignored the 'NA')
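The docstring mentions that removal could be added easily. A minimal sketch of that modification might look like the following; the class name RemovableAccumulator and the remove method are hypothetical, and the sketch assumes the caller only removes values that were actually added earlier.

```python
class RemovableAccumulator(object):
    """Hypothetical variant of the mean accumulator that supports removal."""

    def __init__(self):
        self._n = 0
        self._sum = 0.0

    def add(self, item):
        self._sum += float(item)
        self._n += 1

    def remove(self, item):
        # assumption: the caller only removes items that were added before;
        # the accumulator stores no items, so it cannot verify this
        if self._n == 0:
            raise ValueError("nothing has been accumulated yet")
        self._sum -= float(item)
        self._n -= 1

    @property
    def mean(self):
        return self._sum / self._n if self._n > 0 else None


acc = RemovableAccumulator()
for value in (1, 2, 6):
    acc.add(value)
acc.remove(6)
```

Because floating-point subtraction is not exact, repeated add and remove calls may let small rounding errors build up in the stored sum.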
Interesting design questions arise:
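Answer 0 (score: 3)
For a generic, thread-safe higher-level function, you could use something like the following in combination with the Queue.Queue class and a few other bits: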
from Queue import Empty


def Accumulator(f, q, storage):
    """Yields successive values of `f` over the accumulation of `q`.

    `f` should take a single iterable as its parameter.
    `q` is a Queue.Queue or derivative.
    `storage` is a persistent sequence that provides an `append` method.
    `collections.deque` may be particularly useful, but a `list` is quite acceptable.

    >>> from Queue import Queue
    >>> from collections import deque
    >>> from threading import Thread
    >>> def mean(it):
    ...     vals = tuple(it)
    ...     return sum(vals) / len(vals)
    >>> value_queue = Queue()
    >>> LastThreeAverage = Accumulator(mean, value_queue, deque((), 3))
    >>> def add_to_queue(it, queue):
    ...     for value in it:
    ...         queue.put(value)
    >>> putting_thread = Thread(target=add_to_queue,
    ...                         args=(range(0, 12, 2), value_queue))
    >>> putting_thread.start()
    >>> list(LastThreeAverage)
    [0, 1, 2, 4, 6, 8]
    """
    try:
        while True:
            # block for up to 0.1 s waiting for the next value
            storage.append(q.get(timeout=0.1))
            q.task_done()
            yield f(storage)
    except Empty:
        # no value arrived within the timeout: end the generator
        pass
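This generator function evades most of its purported responsibility by delegating it to other entities:
- It relies on a Queue.Queue to supply its source elements in a thread-safe manner.
- A collections.deque object can be passed in as the value of the storage parameter; among other things, this provides a convenient way to use only the last n (in this case 3) values.
- The function itself (in this case mean) is passed as a parameter. In some cases this results in less-than-optimally efficient code, but it is readily applied to all sorts of situations.
Note that the accumulator may time out if your producer thread takes longer than 0.1 seconds per value. This is easily remedied by passing a longer timeout or by removing the timeout parameter entirely. In the latter case the function will block indefinitely at the end of the queue; that usage makes more sense when the function runs in a sub-thread (usually a daemon thread). Of course, you can also parametrize the arguments passed to q.get as a fourth argument to Accumulator.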
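A minimal sketch of that last suggestion, written for Python 3 (where the module is named queue instead of Queue). The function name accumulate and the parameter name get_kwargs are hypothetical; the original answer does not specify what the fourth argument should look like.

```python
from collections import deque
from queue import Empty, Queue  # on Python 2 the module is named Queue


def accumulate(f, q, storage, get_kwargs=None):
    """Like Accumulator, but forwards `get_kwargs` to every `q.get` call."""
    if get_kwargs is None:
        get_kwargs = {"timeout": 0.1}  # same default as the original function
    try:
        while True:
            storage.append(q.get(**get_kwargs))
            q.task_done()
            yield f(storage)
    except Empty:
        # raised by q.get when the timeout expires with no value available
        pass


def mean(values):
    values = tuple(values)
    return sum(values) / float(len(values))


value_queue = Queue()
for value in range(0, 12, 2):
    value_queue.put(value)

# A longer timeout tolerates a slower producer.
results = list(accumulate(mean, value_queue, deque((), 3), {"timeout": 0.5}))
```

Passing {"block": True} with no timeout would give the block-forever behaviour described above, so that variant is only appropriate when something else ends the consuming thread.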
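If you want the producer thread (here putting_thread) to signal that the queue has ended, i.e. that there are no more values to come, you can pass and check for a sentinel value or use some other method. There is more information in this thread; I opted to write a subclass of Queue.Queue called CloseableQueue that provides a close method.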
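The sentinel approach can be sketched as follows, again for Python 3. This is not the CloseableQueue implementation mentioned above; the names _DONE, running_means and produce are hypothetical and only illustrate the idea.

```python
from queue import Queue  # on Python 2 the module is named Queue
from threading import Thread

_DONE = object()  # hypothetical sentinel: a unique object that marks the end


def running_means(q):
    """Yields the running mean of the values in `q` until the sentinel arrives."""
    total = 0.0
    count = 0
    while True:
        item = q.get()  # blocks until an item arrives; no timeout is needed
        q.task_done()
        if item is _DONE:
            return
        total += item
        count += 1
        yield total / count


def produce(values, q):
    for value in values:
        q.put(value)
    q.put(_DONE)  # tell the consumer that no more values will follow


value_queue = Queue()
producer = Thread(target=produce, args=(range(0, 12, 2), value_queue))
producer.start()
means = list(running_means(value_queue))
producer.join()
```

Comparing with `is` against a private object avoids clashing with any real value a producer might send, including None.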
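There are various other ways to customize the behaviour of a function like this, for example by limiting the queue size; this is just one example of usage.
As mentioned above, this loses some efficiency because of the need to recalculate and, I think, doesn't really answer your question.
A generator function can also accept values through its send method. So you can write a mean generator function like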
def meangen():
    """Yields the accumulated mean of sent values.

    >>> g = meangen()
    >>> g.send(None)  # Initialize the generator
    >>> g.send(4)
    4.0
    >>> g.send(10)
    7.0
    >>> g.send(-2)
    4.0
    """
    # `total` rather than `sum`, to avoid shadowing the built-in
    total = yield(None)
    count = 1
    while True:
        total += yield(total / float(count))
        count += 1
Here the yield expression both brings values (the arguments to send) into the function and passes the computed values out as the return value of send.
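The same send-based pattern carries over to other statistics, and the initial g.send(None) call can be hidden behind a small decorator. The sketch below is one way to do that; the names primed and maxgen are hypothetical helpers and not part of the answer's code.

```python
import functools


def primed(genfunc):
    """Hypothetical helper: advance a new generator to its first yield."""
    @functools.wraps(genfunc)
    def wrapper(*args, **kwargs):
        gen = genfunc(*args, **kwargs)
        next(gen)  # equivalent to gen.send(None)
        return gen
    return wrapper


@primed
def maxgen():
    """Yields the largest value sent so far."""
    largest = yield          # receives the first sent value
    while True:
        value = yield largest  # hand back the maximum, wait for the next value
        if value > largest:
            largest = value


g = maxgen()  # already primed, so send() can be called right away
seen = [g.send(3), g.send(1), g.send(7)]
```

After these calls seen holds the running maximum after each value, that is 3, 3 and 7.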
You can pass the generator returned by a call to that function to a more optimizable accumulator generator function like this:
def EfficientAccumulator(g, q):
    """Similar to Accumulator but sends values to a generator `g`.

    >>> from Queue import Queue
    >>> from threading import Thread
    >>> value_queue = Queue()
    >>> g = meangen()
    >>> g.send(None)
    >>> mean_accumulator = EfficientAccumulator(g, value_queue)
    >>> def add_to_queue(it, queue):
    ...     for value in it:
    ...         queue.put(value)
    >>> putting_thread = Thread(target=add_to_queue,
    ...                         args=(range(0, 12, 2), value_queue))
    >>> putting_thread.start()
    >>> list(mean_accumulator)
    [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
    """
    try:
        while True:
            # forward each queued value to `g` and yield what it sends back
            yield(g.send(q.get(timeout=0.1)))
            q.task_done()
    except Empty:
        # no value arrived within the timeout: end the generator
        pass
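Answer 1 (score: 1)
If I were doing this in Python, there are two things I would do:
For the first, I would probably want an API for performing the accumulation, maybe something like: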
def add(self, num):
    """Add a number."""
def compute(self):
    """Compute the value of the accumulator."""
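Then I would create an AccumulatorRegistry (the Accumulators class here) to hold onto these accumulators and let the user call operations on, and add to, all of them at once. The code might look like this: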
class Accumulators(object):
    # maps a name to an accumulator class; shared by all instances
    _accumulator_library = {}

    def __init__(self):
        # each instance gets its own fresh object of every registered class
        self.accumulator_library = {}
        for key, value in Accumulators._accumulator_library.items():
            self.accumulator_library[key] = value()

    @staticmethod
    def register(name, accumulator):
        Accumulators._accumulator_library[name] = accumulator

    def add(self, num):
        for accumulator in self.accumulator_library.values():
            accumulator.add(num)

    def compute(self, name):
        # return the result so that callers can actually use it
        return self.accumulator_library[name].compute()

    @staticmethod
    def register_decorator(name):
        def _inner(cls):
            Accumulators.register(name, cls)
            return cls
        # the decorator factory must return the decorator itself
        return _inner


@Accumulators.register_decorator("Mean")
class Mean(object):
    def __init__(self):
        self.total = 0
        self.count = 0

    def add(self, num):
        self.count += 1
        self.total += num

    def compute(self):
        return self.total / float(self.count)
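To show how the pieces fit together, here is a compact, self-contained variant of the registry with a short usage run. The Max accumulator is a hypothetical second statistic added for illustration, and the class is otherwise trimmed to what the example needs.

```python
class Accumulators(object):
    _accumulator_library = {}  # name -> accumulator class, shared by all instances

    def __init__(self):
        # each registry instance gets its own fresh accumulator objects
        self.accumulator_library = {
            name: cls() for name, cls in Accumulators._accumulator_library.items()
        }

    @staticmethod
    def register_decorator(name):
        def _inner(cls):
            Accumulators._accumulator_library[name] = cls
            return cls
        return _inner

    def add(self, num):
        for accumulator in self.accumulator_library.values():
            accumulator.add(num)

    def compute(self, name):
        return self.accumulator_library[name].compute()


@Accumulators.register_decorator("Mean")
class Mean(object):
    def __init__(self):
        self.total = 0
        self.count = 0

    def add(self, num):
        self.count += 1
        self.total += num

    def compute(self):
        return self.total / float(self.count)


@Accumulators.register_decorator("Max")  # hypothetical second statistic
class Max(object):
    def __init__(self):
        self.largest = None

    def add(self, num):
        if self.largest is None or num > self.largest:
            self.largest = num

    def compute(self):
        return self.largest


stats = Accumulators()
for value in (3, 1, 8):
    stats.add(value)
mean_value = stats.compute("Mean")
max_value = stats.compute("Max")
```

A single add call feeds every registered accumulator, so new statistics can be plugged in without touching the calling code. Note that only classes registered before an Accumulators instance is created are picked up by that instance.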
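I should probably speak to your thread-safety question. Python's GIL protects you from a lot of threading issues. Still, there are a few things you might do to protect yourself: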
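One common precaution, which may or may not be what this answer had in mind, is to guard the accumulator's state with a threading.Lock so that the count and the total are always updated together. The sketch below assumes that approach; the class name LockedMean is hypothetical.

```python
import threading


class LockedMean(object):
    """Hypothetical mean accumulator whose state is guarded by a lock."""

    def __init__(self):
        self._lock = threading.Lock()
        self._total = 0.0
        self._count = 0

    def add(self, num):
        with self._lock:  # update both fields as one step
            self._total += num
            self._count += 1

    def compute(self):
        with self._lock:  # read a consistent (total, count) pair
            return self._total / self._count if self._count else None


acc = LockedMean()


def worker():
    for _ in range(1000):
        acc.add(1)


workers = [threading.Thread(target=worker) for _ in range(4)]
for w in workers:
    w.start()
for w in workers:
    w.join()
```

Without the lock, an augmented assignment such as `self._count += 1` is a read-modify-write sequence that the GIL does not make atomic, so concurrent updates could be lost.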