Discarding tee'd elements

Date: 2014-08-26 14:21:31

Tags: python stream large-data

I want to "fork" a large data stream so that I can look at only a few of its elements.

I'd like to write something like this:

from itertools import tee
stream = # a generator of a very large data stream 

while True:
    try:
        element = next(stream)
        process_element(element)
        if some_condition(element):
            stream, fork = tee(stream)
            process_fork(fork)
    except StopIteration:
        break

Reading the documentation for tee, however, I believe that fork's internal deque would keep growing even after fork goes out of scope.

Is that the case? If so, is there a way to tell tee to "discard" the fork? Or is there another, more obvious way to do this?
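
To make the concern concrete, here is a small illustration of the buffering I am worried about (the numbers are arbitrary):

from itertools import tee

stream = iter(range(1000))
a, b = tee(stream)

for _ in range(100):  # advance only one of the two iterators
    next(a)

# everything consumed through `a` stays buffered so that `b`
# can still yield it from the beginning:
print(next(b))  # prints 0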

2 answers:

Answer 0 (score: 1):

You can avoid the implementation-dependent behavior @goncalopp describes by creating your own Tee class and giving it a discard() method:

import collections

class Tee(object):
    def __init__(self, iterable, n=2):
        it = iter(iterable)
        self.deques = [collections.deque() for _ in range(n)]
        def gen(mydeque):
            while True:
                if not mydeque:             # when the local deque is empty
                    try:
                        newval = next(it)   # fetch a new value and
                    except StopIteration:
                        return              # stop cleanly once the source is exhausted
                    for d in self.deques:   # load it to all the active deques
                        d.append(newval)
                yield mydeque.popleft()
        self.generators = [gen(d) for d in self.deques]

    def __call__(self):
        return self.generators

    def discard(self, gen):
        # stop buffering for this generator: its deque is removed from
        # self.deques, so gen() no longer appends new values to it
        index = self.generators.index(gen)
        del self.deques[index]
        del self.generators[index]

Note that because it's now a class, using it is slightly different. But once you're done with the fork, you can get rid of it by calling tee.discard(fork). Here's an example:

tee = None
while True:
    try:
        element = next(stream)
        process_element(element)
        if some_condition(element):
            if not tee:
                tee = Tee(stream)
                stream, fork = tee()
            process_fork(fork)
    except StopIteration:
        break

if tee:
    tee.discard(fork)
    fork = None
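
As a toy illustration of what discard() actually frees (hypothetical values, and it peeks at the internal deques list):

t = Tee(iter(range(10)))
g1, g2 = t()

next(g1)                  # 0 is appended to both deques, then popped from g1's
next(g1)                  # likewise for 1
print(len(t.deques[1]))   # 2 -- the buffer kept for the lagging g2

t.discard(g2)             # g2's deque is deleted and no longer filled
print(list(g1))           # [2, 3, 4, 5, 6, 7, 8, 9]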

Answer 1 (score: 0):

Here's a simple test script:

from itertools import tee

def natural_numbers():
    i = 0
    while True:
        yield i
        i += 1

stream = natural_numbers()  # don't use xrange, CPython optimizes it away
stream, fork = tee(stream)
del fork
for e in stream:
    pass

It seems that, at least in CPython, the process's memory does not keep growing. There seems to be a mechanism that detects this situation.
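
One way to check this for yourself is to watch allocations while iterating (a minimal sketch; tracemalloc is Python 3 only, and the iteration counts here are arbitrary):

import tracemalloc
from itertools import tee

def natural_numbers():
    i = 0
    while True:
        yield i
        i += 1

tracemalloc.start()
stream, fork = tee(natural_numbers())
del fork

for i, e in enumerate(stream):
    if i % 1000000 == 0:
        current, peak = tracemalloc.get_traced_memory()
        print(i, current, peak)   # current should stay roughly flat
    if i >= 5000000:
        break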

However, if you replace tee with the pure Python code that the documentation states is equivalent to it...

def tee(iterable, n=2):
    it = iter(iterable)
    deques = [collections.deque() for i in range(n)]
    def gen(mydeque):
        while True:
            if not mydeque:             # when the local deque is empty
                newval = next(it)       # fetch a new value and
                for d in deques:        # load it to all the deques
                    d.append(newval)
            yield mydeque.popleft()
    return tuple(gen(d) for d in deques)

...memory does keep growing, as expected: del fork drops only the generator, while its deque is still referenced by the closed-over deques list and keeps being filled.

So my guess is that this is implementation-dependent behavior.