A recent similar question (isinstance(foo, types.GeneratorType) or inspect.isgenerator(foo)?) got me curious about how to implement this generically.
It actually seems like a generally useful thing to have: a generator-type object that caches the first time through (like itertools.cycle), reports StopIteration, and then returns items from the cache the next time through; but if the object isn't a generator (i.e. a list or dict that inherently supports O(1) lookup), then don't cache, and provide the same behaviour over the original list.
Possibilities:
1) Modify itertools.cycle. It looks like this:
def cycle(iterable):
    saved = []
    try:
        saved.append(iterable.next())
        yield saved[-1]
        isiter = True
    except:
        saved = iterable
        isiter = False
    # cycle('ABCD') --> A B C D A B C D A B C D ...
    for element in iterable:
        yield element
        if isiter:
            saved.append(element)
    # ??? What next?
If I could restart the generator, that would be perfect - I could send back a StopIteration and then, on the next gen.next(), return entry 0, i.e. `A B C D StopIteration A B C D StopIteration` - but it doesn't look like that's actually possible.
The second option is that once StopIteration is hit, saved holds the cache. But there doesn't appear to be any way to reach the internal saved[] list from outside. Perhaps a class version of this?
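For illustration, a class version along those lines might look like the sketch below (the name CycleCache and its attributes are invented here for the example, not part of any library):

class CycleCache(object):
    """Sketch of the 'class version': caches on the first pass and exposes the cache."""
    def __init__(self, iterable):
        self.saved = []            # reachable from outside, unlike the closure-local list above
        self._iterable = iterable
        self._done = False
    def __iter__(self):
        if self._done:
            return iter(self.saved)
        return self._fill()
    def _fill(self):
        # First pass: pull from the original iterable, appending to the cache as we go.
        for element in self._iterable:
            self.saved.append(element)
            yield element
        self._done = True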
2) Or I could pass the list in directly:
def cycle(iterable, saved=[]):
    saved.clear()
    try:
        saved.append(iterable.next())
        yield saved[-1]
        isiter = True
    except:
        saved = iterable
        isiter = False
    # cycle('ABCD') --> A B C D A B C D A B C D ...
    for element in iterable:
        yield element
        if isiter:
            saved.append(element)
mysaved = []
myiter = cycle(someiter, mysaved)
But that just looks nasty. In C/C++ I could pass in a reference and change what saved actually refers to, pointing it at the iterable - you can't actually do that in Python, so this doesn't even work.
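(A side note on why this fails: assigning saved = iterable inside the function only rebinds the local name, which the caller never observes; mutating the passed-in list in place would be visible. A tiny illustration, with made-up function names:)

def rebind(saved, iterable):
    saved = list(iterable)   # rebinds the local name only; the caller never sees this

def mutate(saved, iterable):
    saved[:] = iterable      # mutates the caller's list in place

mine = []
rebind(mine, 'ABCD')
print(mine)                  # []
mutate(mine, 'ABCD')
print(mine)                  # ['A', 'B', 'C', 'D']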
Other options?
Edit: More data. The CachingIterable approach appears to be too slow to be effective, but it did push me in a direction that might work. It's slightly slower than the naive approach (converting to a list myself), but doesn't appear to take the hit if the input is already iterable.
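(For context: the CachingIterable referred to here isn't reproduced in the question; roughly speaking it is a lazily-caching wrapper along the following lines - a sketch for illustration, not necessarily the exact code that was timed.)

class CachingIterable(object):
    """Sketch: caches items lazily on the first full pass, replays them on later passes."""
    def __init__(self, iterable):
        self._iter = iter(iterable)
        self._cache = []
    def __iter__(self):
        # Yield whatever is already cached, then keep pulling from the source.
        for item in self._cache:
            yield item
        for item in self._iter:
            self._cache.append(item)
            yield item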
Some code and data:
def cube_generator(max=100):
    i = 0
    while i < max:
        yield i*i*i
        i += 1
# Base case: use generator each time
%%timeit
cg = cube_generator(); [x for x in cg]
cg = cube_generator(); [x for x in cg]
cg = cube_generator(); [x for x in cg]
10000 loops, best of 3: 55.4 us per loop
# Fastest case: flatten to list, then iterate
%%timeit
cg = cube_generator()
cl = list(cg)
[x for x in cl]
[x for x in cl]
[x for x in cl]
10000 loops, best of 3: 27.4 us per loop
%%timeit
cg = cube_generator()
ci2 = CachingIterable(cg)
[x for x in ci2]
[x for x in ci2]
[x for x in ci2]
1000 loops, best of 3: 239 us per loop
# Another attempt, which is closer to the above
# Not exactly the original solution using next, but close enough i guess
class CacheGen(object):
    def __init__(self, iterable):
        if isinstance(iterable, (list, tuple, dict)):
            self._myiter = iterable
        else:
            self._myiter = list(iterable)
    def __iter__(self):
        return self._myiter.__iter__()
    def __contains__(self, key):
        return self._myiter.__contains__(key)
    def __getitem__(self, key):
        return self._myiter.__getitem__(key)
%%timeit
cg = cube_generator()
ci = CacheGen(cg)
[x for x in ci]
[x for x in ci]
[x for x in ci]
10000 loops, best of 3: 30.5 us per loop
# But if you start with a list, it is faster
cg = cube_generator()
cl = list(cg)
%%timeit
[x for x in cl]
[x for x in cl]
[x for x in cl]
100000 loops, best of 3: 11.6 us per loop
%%timeit
ci = CacheGen(cl)
[x for x in ci]
[x for x in ci]
[x for x in ci]
100000 loops, best of 3: 13.5 us per loop
Any quick recipes that can get closer to the 'pure' loop?
Answer 0 (score: 5)
What you want is not an iterator, but an iterable. An iterator can only iterate once through its contents. You want something that takes an iterator and over which you can then iterate many times, producing the same values from the iterator even if the iterator doesn't remember them, the way a generator does. Then it's just a matter of special-casing those inputs which don't need caching. Here is a non-thread-safe example (EDIT: updated for efficiency):
import itertools

class AsYouGoCachingIterable(object):
    def __init__(self, iterable):
        self.iterable = iterable
        self.iter = iter(iterable)
        self.done = False
        self.vals = []
    def __iter__(self):
        if self.done:
            return iter(self.vals)
        # chain vals so far & then gen the rest
        return itertools.chain(self.vals, self._gen_iter())
    def _gen_iter(self):
        # gen new vals, appending as it goes
        for new_val in self.iter:
            self.vals.append(new_val)
            yield new_val
        self.done = True
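A quick usage sketch, assuming the class above: iterating the wrapper twice yields the same values, with the second pass served from the cache.

nums = AsYouGoCachingIterable(x*x*x for x in range(5))
print(list(nums))   # first pass pulls from the generator: [0, 1, 8, 27, 64]
print(list(nums))   # second pass is served from the cache: [0, 1, 8, 27, 64]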
And some timings:
class ListCachingIterable(object):
    def __init__(self, obj):
        self.vals = list(obj)
    def __iter__(self):
        return iter(self.vals)

def cube_generator(max=1000):
    i = 0
    while i < max:
        yield i*i*i
        i += 1

def runit(iterable_factory):
    for i in xrange(5):
        for what in iterable_factory():
            pass

def puregen():
    runit(lambda: cube_generator())

def listtheniter():
    res = list(cube_generator())
    runit(lambda: res)

def listcachingiterable():
    res = ListCachingIterable(cube_generator())
    runit(lambda: res)

def asyougocachingiterable():
    res = AsYouGoCachingIterable(cube_generator())
    runit(lambda: res)
The results are:
In [59]: %timeit puregen()
1000 loops, best of 3: 774 us per loop
In [60]: %timeit listtheniter()
1000 loops, best of 3: 345 us per loop
In [61]: %timeit listcachingiterable()
1000 loops, best of 3: 348 us per loop
In [62]: %timeit asyougocachingiterable()
1000 loops, best of 3: 630 us per loop
So the simplest approach in terms of a class, ListCachingIterable, performs just about as well as doing the list manually. The "as-you-go" variant is almost twice as slow, but it has the advantage when you don't consume the entire list, e.g. say you're only looking for the first cube over 100:
def first_cube_past_100(cubes):
    for cube in cubes:
        if cube > 100:
            return cube
    raise ValueError("No cube > 100 in this iterable")
Then:
In [76]: %timeit first_cube_past_100(cube_generator())
100000 loops, best of 3: 2.92 us per loop
In [77]: %timeit first_cube_past_100(ListCachingIterable(cube_generator()))
1000 loops, best of 3: 255 us per loop
In [78]: %timeit first_cube_past_100(AsYouGoCachingIterable(cube_generator()))
100000 loops, best of 3: 10.2 us per loop
Answer 1 (score: 4)
Based on this comment:
My intention here is that this would only be used when the user knows they want to iterate multiple times over the 'iterable', but doesn't know whether the input is a generator or an iterable. This lets you ignore that distinction while not losing (much) performance.
This simple solution does exactly that:
def ensure_list(it):
    if isinstance(it, (list, tuple, dict)):
        return it
    else:
        return list(it)
Now ensure_list(a_list) is practically a no-op - two function calls - while ensure_list(a_generator) will turn it into a list and return it, which turned out to be faster than any other approach.
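A small usage sketch, assuming the ensure_list above: an existing list passes through untouched, while a generator is materialised into a list once and can then be iterated repeatedly.

squares_list = [x*x for x in range(5)]
print(ensure_list(squares_list) is squares_list)   # True - an existing list passes through untouched

squares_gen = (x*x for x in range(5))
squares = ensure_list(squares_gen)                 # the generator is flattened into a list once
print(squares)                                     # [0, 1, 4, 9, 16]
print(list(squares))                               # and can be iterated again: [0, 1, 4, 9, 16]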
Answer 2 (score: 0)
Just made a library that solves this - it supports caching for functions that return iterators:
from typing import *
from cacheable_iter import iter_cache
@iter_cache
def iterator_function(n: int) -> Iterator[int]:
    yield from range(n)
Usage example:
from typing import *
from cacheable_iter import iter_cache
@iter_cache
def my_iter(n: int) -> Iterator[int]:
    print(" * my_iter called")
    for i in range(n):
        print(f" * my_iter step {i}")
        yield i
gen1 = my_iter(4)
print("Creating an iterator...")
print(f"The first value of gen1 is {next(gen1)}")
print(f"The second value of gen1 is {next(gen1)}")
gen2 = my_iter(4)
print("Creating an iterator...")
print(f"The first value of gen2 is {next(gen2)}")
print(f"The second value of gen2 is {next(gen2)}")
print(f"The third value of gen2 is {next(gen2)}")
Which would print:
Creating an iterator...
* my_iter called
* my_iter step 0
The first value of gen1 is 0
* my_iter step 1
The second value of gen1 is 1
Creating an iterator...
The first value of gen2 is 0
The second value of gen2 is 1
* my_iter step 2
The third value of gen2 is 2
It also supports caching awaitable iterators and asynchronous iterators.