Using itertools.product and wanting to seed it with a value

Time: 2012-03-25 22:55:46

Tags: python image download seed itertools

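So I wrote a small script to download pictures from a website. It iterates over 7-character alphanumeric ids, in which the first character is always a digit. The problem is that if I want to stop the script and start it again, I have to start all over.

Can I seed itertools.product somehow with the last value I got, so that I don't have to go through all of them again?

Thanks for any input.

Here is part of the code: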

numbers = '0123456789'
alnum = numbers + 'abcdefghijklmnopqrstuvwxyz'

len7 = itertools.product(numbers, alnum, alnum, alnum, alnum, alnum, alnum) # length 7

for p in itertools.chain(len7):
    currentid = ''.join(p) 

    #semi static vars
    url = 'http://mysite.com/images/'
    url += currentid

    #Need to get the real url because of the redirect
    print "Trying " + url
    req = urllib2.Request(url)
    res = openaurl(req)
    if res == "continue": continue
    finalurl = res.geturl()

    #ok we have the full url, now time to check if it is real
    try:
        file = urllib2.urlopen(finalurl)
    except urllib2.HTTPError, e:
        print e.code
        continue  # nothing to read for this id, move on to the next one

    im = cStringIO.StringIO(file.read())
    img = Image.open(im)
    writeimage(img)

3 answers:

Answer 0 (score: 3)

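Here is a solution based on the pypy library code (thanks to agf's suggestion in the comments).

The state is available via the .state attribute and can be reset via .goto(state), where state is an index into the sequence (starting at 0). There is a demo at the end (you need to scroll down, I'm afraid).

This is much faster than discarding values, because goto computes the position directly instead of iterating up to it.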

> cat prod.py 

class product(object):

    def __init__(self, *args, **kw):
        if len(kw) > 1:
            raise TypeError("product() takes at most 1 argument (%d given)" %
                             len(kw))
        self.repeat = kw.get('repeat', 1)
        self.gears = [x for x in args] * self.repeat
        self.num_gears = len(self.gears)
        self.reset()

    def reset(self):
        # initialization of indicies to loop over
        self.indicies = [(0, len(self.gears[x]))
                         for x in range(0, self.num_gears)]
        self.cont = True
        self.state = 0

    def goto(self, n):
        self.reset()
        self.state = n
        x = self.num_gears
        while n > 0 and x > 0:
            x -= 1
            n, m = divmod(n, len(self.gears[x]))
            self.indicies[x] = (m, self.indicies[x][1])
        if n > 0:
            self.reset()
            raise ValueError("state exceeded")

    def roll_gears(self):
        # Starting from the end of the gear indicies work to the front
        # incrementing the gear until the limit is reached. When the limit
        # is reached carry operation to the next gear
        self.state += 1
        should_carry = True
        for n in range(0, self.num_gears):
            nth_gear = self.num_gears - n - 1
            if should_carry:
                count, lim = self.indicies[nth_gear]
                count += 1
                if count == lim and nth_gear == 0:
                    self.cont = False
                if count == lim:
                    should_carry = True
                    count = 0
                else:
                    should_carry = False
                self.indicies[nth_gear] = (count, lim)
            else:
                break

    def __iter__(self):
        return self

    def next(self):
        if not self.cont:
            raise StopIteration
        l = []
        for x in range(0, self.num_gears):
            index, limit = self.indicies[x]
            l.append(self.gears[x][index])
        self.roll_gears()
        return tuple(l)

p = product('abc', '12')
print list(p)
p.reset()
print list(p)
p.goto(2)
print list(p)
p.goto(4)
print list(p)
> python prod.py 
[('a', '1'), ('a', '2'), ('b', '1'), ('b', '2'), ('c', '1'), ('c', '2')]
[('a', '1'), ('a', '2'), ('b', '1'), ('b', '2'), ('c', '1'), ('c', '2')]
[('b', '1'), ('b', '2'), ('c', '1'), ('c', '2')]
[('c', '1'), ('c', '2')]

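You should test it more - I may have made a dumb mistake - but the idea is quite simple, so you should be able to fix it :o) You are free to use my changes; I don't know what the original pypy licence is.

Also, state isn't really the full state - it doesn't include the original arguments - it is just an index into the sequence. Maybe it would have been better to call it index, but there are already indicies in the code...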
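To make the index idea concrete, here is a minimal Python 3 sketch of the same mixed-radix conversion that goto performs. It is not part of the pypy-derived class above, and the name index_to_tuple is a hypothetical helper chosen for illustration. It assumes the ordering of itertools.product, where the last sequence varies fastest.

```python
from itertools import product

def index_to_tuple(n, *sequences):
    """Return the n-th tuple (0-based) of itertools.product(*sequences)."""
    picked = []
    # Walk from the last sequence to the first, because the last one
    # varies fastest in itertools.product.
    for seq in reversed(sequences):
        n, m = divmod(n, len(seq))
        picked.append(seq[m])
    if n > 0:
        # Same condition the class above reports as "state exceeded".
        raise ValueError("index exceeds the size of the product")
    return tuple(reversed(picked))

print(index_to_tuple(2, 'abc', '12'))  # ('b', '1')
print(index_to_tuple(4, 'abc', '12'))  # ('c', '1')

# Cross-check against itertools.product itself.
print(all(index_to_tuple(i, 'abc', '12') == t
          for i, t in enumerate(product('abc', '12'))))  # True
```

The two printed tuples are the first elements of the lists produced by p.goto(2) and p.goto(4) in the demo.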

Update

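Here is a simpler version that uses the same idea but works by transforming a sequence of numbers. So you just imap it over count(n) to get the sequence offset by n. Note that in this version the first argument varies fastest, which is the reverse of the order itertools.product uses.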

> cat prod2.py 

from itertools import count, imap

def make_product(*values):
    def fold((n, l), v):
        (n, m) = divmod(n, len(v))
        return (n, l + [v[m]])
    def product(n):
        (n, l) = reduce(fold, values, (n, []))
        if n > 0: raise StopIteration
        return tuple(l)
    return product

print list(imap(make_product(['a','b','c'], [1,2,3]), count()))
print list(imap(make_product(['a','b','c'], [1,2,3]), count(3)))

def product_from(n, *values):
    return imap(make_product(*values), count(n))

print list(product_from(4, ['a','b','c'], [1,2,3]))

> python prod2.py 
[('a', 1), ('b', 1), ('c', 1), ('a', 2), ('b', 2), ('c', 2), ('a', 3), ('b', 3), ('c', 3)]
[('a', 2), ('b', 2), ('c', 2), ('a', 3), ('b', 3), ('c', 3)]
[('b', 2), ('c', 2), ('a', 3), ('b', 3), ('c', 3)]

(The downside here is that if you want to stop and restart, you need to keep track yourself of how many values you have used.)
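One way to do that bookkeeping is to write the next index to a small file after each item and read it back on startup. The following is a Python 3 sketch under stated assumptions: the checkpoint file name and the helpers nth_product, load_index, save_index and run are all hypothetical, and nth_product uses the same ordering as make_product (first sequence varies fastest).

```python
import os
import tempfile
from itertools import count

def nth_product(n, *values):
    """n-th tuple, first sequence varying fastest; None once n is past the end."""
    picked = []
    for v in values:
        n, m = divmod(n, len(v))
        picked.append(v[m])
    if n > 0:
        return None
    return tuple(picked)

def load_index(path):
    """Read the saved index, or start from 0 if there is no checkpoint yet."""
    if not os.path.exists(path):
        return 0
    with open(path) as f:
        return int(f.read().strip() or 0)

def save_index(path, n):
    """Write to a temporary file first so a crash cannot leave a half-written file."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        f.write(str(n))
    os.replace(tmp, path)

def run(path, handle, *values):
    """Call handle(item) for every remaining item, checkpointing after each one."""
    for n in count(load_index(path)):
        item = nth_product(n, *values)
        if item is None:
            break
        handle(item)
        save_index(path, n + 1)

# Demo: pretend an earlier run stopped after handling 4 items.
checkpoint = os.path.join(tempfile.mkdtemp(), "last_index.txt")
save_index(checkpoint, 4)
seen = []
run(checkpoint, seen.append, ['a', 'b', 'c'], [1, 2, 3])
print(seen)  # [('b', 2), ('c', 2), ('a', 3), ('b', 3), ('c', 3)]
```

Because the index is saved only after handle returns, an interruption in the middle of an item means that item is retried on the next run rather than skipped.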

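Answer 1 (score: 2)

Once you get a fair way along the iterator, it is going to take a while to get to the right spot using dropwhile.

You should probably adapt a recipe like this one so that you can save the state between runs.

Make sure that only one instance of your script can run at a time, or you will need something more elaborate, such as a server process that hands out the ids to the scripts.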
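For reference, the dropwhile approach mentioned at the start of this answer could look like the Python 3 sketch below. The helper name resume_with_dropwhile is hypothetical, and the example uses 3-character ids instead of 7 to keep it quick. It shows why the approach gets slow: every skipped tuple is generated and compared against the last one seen.

```python
from itertools import dropwhile, product

def resume_with_dropwhile(last, *sequences):
    """Iterate over the tuples that come after `last`, discarding everything before."""
    it = dropwhile(lambda t: t != last, product(*sequences))
    next(it, None)  # drop `last` itself, since it was already processed
    return it

numbers = '0123456789'
alnum = numbers + 'abcdefghijklmnopqrstuvwxyz'

remaining = resume_with_dropwhile(('0', 'a', 'b'), numbers, alnum, alnum)
print(''.join(next(remaining)))  # 0ac
```

The cost of the skip grows linearly with the position of `last` in the product, so it is only reasonable near the start of a long run.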

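Answer 2 (score: 1)

If your input sequences don't contain any duplicate values, this may be faster than using dropwhile to advance product, because it calculates the point at which to resume iteration instead of comparing every dropped value.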

from itertools import product, islice
from operator import mul

def resume_product(state, *sequences):
    start = 0
    seqlens = map(len, sequences)
    if any(len(set(seq)) != seqlen for seq, seqlen in zip(sequences, seqlens)):
        raise ValueError("One of your sequences contains duplicate values")
    current = end = reduce(mul, seqlens)
    for i, seq, seqlen in zip(state, sequences, seqlens):
        current /= seqlen
        start += seq.index(i) * current
    return islice(product(*sequences), start + 1, end)


seqs = '01', '23', '45', '678'        

# if I want to resume after '1247':
for i in resume_product('1247', *seqs):
    # blah blah
    pass
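The code above relies on Python 2 behaviour (map returns a list, / on integers truncates, reduce is a builtin). A minimal Python 3 sketch of the same calculation, still assuming no duplicate values in any sequence, might look like this. The helper name product_index is hypothetical and is split out only so that the computed position can be inspected.

```python
from functools import reduce
from itertools import islice, product
from operator import mul

def product_index(state, *sequences):
    """Position of `state` within itertools.product(*sequences)."""
    remaining = reduce(mul, (len(seq) for seq in sequences), 1)
    index = 0
    for value, seq in zip(state, sequences):
        # Each step narrows the block size covered by one choice at this position.
        remaining //= len(seq)
        index += seq.index(value) * remaining
    return index

def resume_product(state, *sequences):
    """Iterate over the tuples that come after `state`."""
    start = product_index(state, *sequences) + 1
    return islice(product(*sequences), start, None)

seqs = '01', '23', '45', '678'

print(product_index('1247', *seqs))                  # 13
print(''.join(next(resume_product('1247', *seqs))))  # 1248
```

islice still advances the underlying product one item at a time, so the skip is not free; the saving over dropwhile is that no comparison is made on the skipped tuples.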