Regenerating random items recursively

Time: 2014-10-12 23:48:18

Tags: python recursion random data-generation

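For database testing, I need to generate queries. To reduce complexity, let's assume there are only "insert" and "select" queries, and that we only store integers up to 2^64. Entries in the database have two levels: main keys and cluster keys. Each main key can contain up to 2^64 unique cluster keys, which in turn can contain up to 2^64 unique data items.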

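For each insert query, two chance values are given:

  1. the chance that it creates a new main key, and
  2. the chance that it creates a new cluster key for an existing item.

I also have a pseudo-random number generator and the number of items generated so far. That number is also used to seed the random generator when a new item is created. See the code for how I tried to do this: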

    from random import Random
    
    def generate_seeds(main_chance, cluster_chance, max_generated):
            generator = Random()
            new = main_chance > generator.random()
    
            # increase the counter if a new item is generated
            max_generated += new
    
            # We chose "insert", so a new item needs to be generated
            if new:
                main_key = max_generated
                # seed the generator with that main_key
                generator.seed(main_key)
                # now determine if a whole new item will be generated
                # or an old key gets new additional items
    
                # Save the main seed. In case we just add an item
                # the main seed will be an old one and the main
                # seed will only be used for the new items.
                cluster_key = main_key
                add_item = cluster_chance > generator.random()
                # check if a completely new item will be generated
                if (not add_item) and new:
                    return main_key, main_key, max_generated
    
                # We need an old main key that created a new item, so iterate
                # over the old keys until we find one that did. If no key was
                # ever used to create a completely new item, fall back to
                # seed zero, which always generates a completely new item.
                if add_item:
                    # if the cluster_chance is big we might iterate very often :(
                    for main_key in generator.sample(xrange(main_key), main_key):
                        generator.seed(main_key)
                        if cluster_chance < generator.random() or \
                           cluster_key == 0:
                            break
                    else:
                        # special case: no items have been generated yet
                        main_key = 0
                    return main_key, cluster_key, max_generated
            else:
                # The choice was "select", regenerate an old item
                choice = generator.randint(0, max_generated)
                return generate_seeds(1, cluster_chance, choice)
    

The problem: the for loop entered after add_item may go through a lot of "recursion" steps, and this becomes more likely the bigger cluster_chance is.
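To put a rough number on this: if the first draw after seeding behaves like an independent uniform sample for each key (an assumption, not something the code above guarantees), each candidate key ends the loop with probability 1 - cluster_chance, so the expected number of keys tried is about 1 / (1 - cluster_chance). The sketch below estimates that count empirically. It is written for Python 3, and `count_trials` and `average_trials` are hypothetical helper names, not part of the original code.

```python
from random import Random


def count_trials(cluster_chance, num_keys, seed):
    """Count how many old keys are tried before one is found whose first
    draw marks it as a completely new item (draw > cluster_chance)."""
    picker = Random(seed)
    # Visit the old keys in random order, as the original loop does.
    candidates = picker.sample(range(num_keys), num_keys)
    probe = Random()
    for trials, key in enumerate(candidates, start=1):
        probe.seed(key)
        if cluster_chance < probe.random():
            return trials
    # No key qualified; the original code falls back to key 0 here.
    return num_keys


def average_trials(cluster_chance, num_keys=2000, runs=200):
    """Average the trial count over several shuffles (assumed parameters)."""
    total = sum(count_trials(cluster_chance, num_keys, seed)
                for seed in range(runs))
    return total / runs
```

Calling `average_trials` with a few values of `cluster_chance` should show the average growing sharply as the chance approaches 1, which is the cost described above.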

How can this be solved in a better way?


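Edit: the only idea I have come up with is to build a list of ints, where list[n] is:

  • n, if n was used to generate a completely new item with main key = cluster key
  • some main key k < n, if a new cluster was generated for n, so that main key = k and cluster key = n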
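A minimal sketch of such a list, assuming Python 3 and assuming the decision for item n comes from the first draw after seeding with n, as in the code above. `build_owner_table` is a hypothetical helper name, and choosing the owning key uniformly among the earlier completely new items is an illustrative choice that may not match the distribution of the original loop.

```python
from random import Random


def build_owner_table(cluster_chance, count):
    """Return a list where owner[n] == n if item n is a completely new item,
    and otherwise owner[n] is an earlier main key k < n that item n extends."""
    owner = []
    new_keys = []  # main keys that created a completely new item so far
    rng = Random()
    for n in range(count):
        rng.seed(n)
        add_item = cluster_chance > rng.random()
        if add_item and new_keys:
            # Item n becomes a new cluster key of an existing main key.
            owner.append(rng.choice(new_keys))
        else:
            # First item, or the draw said "completely new item".
            owner.append(n)
            new_keys.append(n)
    return owner
```

With the table in place, regenerating item n is a single lookup, `owner[n]`, instead of a search over old keys. The extra `new_keys` list is only there to keep the sketch short and adds to the memory cost.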

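The problem is that this solution uses a lot of memory: d = [x for x in xrange(100000000)] (100 million values) uses 3,183,344 KiB of memory, which is about 32.6 bytes per value, or 32,939,450 values per GiB. So with 32 GiB of RAM, about 1 billion values can be managed. That is good, but not good enough.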
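One way to shrink the per-value cost, offered as a sketch rather than a measured fix for this workload: the standard-library `array` module stores unboxed machine integers, so typecode `'Q'` (unsigned, available from Python 3.3) typically takes 8 bytes per entry instead of a pointer plus an integer object. `make_table` is a hypothetical helper name.

```python
from array import array


def make_table(count):
    """Return a compact table of unsigned 64-bit values, initialised so that
    table[n] == n (every item starts out as its own main key)."""
    return array('Q', range(count))


table = make_table(1000)
bytes_per_value = table.itemsize  # size of one stored element in bytes
```

At 8 bytes per value, 32 GiB would hold 2^32 (about 4.3 billion) entries, roughly four times as many as the plain list. Note that `'Q'` holds values up to 2^64 - 1, so the value 2^64 itself would not fit.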

0 answers:

No answers yet