Question

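I have two issues here.
1) My initial problem was trying to improve the compute time of my algorithm.
2) My next problem is that one of my "improvements" seems to consume over 500 GB of RAM, and I really don't know why.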

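I have been writing a permutation for a bioinformatics pipeline. Basically, the algorithm starts by sampling 1000 null variants for each variant I have. If a certain condition is not met, the algorithm repeats the sampling, this time with 10,000 nulls. If the condition is still not met, this goes up to 100,000, and then to 1,000,000 nulls per variant.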
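For illustration, the escalation scheme described above can be sketched as follows. This is only a sketch: `escalate`, `run_permutations` and `condition_met` are hypothetical names, and the real condition is not shown in this post.

```python
# Permutation counts tried in order, as described above.
PERMUTATION_LEVELS = (1_000, 10_000, 100_000, 1_000_000)

def escalate(variant, run_permutations, condition_met, levels=PERMUTATION_LEVELS):
    """Rerun the sampling with more nulls until the condition holds.

    run_permutations(variant, n) and condition_met(result) are hypothetical
    callables standing in for the real pipeline steps. Returns (n_used, result).
    If the condition is never met, the result from the largest level is returned.
    """
    n, result = None, None
    for n in levels:
        result = run_permutations(variant, n)
        if condition_met(result):
            break
    return n, result

# Toy stand-ins: the "result" is just the number of permutations that were run.
def toy_run(variant, n):
    return n

def toy_condition(result):
    return result >= 100_000

print(escalate("example_variant", toy_run, toy_condition))  # (100000, 100000)
```

Only variants that fail the condition need to move on to the next level, so the most expensive level would ideally run on a small subset of the variants.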

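I have done everything I can think of to optimize this function, and at this point I am stumped on how to improve it further. I have this gnarly comprehension: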

output_names_df, output_p_values_df = map(list, zip(*[random_sampling_for_variant(variant, options_tables, variant_to_bins, num_permutations, use_num) for variant in variant_list]))

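Basically, all this does is call my random_sampling_for_variant function on each variant in my variant list and throw the two outputs of that function into lists of lists (so I end up with two lists of lists, output_names_df and output_p_values_df). I then turn these lists of lists back into DataFrames, rename the columns, and do whatever I need with them. The function being called looks like this: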

def random_sampling_for_variant(variant, options_tables, variant_to_bins, num_permutations, use_num):
    """
    Inner function to permute_null_variants_single
    equivalent to permutation for 1 variant
    """
    #Get nulls that are in the same bin as our variant
    permuted_null_variant_table = options_tables.get_group(variant_to_bins[variant])
    #If number of permutations >= number of nulls, sample with replacement
    replace = num_permutations >= len(permuted_null_variant_table)
    #Select rows for permutation, then add as columns to our temporary dfs
    picked_indices = permuted_null_variant_table.sample(n=num_permutations, replace=replace)
    temp_names_df = picked_indices['variant'].values[:use_num]
    temp_p_values_df = picked_indices['GWAS_p_value'].values
    return temp_names_df, temp_p_values_df

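In defining permuted_null_variant_table, I'm just querying a predefined grouped DataFrame to determine the appropriate null table to sample from. I've found this to be faster than trying to work out the appropriate nulls to sample from on the fly. The logic after that determines whether I need to sample with or without replacement, and takes hardly any time. Defining picked_indices is where the random sampling actually happens. temp_names_df and temp_p_values_df pull the values I want out of the null table, and the return statement then sends them out of the function.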
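As a point of comparison, here is a minimal sketch of the same sampling logic expressed directly on NumPy arrays instead of through DataFrame.sample. It assumes the 'variant' and 'GWAS_p_value' columns of each bin have been extracted into arrays ahead of time; `sample_nulls` and its parameters are hypothetical names, and whether this is actually faster on the real data is untested.

```python
import numpy as np

def sample_nulls(names, p_values, num_permutations, use_num, rng):
    """Draw row positions once, then index only the values that are needed.

    names and p_values are 1-D arrays for a single bin (same length).
    Mirrors the original logic: sample with replacement when the number of
    permutations is at least the number of nulls.
    """
    n_nulls = len(p_values)
    replace = num_permutations >= n_nulls
    idx = rng.choice(n_nulls, size=num_permutations, replace=replace)
    # Index with the first use_num positions, so only use_num names are ever
    # materialized; the p-values use all sampled positions.
    return names[idx[:use_num]], p_values[idx]

rng = np.random.default_rng(0)
toy_names = np.array(["null_a", "null_b", "null_c"], dtype=object)
toy_p = np.array([0.1, 0.2, 0.3])
picked_names, picked_p = sample_nulls(toy_names, toy_p, 10, 4, rng)
print(len(picked_names), len(picked_p))  # 4 10
```

One difference from the function above is that `.values[:use_num]` is a basic slice, which in NumPy is a view that keeps the full sampled array alive, whereas indexing with an integer array returns a new, smaller array.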

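The problem here is that the above doesn't scale very well. For 1000 permutations, the line of code above takes ~254 seconds on 7725 variants. For 10,000 permutations, ~333 seconds. For 100,000 permutations, ~720 seconds. For 1,000,000 permutations (which is the upper limit I'd like to reach), my process was killed because it was apparently going to take more RAM than the cluster had.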
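A back-of-the-envelope estimate helps put the memory use in context. The comprehension keeps the results for every variant alive at once, so the p-values alone need one value per (variant, permutation) pair. The sketch below assumes a single copy of the data and the stated value widths; `result_bytes` is a hypothetical helper.

```python
def result_bytes(n_variants, n_permutations, bytes_per_value):
    """Bytes needed to hold one value per (variant, permutation) pair."""
    return n_variants * n_permutations * bytes_per_value

N_VARIANTS = 7725            # from the timings above
N_PERMUTATIONS = 1_000_000   # the target upper limit

for label, width in [("float16", 2), ("float64", 8)]:
    gb = result_bytes(N_VARIANTS, N_PERMUTATIONS, width) / 1e9
    print(f"{label}: {gb:.2f} GB")
# float16: 15.45 GB
# float64: 61.80 GB
```

These figures are lower bounds. Any intermediate copies (the sampled DataFrame for each variant, the lists built by zip and map, the conversion back to DataFrames) and any upcasting from float16 would add to them, so this estimate alone does not account for 500 GB.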

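I'm stumped on how to proceed. I've turned all my ints into 8-bit uints and all my floats into 16-bit floats. I've dropped the columns I don't need, so I'm only sampling from tables that contain the columns I need. Initially I had the comprehension in the form of a loop, but the compute time of the comprehension is 3/5 that of the loop. Any feedback is appreciated.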
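One caveat about the downcasting, shown as a small sketch with made-up values: float16 cannot represent positive numbers much below 2**-24 (about 6e-8), and uint8 only holds 0 to 255. Whether this matters depends on the actual value ranges in the tables, which are not shown here.

```python
import numpy as np

# Made-up p-values; the smallest one is below what float16 can represent,
# so it rounds to zero after the cast.
small_p = np.array([1e-3, 1e-6, 1e-10])
as_half = small_p.astype(np.float16)
print(as_half)

# Made-up integer values; astype wraps anything above 255 without an error.
counts = np.array([10, 255, 300])
as_u8 = counts.astype(np.uint8)
print(as_u8)
```

If the GWAS p-values include very small values, float32 would keep them nonzero at twice the memory of float16.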

Optimizing random sampling for scaling permutations

0 answers: