Question

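Similar to this and many other questions, I have many nested loops of the same structure (up to 16 of them).

Problem: I have a 4-letter alphabet and want to get all possible words of length 16. I need to filter these words. They are DNA sequences (hence 4 letters: ATGC), and the filtering rules are quite simple: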

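No XXXX substrings (i.e. the same letter cannot occur more than 3 times in a row; ATGCAT GGGG CTA is "bad")
Specific GC content, i.e. the number of Gs plus the number of Cs should be in a specific range (40-50%). ATATATATATATA and GCGCGCGCGCGC are bad words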

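itertools.product would work for this, but the data structure here would be gigantic (4^16 ≈ 4.3 * 10^9 words)

More importantly, if I use product, I still have to go through each element to filter it. So I would have 4 billion steps, twice over.

My current solution is nested for loops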

alphabet = ['a','t','g','c']
for p1 in alphabet:
    for p2 in alphabet:
       for p3 in alphabet:
       ...skip...
          for p16 in alphabet:
             word = p1+p2+p3+...+p16
             if word_is_good(word):
                 good_words.append(word)
                 counter+=1

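Is there a good programming pattern that avoids 16 nested loops? Is there a way to parallelize it efficiently (on multiple cores or on multiple EC2 nodes)? Also, with that pattern, could I insert the word_is_good check in the middle of the loops, since a word that starts badly is bad?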

...skip...
for p3 in alphabet:
   word_3 = p1+p2+p3
   if not word_is_good(word_3):
     break
   for p4 in alphabet:
     ...skip...

Answer 1

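Since you happen to have an alphabet of length 4 (or any length that is a power of 2), the idea of using an integer ID and bitwise operations comes to mind, instead of checking for consecutive characters in strings. We can assign an integer value to each character in alphabet; for simplicity, let's use the index corresponding to each letter.

Example:

65463543₁₀ = 3321232103313₄ = 'aaaddcbcdcbaddbd'

The following function converts a base-10 integer to a word using alphabet.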

def id_to_word(word_id, word_len):
    word = ''
    while word_id:
        rem = word_id & 0x3  # 2 bits per letter
        word = ALPHABET[rem] + word
        word_id >>= 2  # Bit shift to the next letter
    return '{2:{0}>{1}}'.format(ALPHABET[0], word_len, word)
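
As a quick check of the example above, here is a minimal sketch that round-trips between words and IDs. It assumes ALPHABET = ('a', 'b', 'c', 'd') as in the rest of this answer; word_to_id is a hypothetical helper that is not part of the original code, and id_to_word is restated with rjust only so the block runs on its own.

```python
ALPHABET = ('a', 'b', 'c', 'd')  # assumed; same tuple as used later in this answer


def id_to_word(word_id, word_len):
    # Same logic as above: peel off 2 bits per letter, lowest digit first.
    word = ''
    while word_id:
        word = ALPHABET[word_id & 0x3] + word
        word_id >>= 2
    # Left-pad with the first letter up to word_len characters.
    return word.rjust(word_len, ALPHABET[0])


def word_to_id(word):
    # Hypothetical inverse helper: read the word as a base-4 number.
    word_id = 0
    for letter in word:
        word_id = (word_id << 2) | ALPHABET.index(letter)
    return word_id


print(id_to_word(65463543, 16))        # aaaddcbcdcbaddbd
print(word_to_id('aaaddcbcdcbaddbd'))  # 65463543
```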

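Now for a function that checks whether a word is "good" based on its integer ID. The following method is similar in format to id_to_word, except that a counter is used to keep track of consecutive characters. The function returns False if the maximum number of identical consecutive characters is exceeded; otherwise it returns True.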

def check_word(word_id, max_consecutive):
    consecutive = 0
    previous = None
    while word_id:
        rem = word_id & 0x3
        if rem != previous:
            consecutive = 0
        consecutive += 1
        if consecutive == max_consecutive + 1:
            return False
        word_id >>= 2
        previous = rem
    return True
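
One thing worth noting about the loop above: it stops as soon as the remaining ID reaches 0, so leading letters that map to 0 ('a') are never counted. The following is a minimal sketch of a fixed-length variant, assuming the word length is known up front; check_word_fixed is a hypothetical name and not part of the original answer.

```python
def check_word_fixed(word_id, word_len, max_consecutive):
    # Hypothetical variant of check_word that always inspects word_len letters,
    # so leading 'a's (zero digits) count towards a run as well.
    consecutive = 0
    previous = None
    for _ in range(word_len):
        rem = word_id & 0x3
        if rem != previous:
            consecutive = 0
        consecutive += 1
        if consecutive > max_consecutive:
            return False
        word_id >>= 2
        previous = rem
    return True


# 'aaaab' has ID 1: four leading 'a's, so it is rejected.
print(check_word_fixed(1, 5, 3))  # False
# 'aaab' also has ID 1, but only three 'a's precede the 'b'.
print(check_word_fixed(1, 4, 3))  # True
```

This only matters if the search range includes IDs whose words begin with more than max_consecutive copies of the first letter.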

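We are effectively treating each word as a base-4 integer. If the alphabet length were not a power of 2, then modulo (% alpha_len) and integer division (// alpha_len) could be used in place of & (alpha_len - 1) and >> log2(alpha_len) respectively, though it would take much longer.

Finally, finding all the good words for a given word_len. The advantage of using a range of integer values is that you can reduce the number of for-loops in your code from word_len to a single loop over the IDs (plus the while loop inside check_word), although that outer loop is very large. This may allow for friendlier multiprocessing of your good-word search. I have also added a quick calculation to determine the smallest and largest IDs that correspond to good words, which helps narrow the search considerably.

ALPHABET = ('a', 'b', 'c', 'd')

def find_good_words(word_len):
    max_consecutive = 3
    alpha_len = len(ALPHABET)

    # Determine the words corresponding to the smallest and largest ids
    smallest_word = ''  # aaabaaabaaabaaab for word_len = 16
    largest_word = ''   # dddcdddcdddcdddc for word_len = 16
    for i in range(word_len):
        if (i + 1) % (max_consecutive + 1):
            smallest_word += ALPHABET[0]
            largest_word += ALPHABET[-1]
        else:
            smallest_word += ALPHABET[1]
            largest_word += ALPHABET[-2]

    # Determine the integer ids of said words
    trans_table = str.maketrans({c: str(i) for i, c in enumerate(ALPHABET)})
    smallest_id = int(smallest_word.translate(trans_table), alpha_len)  # 16843009
    largest_id = int(largest_word.translate(trans_table), alpha_len)    # 4278124286

    # Find and store the ids of "good" words
    counter = 0
    goodies = []
    for i in range(smallest_id, largest_id + 1):
        if check_word(i, max_consecutive):
            goodies.append(i)
            counter += 1
    return goodies

In this loop, I have deliberately stored the word's ID rather than the word itself, in case you are going to use the words for further processing. However, if you are only after the words, change the goodies.append(i) line to read goodies.append(id_to_word(i, word_len)).

Note: I received a MemoryError when attempting to store all of the good IDs. I suggest writing these IDs/words to a file of some sort!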
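
To act on the note above, one option is to stream good IDs to a file in chunks instead of collecting them in a list. This is a minimal sketch, assuming check_word from earlier in this answer (repeated here so the block runs on its own); id_chunks, write_good_ids, and the chunk size are hypothetical. Each (start, stop) chunk could equally be handed to a separate worker process.

```python
import os
import tempfile


def check_word(word_id, max_consecutive):
    # Same run-length check as earlier in this answer.
    consecutive = 0
    previous = None
    while word_id:
        rem = word_id & 0x3
        if rem != previous:
            consecutive = 0
        consecutive += 1
        if consecutive == max_consecutive + 1:
            return False
        word_id >>= 2
        previous = rem
    return True


def id_chunks(first_id, last_id, chunk_size):
    # Hypothetical helper: split [first_id, last_id] into half-open chunks.
    for start in range(first_id, last_id + 1, chunk_size):
        yield start, min(start + chunk_size, last_id + 1)


def write_good_ids(path, first_id, last_id, max_consecutive=3, chunk_size=1024):
    # Stream good IDs to disk, one per line; only one chunk is held in memory.
    total = 0
    with open(path, 'w') as out:
        for start, stop in id_chunks(first_id, last_id, chunk_size):
            good = [i for i in range(start, stop) if check_word(i, max_consecutive)]
            out.writelines('%d\n' % i for i in good)
            total += len(good)
    return total


# Small demonstration on 6-letter words so it finishes instantly:
# 'aaabaa' is 000100 in base 4 (ID 16) and 'dddcdd' is 333233 (ID 4079).
path = os.path.join(tempfile.mkdtemp(), 'good_ids.txt')
print(write_good_ids(path, 16, 4079))
```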

Answer 2

from itertools import product, islice
from time import time

length = 16

def generate(start, alphabet):
    """
    A recursive generator function which works like itertools.product
    but restricts the alphabet as it goes based on the letters accumulated so far.
    """

    if len(start) == length:
        yield start
        return

    gcs = start.count('g') + start.count('c')
    if gcs >= length * 0.5:
        alphabet = 'at'

    # consider the maximum number of Gs and Cs we can have in the end
    # if we add one more A/T now
    elif length - len(start) - 1 + gcs < length * 0.4:
        alphabet = 'gc'

    for c in alphabet:
        if start.endswith(c * 3):
            continue

        for string in generate(start + c, alphabet):
            yield string

def brute_force():
    """ Straightforward method for comparison """
    lower = length * 0.4
    upper = length * 0.5
    for s in product('atgc', repeat=length):
        if lower <= s.count('g') + s.count('c') <= upper:
            s = ''.join(s)
            if not ('aaaa' in s or
                    'tttt' in s or
                    'cccc' in s or
                    'gggg' in s):
                yield s

def main():
    funcs = [
        lambda: generate('', 'atgc'),
        brute_force
    ]

    # Testing performance
    for func in funcs:

        # This needs to be big to get an accurate measure,
        # otherwise `brute_force` seems slower than it really is.
        # This is probably because of how `itertools.product`
        # is implemented.
        count = 100000000
        start = time()
        for _ in islice(func(), count):
            pass
        print(time() - start)

    # Testing correctness
    global length
    length = 12
    for x, y in zip(*[func() for func in funcs]):
        assert x == y, (x, y)

main()

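On my machine, generate is only slightly faster than brute_force, at about 390 seconds versus 425 for the first 100 million words. That is about as fast as I can get it. I think the full run would take about 2 hours. Of course, actually processing the words will take much longer. The problem is that your constraints don't reduce the full set much.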
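
To see how much the constraints shrink the search space without enumerating anything, the good words can be counted with a small dynamic program. This sketch is not part of the original answer; count_good_words and its parameters are hypothetical, and it encodes the same two rules (no run longer than 3, GC count between 40% and 50% of the length).

```python
from collections import defaultdict


def count_good_words(length, max_run=3, low=0.4, high=0.5):
    # counts[(last, run, gc)] = number of valid prefixes ending in `last`,
    # whose final run has length `run`, containing `gc` letters from 'gc'.
    counts = {(c, 1, int(c in 'gc')): 1 for c in 'atgc'}
    for _ in range(length - 1):
        nxt = defaultdict(int)
        for (last, run, gc), n in counts.items():
            for c in 'atgc':
                new_run = run + 1 if c == last else 1
                if new_run > max_run:
                    continue  # would create a run of 4 identical letters
                nxt[(c, new_run, gc + (c in 'gc'))] += n
        counts = nxt
    # Keep only the prefixes whose final GC count lies in the allowed range.
    return sum(n for (_, _, gc), n in counts.items()
               if length * low <= gc <= length * high)


# Number of good 16-letter words next to the size of the full set.
print(count_good_words(16), 4 ** 16)
```

Comparing the two printed numbers gives the fraction of all 16-letter words that survive both filters.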

Here is an example of how to use this in parallel across 16 processes:

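from itertools import product
from multiprocessing.pool import Pool

alpha = 'atgc'

def generate_worker(start):
    # Each worker expands one 2-letter prefix, e.g. ('a', 't') -> 'at'
    start = ''.join(start)
    for s in generate(start, alpha):
        print(s)

if __name__ == '__main__':
    # 4 ** 2 = 16 prefixes, one per process
    Pool(16).map(generate_worker, product(alpha, repeat=2))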

Condensing multiple nested `for` loops
