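Similar to this question and many others, I have many nested loops of the same structure (up to 16).
Problem: I have a 4-letter alphabet and want to get all possible words of length 16. I then need to filter these words. They are DNA sequences (hence the 4 letters: ATGC), and the filtering rules are quite simple.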
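itertools.product would work for this, but the data structure would be huge (4^16 ≈ 4.3 × 10^9 words). More importantly, if I use product, I still have to go through every element to filter it, so I would make 4 billion steps twice.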
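One point worth checking before ruling out itertools.product: it returns a lazy iterator that yields one tuple at a time, so the 4^16 words are never held in memory together, and the filter can run in the same pass. Below is a minimal sketch. `word_is_good` here is only a placeholder (it assumes a rule of "no four identical letters in a row"), and a short word length keeps the run small.

```python
from itertools import islice, product

ALPHABET = "atgc"


def word_is_good(word):
    # Placeholder filter (an assumption): reject runs of four identical letters.
    return not any(c * 4 in word for c in ALPHABET)


def good_words(length):
    # product() is lazy: each tuple is created on demand, filtered, then discarded.
    for letters in product(ALPHABET, repeat=length):
        word = "".join(letters)
        if word_is_good(word):
            yield word


# Take only the first few good words; nothing else is materialized.
first_three = list(islice(good_words(6), 3))
print(first_three)
```

The single pass still visits every candidate, so this addresses the memory concern but not the running time.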
My current solution is nested for loops:

    alphabet = ['a','t','g','c']
    for p1 in alphabet:
        for p2 in alphabet:
            for p3 in alphabet:
                ...skip...
                    for p16 in alphabet:
                        word = p1+p2+p3+...+p16
                        if word_is_good(word):
                            good_words.append(word)
                            counter+=1

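Is there a good programming pattern that does this without 16 nested loops? Is there a way to parallelize it efficiently (on multiple cores or multiple EC2 nodes)?
With that pattern, could I also insert the word_is_good check in the middle of the loops? A word that starts badly is bad:

    ...skip...
            for p3 in alphabet:
                word_3 = p1+p2+p3
                if not word_is_good(word_3):
                    continue  # skip only this prefix; break would also drop the remaining p3 values
                for p4 in alphabet:
                    ...skip...
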
Answer 0 (score: 1)
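Since you happen to have an alphabet of length 4 (or any integer power of 2), the idea of using integer IDs and bitwise operations comes to mind, instead of checking consecutive characters in strings. We can assign an integer value to each character in ALPHABET; for simplicity, we use the index of each letter.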
Example:

    65463543 (base 10) = 3321232103313 (base 4) = 'aaaddcbcdcbaddbd'

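The example can be checked with the built-in int() and a digit-to-letter lookup. This is only a verification sketch; it assumes the alphabet ('a', 'b', 'c', 'd') that the answer uses, with each letter mapped to its index.

```python
ALPHABET = ("a", "b", "c", "d")

word_id = 65463543
base4_digits = "3321232103313"

# int() with an explicit base parses the base-4 digit string.
assert int(base4_digits, 4) == word_id

# Pad with leading zeros to 16 digits, then map each digit to its letter.
padded = base4_digits.zfill(16)
word = "".join(ALPHABET[int(d)] for d in padded)
print(word)  # aaaddcbcdcbaddbd
```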
The following function converts a base-10 integer to a word using ALPHABET.

    def id_to_word(word_id, word_len):
        word = ''
        while word_id:
            rem = word_id & 0x3  # 2 bits per letter
            word = ALPHABET[rem] + word
            word_id >>= 2  # Bit shift to the next letter
        # Left-pad with ALPHABET[0] up to word_len characters
        return '{2:{0}>{1}}'.format(ALPHABET[0], word_len, word)

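For testing, it can help to have the opposite direction as well. The sketch below repeats id_to_word so that it runs on its own and adds `word_to_id`, a hypothetical helper that is not part of the answer, then checks that the two round-trip.

```python
ALPHABET = ("a", "b", "c", "d")
LETTER_TO_INDEX = {c: i for i, c in enumerate(ALPHABET)}


def id_to_word(word_id, word_len):
    word = ""
    while word_id:
        rem = word_id & 0x3  # 2 bits per letter
        word = ALPHABET[rem] + word
        word_id >>= 2
    # Left-pad with ALPHABET[0] up to word_len characters
    return "{2:{0}>{1}}".format(ALPHABET[0], word_len, word)


def word_to_id(word):
    # Hypothetical inverse: shift in two bits per letter, most significant first.
    word_id = 0
    for letter in word:
        word_id = (word_id << 2) | LETTER_TO_INDEX[letter]
    return word_id


print(id_to_word(65463543, 16))        # aaaddcbcdcbaddbd
print(word_to_id("aaaddcbcdcbaddbd"))  # 65463543
```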
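Now for a function that checks whether a word is "good" based on its integer ID. The following function has a similar form to id_to_word, except that a counter keeps track of consecutive characters. It returns False if the maximum number of identical consecutive characters is exceeded, and True otherwise.

    def check_word(word_id, max_consecutive):
        consecutive = 0
        previous = None
        # Note: the loop ends once the remaining bits are zero, so leading
        # ALPHABET[0] letters are not inspected; the caller must exclude those ids.
        while word_id:
            rem = word_id & 0x3
            if rem != previous:
                consecutive = 0  # a new letter starts a new run
            consecutive += 1
            if consecutive == max_consecutive + 1:
                return False
            word_id >>= 2
            previous = rem
        return True
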
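A few spot checks may make the behaviour of check_word more concrete. The sketch repeats the function so that it runs on its own; `word_to_id` is a hypothetical helper used only to build the ids. The last check shows a property worth knowing: because the loop stops once the remaining bits are zero, leading 'a' letters are never inspected, so a caller has to exclude such ids by other means.

```python
ALPHABET = ("a", "b", "c", "d")


def check_word(word_id, max_consecutive):
    consecutive = 0
    previous = None
    while word_id:
        rem = word_id & 0x3
        if rem != previous:
            consecutive = 0
        consecutive += 1
        if consecutive == max_consecutive + 1:
            return False
        word_id >>= 2
        previous = rem
    return True


def word_to_id(word):
    # Hypothetical helper: pack two bits per letter, most significant first.
    word_id = 0
    for letter in word:
        word_id = (word_id << 2) | ALPHABET.index(letter)
    return word_id


print(check_word(word_to_id("abbbab"), 3))  # True: longest run is 3
print(check_word(word_to_id("abbbba"), 3))  # False: 'bbbb' is a run of 4
print(check_word(word_to_id("aaaab"), 3))   # True: the leading 'aaaa' is not seen
```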
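We are effectively treating each word as a base-4 integer. If the alphabet length were not a power of 2, then modulo % alpha_len and integer division // alpha_len could be used in place of & (alpha_len - 1) and >> log2(alpha_len) respectively, although this would take longer.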
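For an alphabet whose length is not a power of 2, the conversion could look like the sketch below, which uses divmod (modulo and integer division in one call) in place of the mask and shift. The function name and the three-letter alphabet in the usage lines are made up for illustration.

```python
def id_to_word_any_base(word_id, word_len, alphabet):
    base = len(alphabet)
    letters = []
    while word_id:
        # divmod replaces `& mask` (remainder) and `>> bits` (quotient).
        word_id, rem = divmod(word_id, base)
        letters.append(alphabet[rem])
    # Pad with the zero letter; digits were collected least significant first.
    letters.extend(alphabet[0] * (word_len - len(letters)))
    return "".join(reversed(letters))


print(id_to_word_any_base(5, 4, "xyz"))             # xxyz (5 is 12 in base 3)
print(id_to_word_any_base(65463543, 16, "abcd"))    # aaaddcbcdcbaddbd
```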
Finally, find all the good words of a given word_len. The advantage of using a range of integer values is that you reduce the number of for-loops in your code from word_len to 2, although the outer loop is very large. This may make the good-word search easier to split across processes. I have also added a quick calculation of the smallest and largest IDs that correspond to good words, which trims the search range.

    ALPHABET = ('a', 'b', 'c', 'd')

    def find_good_words(word_len):
        max_consecutive = 3
        alpha_len = len(ALPHABET)
        # Determine the words corresponding to the smallest and largest ids
        smallest_word = ''  # aaabaaabaaabaaab
        largest_word = ''   # dddcdddcdddcdddc
        for i in range(word_len):
            # Append (rather than prepend) so the words match the comments above
            if (i + 1) % (max_consecutive + 1):
                smallest_word += ALPHABET[0]
                largest_word += ALPHABET[-1]
            else:
                smallest_word += ALPHABET[1]
                largest_word += ALPHABET[-2]
        # Determine the integer ids of said words
        trans_table = str.maketrans({c: str(i) for i, c in enumerate(ALPHABET)})
        smallest_id = int(smallest_word.translate(trans_table), alpha_len)  # 16843009 for word_len = 16
        largest_id = int(largest_word.translate(trans_table), alpha_len)  # 4278124286 for word_len = 16
        # Find and store the id's of "good" words
        counter = 0
        goodies = []
        for i in range(smallest_id, largest_id + 1):
            if check_word(i, max_consecutive):
                goodies.append(i)
                counter += 1

In this loop, I have deliberately stored the word's ID rather than the actual word itself, in case you are going to use the words for further processing. However, if you are just after the words themselves, change the second-to-last line to read goodies.append(id_to_word(i, word_len)).
Note: I received a MemoryError when attempting to store all of the good IDs. I suggest writing these IDs/words to a file of some sort!
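One way to act on the file suggestion is to yield the good IDs one at a time and write each to disk as it is found, so that no list is ever built. The sketch below is an assumption-laden illustration: `iter_good_ids`, `write_good_ids`, and `split_range` are hypothetical names, check_word is repeated so the block runs on its own, and the output format (one ID per line) is arbitrary. `split_range` shows how the integer range could be cut into contiguous chunks, one per worker process.

```python
def check_word(word_id, max_consecutive):
    consecutive = 0
    previous = None
    while word_id:
        rem = word_id & 0x3
        if rem != previous:
            consecutive = 0
        consecutive += 1
        if consecutive == max_consecutive + 1:
            return False
        word_id >>= 2
        previous = rem
    return True


def iter_good_ids(start, stop, max_consecutive=3):
    """Yield good ids in [start, stop) one at a time instead of building a list."""
    for word_id in range(start, stop):
        if check_word(word_id, max_consecutive):
            yield word_id


def write_good_ids(path, start, stop, max_consecutive=3):
    """Stream good ids in [start, stop) to a text file; return how many were written."""
    count = 0
    with open(path, "w") as fh:
        for word_id in iter_good_ids(start, stop, max_consecutive):
            fh.write(f"{word_id}\n")
            count += 1
    return count


def split_range(start, stop, n_chunks):
    """Split [start, stop) into n_chunks contiguous sub-ranges of near-equal size."""
    step, extra = divmod(stop - start, n_chunks)
    bounds = []
    lo = start
    for k in range(n_chunks):
        hi = lo + step + (1 if k < extra else 0)
        bounds.append((lo, hi))
        lo = hi
    return bounds


print(split_range(0, 10, 3))  # [(0, 4), (4, 7), (7, 10)]
```

Each (lo, hi) pair could then be handed to a separate process that calls write_good_ids with its own output path.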
Answer 1 (score: 1)

    from itertools import product, islice
    from time import time

    length = 16

    def generate(start, alphabet):
        """
        A recursive generator function which works like itertools.product
        but restricts the alphabet as it goes based on the letters accumulated so far.
        """
        if len(start) == length:
            yield start
            return
        gcs = start.count('g') + start.count('c')
        if gcs >= length * 0.5:
            alphabet = 'at'
        # consider the maximum number of Gs and Cs we can have in the end
        # if we add one more A/T now
        elif length - len(start) - 1 + gcs < length * 0.4:
            alphabet = 'gc'
        for c in alphabet:
            if start.endswith(c * 3):
                continue
            for string in generate(start + c, alphabet):
                yield string

    def brute_force():
        """ Straightforward method for comparison """
        lower = length * 0.4
        upper = length * 0.5
        for s in product('atgc', repeat=length):
            if lower <= s.count('g') + s.count('c') <= upper:
                s = ''.join(s)
                if not ('aaaa' in s or
                        'tttt' in s or
                        'cccc' in s or
                        'gggg' in s):
                    yield s

    def main():
        funcs = [
            lambda: generate('', 'atgc'),
            brute_force
        ]
        # Testing performance
        for func in funcs:
            # This needs to be big to get an accurate measure,
            # otherwise `brute_force` seems slower than it really is.
            # This is probably because of how `itertools.product`
            # is implemented.
            count = 100000000
            start = time()
            for _ in islice(func(), count):
                pass
            print(time() - start)
        # Testing correctness
        global length
        length = 12
        for x, y in zip(*[func() for func in funcs]):
            assert x == y, (x, y)

    main()

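On my machine, generate is only slightly faster than brute_force: about 390 seconds versus 425. This is pretty much as fast as I could make them. I think the full run would take about 2 hours. Of course, actually processing the words will take much longer. The problem is that your constraints don't reduce the full set much.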
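To see how much of the full set survives without enumerating it, the good words can be counted with a small dynamic program. This is a sketch under the same two rules that brute_force applies (G/C count between 40% and 50% of the length, no run of four identical letters). `count_good` and `count_good_brute` are hypothetical names, and the brute-force counter exists only to cross-check the dynamic program on short lengths.

```python
from collections import Counter
from itertools import product


def count_good(length, max_run=3, lo=0.4, hi=0.5):
    """Count good words of a given length (length >= 1) without listing them."""
    # State: (last letter is G/C?, length of the current run, G/C count so far).
    states = Counter({(False, 1, 0): 2, (True, 1, 1): 2})
    for _ in range(length - 1):
        nxt = Counter()
        for (is_gc, run, gcs), n in states.items():
            if run < max_run:
                nxt[(is_gc, run + 1, gcs + is_gc)] += n      # repeat the same letter
            nxt[(is_gc, 1, gcs + is_gc)] += n                # other letter, same class
            nxt[(not is_gc, 1, gcs + (not is_gc))] += 2 * n  # two letters of the other class
        states = nxt
    return sum(n for (_, _, gcs), n in states.items()
               if lo * length <= gcs <= hi * length)


def count_good_brute(length, max_run=3, lo=0.4, hi=0.5):
    """Reference count by full enumeration; only practical for short lengths."""
    total = 0
    for letters in product("atgc", repeat=length):
        word = "".join(letters)
        gcs = word.count("g") + word.count("c")
        if not lo * length <= gcs <= hi * length:
            continue
        if any(c * (max_run + 1) in word for c in "atgc"):
            continue
        total += 1
    return total


fraction_kept = count_good(16) / 4 ** 16
print(f"{fraction_kept:.1%} of all length-16 words pass the filter")
```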
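Here is an example of how to use this in parallel across 16 processes:

    from itertools import product
    from multiprocessing.pool import Pool

    alpha = 'atgc'

    def generate_worker(start):
        # Each task receives a 2-letter prefix and enumerates every word that extends it
        start = ''.join(start)
        for s in generate(start, alpha):
            print(s)

    if __name__ == '__main__':
        # The guard is needed on platforms that spawn worker processes (e.g. Windows)
        Pool(16).map(generate_worker, product(alpha, repeat=2))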