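I have to modify the following code to find all 100-mers (stretches of 100 nucleotides) while allowing a mismatch at every third base (nucleotide). Any logic on how to approach this would be greatly appreciated. Thanks!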
from collections import defaultdict

# length of hash key
kmerlen = 30

# hash table for finding hits
lookup = defaultdict(list)

# store sequence hashes in hash table
print("hashing seq1...")
for i in range(len(seq1) - kmerlen + 1):
    key = seq1[i:i+kmerlen]
    lookup[key].append(i)

# look up hashes in hash table
print("hashing seq2...")
hits = []
for i in range(len(seq2) - kmerlen + 1):
    key = seq2[i:i+kmerlen]

    # store hits to hits list
    for hit in lookup.get(key, []):
        hits.append((i, hit))

# hits should be a list of tuples
# [(index1_in_seq2, index1_in_seq1),
#  (index2_in_seq2, index2_in_seq1),
#  ...]
Answer 0 (score: 0)
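I think you just need to add a step term to your slice expressions:
kmerlen = 30
kmerstep = 1 # new variable
# ...
for i in range(len(seq1) - kmerlen + 1):
    key = seq1[i:i+kmerlen:kmerstep] # add step to slice
    lookup[key].append(i)
# ...
for i in range(len(seq2) - kmerlen + 1):
    key = seq2[i:i+kmerlen:kmerstep] # here too
This will work for your tasks 2-4, with the step size adjusted. Task 5 is a bit trickier, since it requires two out of every three items to match. I suggest concatenating two slices:
key = seq1[i:i+kmerlen:3]+seq1[i+1:i+kmerlen:3]
The items will not be in order, but if you slice the other sequence the same way, they should correspond exactly. You may also need to adjust the range of your loop over i.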
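To make the two-slice key concrete, here is a minimal, self-contained sketch of how it could be plugged into the hashing loop from the question. The helper names `spaced_key` and `find_hits` and the toy sequences are assumptions made for illustration; they are not part of the original code.

```python
from collections import defaultdict


def spaced_key(seq, start, kmerlen):
    """Build a key from offsets 0, 3, 6, ... followed by offsets 1, 4, 7, ...
    of the window; offsets 2, 5, 8, ... (every third base) are left out."""
    end = start + kmerlen
    return seq[start:end:3] + seq[start + 1:end:3]


def find_hits(seq1, seq2, kmerlen):
    """Return (index_in_seq2, index_in_seq1) pairs whose windows agree
    everywhere except possibly at every third base."""
    lookup = defaultdict(list)
    for i in range(len(seq1) - kmerlen + 1):
        lookup[spaced_key(seq1, i, kmerlen)].append(i)

    hits = []
    for i in range(len(seq2) - kmerlen + 1):
        for hit in lookup.get(spaced_key(seq2, i, kmerlen), []):
            hits.append((i, hit))
    return hits


# Toy data (hypothetical): the two windows differ only at offsets 2 and 5.
seq1 = "ACGTTA"
seq2 = "ACCTTG"
print(find_hits(seq1, seq2, kmerlen=6))  # prints [(0, 0)]
```

For the 100-mer case in the question, the same call would presumably be made with `kmerlen=100`. In this sketch the loop bounds stay the same as in the original code, because each window still spans `kmerlen` bases even though the key itself is shorter.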