Hashing sequences with constraints

Date: 2015-09-16 02:23:56

Tags: python python-2.7

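I have to modify the following code so that it finds all 100-mers (chunks of 100 nucleotides) while allowing a mismatch at every third base (nucleotide). Any logic on how to approach this would be appreciated. Thanks!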

from collections import defaultdict

# length of hash key
kmerlen = 30

# hash table for finding hits
lookup = defaultdict(list)

# store sequence hashes in hash table
print("hashing seq1...")
for i in xrange(len(seq1) - kmerlen + 1):
    key = seq1[i:i+kmerlen]
    lookup[key].append(i)

# look up hashes in hash table
print("hashing seq2...")
hits = []
for i in xrange(len(seq2) - kmerlen + 1):
    key = seq2[i:i+kmerlen]

    # store hits to hits list
    for hit in lookup.get(key, []):
        hits.append((i, hit))

# hits should be a list of tuples
# [(index1_in_seq2, index1_in_seq1),
#  (index2_in_seq2, index2_in_seq1),
#  ...]

1 answer:

Answer 0 (score: 0)

I think you just need to add a step term to your slice expressions:

kmerlen = 30
kmerstep = 1 # new variable

# ...

for i in xrange(len(seq1) - kmerlen + 1):
    key = seq1[i:i+kmerlen:kmerstep] # add step to slice
    lookup[key].append(i)

# ...

for i in xrange(len(seq2) - kmerlen + 1):
    key = seq2[i:i+kmerlen:kmerstep] # here too

This will work for your tasks 2-4, with adjusted step sizes. Task 5 is a bit trickier, since it requires two out of every three items to match. I'd suggest concatenating two slices:

key = seq1[i:i+kmerlen:3]+seq1[i+1:i+kmerlen:3]

The items in the key are not in their original order, but if you slice the other sequence the same way, the keys should correspond exactly. You may also need to adjust the range of the loop over i.
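
To make the concatenated-slice key concrete, here is a minimal sketch that plugs it into the lookup-table approach from the question. The helper names spaced_key and find_hits, the window parameter, and the two short test sequences are hypothetical and chosen only for illustration. The sketch uses range instead of xrange so that it runs under both Python 2.7 and Python 3. It assumes the tolerated mismatches fall on the third base of each triple, counted from the start of the window.

```python
from collections import defaultdict


def spaced_key(seq, start, window):
    """Build a key from the 1st and 2nd base of every triple in the window.

    The 3rd base of each triple is left out, so it is free to differ.
    """
    end = start + window
    return seq[start:end:3] + seq[start + 1:end:3]


def find_hits(seq1, seq2, window):
    """Return (index_in_seq2, index_in_seq1) pairs whose spaced keys agree."""
    lookup = defaultdict(list)
    for i in range(len(seq1) - window + 1):
        lookup[spaced_key(seq1, i, window)].append(i)

    hits = []
    for j in range(len(seq2) - window + 1):
        for i in lookup.get(spaced_key(seq2, j, window), []):
            hits.append((j, i))
    return hits


# Toy data: seq2 differs from seq1 only at indices 2 and 5,
# i.e. at the third base of the first two triples.
seq1 = "ACGTACGTAC"
seq2 = "ACTTAGGTAC"
hits = find_hits(seq1, seq2, 9)  # [(0, 0)]
```

With a window of 9, the windows starting at index 0 share the key "ATGCAT" even though two bases differ. The windows starting at index 1 do not match, because the differing bases no longer sit on the skipped positions. For the 100-mers in the question, window would be 100.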