Sequence matching with Python

Time: 2015-03-29 20:58:10

Tags: python alignment sequence

I am working on RNA sequence matching.

seq = 'UCAGCUGUCAGUCAUGAUC'
sub_seq =['UGUCAG', 'CAGUCA', 'UCAGCU','GAUC']

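I want to match each sub_seq against seq and print the matched sub_seq underneath seq, using dashes wherever nothing matches. The output should look like this: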

UCAGCUGUCAGUCAUGAUC
UCAGCU--CAGUCA-GAUC
-----UGUCAG--------

I tried to do this with a dictionary:

index_dict = {}
for sub in sub_seq:
    start = seq.find(sub)  # index of the first occurrence, or -1 if absent
    index_dict[start] = {}
    index_dict[start]['sequence'] = sub
    index_dict[start]['end_index'] = start + len(sub) - 1  # inclusive end

I can't figure out an algorithm to do the alignment. Any help would be greatly appreciated!

2 answers:

Answer 0: (score: 4)

seq_l = len(seq)
for ele in sub_seq:
    start = seq.find(ele)
    ln = len(ele)
    if start != -1:
        end = start + ln
        print("-" * start + ele + "-"*(seq_l- end))
    else:
        print("-" * seq_l)

-----UGUCAG--------
--------CAGUCA-----
UCAGCU-------------
---------------GAUC

I'm not sure where UCAGCU--CAGUCA-GAUC comes from, since your code only uses one subsequence at a time.
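If it helps, the loop above could be wrapped in a small helper that returns the padded lines instead of printing them. This is only a sketch: the function name `align_each` and the `gap` parameter are made up here for illustration, and like the loop above it only looks at the first occurrence of each subsequence.

```python
def align_each(seq, sub_seqs, gap="-"):
    """Return one padded line per subsequence, aligned under seq.

    align_each and gap are hypothetical names, not from the question.
    """
    rows = []
    for sub in sub_seqs:
        start = seq.find(sub)  # first occurrence only; -1 if absent
        if start == -1:
            rows.append(gap * len(seq))  # no match: a full line of gaps
        else:
            end = start + len(sub)
            rows.append(gap * start + sub + gap * (len(seq) - end))
    return rows


seq = 'UCAGCUGUCAGUCAUGAUC'
sub_seq = ['UGUCAG', 'CAGUCA', 'UCAGCU', 'GAUC']
for row in align_each(seq, sub_seq):
    print(row)
```

Returning a list makes it easier to post-process the lines, for example to merge non-overlapping ones onto a single row.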

Answer 1: (score: 2)

Assuming you'll let me change your index_dict slightly, consider:

seq = 'UCAGCUGUCAGUCAUGAUC'
sub_seq =['UGUCAG', 'CAGUCA', 'UCAGCU','GAUC']

index_dict = {}
for sub in sub_seq:
    start = seq.find(sub)  # index of the first occurrence in seq
    index_dict[start] = {
        'sequence':  sub,
        'end_index': start + len(sub)   # Note this changed (now exclusive)
    }
sorted_keys = sorted(index_dict)

lines = []
while True:
    if not sorted_keys: break
    line = []
    next_index = 0
    for k in sorted_keys:
        if k >= next_index:
            line.append(k)
            next_index = index_dict[k]['end_index']
    # Remove keys we used, append line to lines
    for k in line: sorted_keys.remove(k)
    lines.append(line)

# Build output lines
olines = []
for line in lines:
    oline = ''
    for k in line:
        oline += '-' * (k - len(oline))     # Add dashes before subseq
        oline += index_dict[k]['sequence']  # Add subsequence
    oline += '-' * (len(seq) - len(oline))  # Add trailing dashes
    olines.append(oline)

print(seq)
print('\n'.join(olines))

Output:

UCAGCUGUCAGUCAUGAUC
UCAGCU--CAGUCA-GAUC
-----UGUCAG--------

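Note that this is quite verbose and could be condensed a bit. The while True and for line in lines loops could probably be merged into one, but it should help explain one possible approach.

Edit: here is one way you could merge the last two loops: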

seq = 'UCAGCUGUCAGUCAUGAUC'
sub_seq =['UGUCAG', 'CAGUCA', 'UCAGCU','GAUC']

index_dict = {}
for sub in sub_seq:
    start = seq.find(sub)  # index of the first occurrence in seq
    index_dict[start] = {
        'sequence':  sub,
        'end_index': start + len(sub)   # Note this changed (now exclusive)
    }
sorted_keys = sorted(index_dict)

lines = []
while True:
    if not sorted_keys: break
    line = ''
    next_index = 0
    keys_used = []
    for k in sorted_keys:
        if k >= next_index:
            line += '-' * (k - len(line))           # Add dashes before subseq
            line += index_dict[k]['sequence']       # Add subsequence
            next_index = index_dict[k]['end_index'] # Update next_index
            keys_used.append(k)                     # Mark key as used
    for k in keys_used: sorted_keys.remove(k)       # Remove used keys
    line += '-' * (len(seq) - len(line))            # Add trailing dashes
    lines.append(line)                              # Add line to lines

print(seq)
print('\n'.join(lines))

Output:

UCAGCUGUCAGUCAUGAUC
UCAGCU--CAGUCA-GAUC
-----UGUCAG--------
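
One caveat with keying the dictionary by start index: a subsequence that does not occur in seq ends up under key -1, and two subsequences that start at the same position overwrite each other. A possible variation, sketched below, keeps the same greedy row-packing idea but stores (start, end, subsequence) tuples in a list and skips non-matches. The name `pack_rows` and the `gap` parameter are made up for this illustration, and it still only considers the first occurrence of each subsequence.

```python
def pack_rows(seq, sub_seqs, gap="-"):
    """Greedily pack matched subsequences into rows without overlaps.

    pack_rows and gap are hypothetical names used only for this sketch.
    """
    # (start, end, sub) for every subsequence found in seq, sorted by start.
    hits = sorted(
        (seq.find(sub), seq.find(sub) + len(sub), sub)
        for sub in sub_seqs
        if sub in seq
    )
    rows = []
    while hits:
        row, leftover, next_free = "", [], 0
        for start, end, sub in hits:
            if start >= next_free:
                # Fits on this row: pad up to start, then add the subsequence.
                row += gap * (start - len(row)) + sub
                next_free = end
            else:
                # Overlaps something already on this row: try the next row.
                leftover.append((start, end, sub))
        rows.append(row + gap * (len(seq) - len(row)))  # trailing gaps
        hits = leftover
    return rows


seq = 'UCAGCUGUCAGUCAUGAUC'
sub_seq = ['UGUCAG', 'CAGUCA', 'UCAGCU', 'GAUC']
print(seq)
print('\n'.join(pack_rows(seq, sub_seq)))
```

Each pass always places at least the first remaining hit, so the loop terminates, and for the sequences in the question it produces the same two rows shown above.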