Python: splitting a string at multiple split points

Date: 2017-03-31 18:33:25

Tags: python split

OK, straight to the point, here is my code:

import re

def digestfragmentwithenzyme(seqs, enzymes):
    fragment = []
    for seq in seqs:
        for enzyme in enzymes:
            results = []
            prog = re.compile(enzyme[0])
            for dingen in prog.finditer(seq):
                results.append(dingen.start() + enzyme[1])
            results.reverse()
            #result = 0
            for result in results:
                fragment.append(seq[result:])
                seq = seq[:result]
            fragment.append(seq[:result])
    fragment.reverse()
    return fragment

The input to this function is a list of several strings (seqs), for example:

List = ["AATTCCGGTCGGGGCTCGGGGG","AAAGCAAAATCAAAAAAGCAAAAAATC"]

and the enzymes as input:

[["TC", 1],["GC",1]]

(Note: several enzymes can be given, but they mostly look like these, built from the letters A, T, C and G.)

The function should return a list which, in this example, contains 2 lists:

Outputlist = [["AATT","CCGGT","CGGGG","CT","CGGGGG"],["AAAG","CAAAAT","CAAAAAAG","CAAAAAAT","C"]]

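Right now I am having trouble splitting it twice and getting the correct output.

Some more information about the function: it looks through each string (seq) for the recognition site, in this case TC or GC, and splits the string at the offset given as the enzyme's second element. It should do this for both strings in the list, with both enzymes.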
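
As a rough illustration of the description above, the sketch below lists where the cuts should fall in one sequence when all enzymes are considered together. The helper name `cut_positions` is hypothetical and not part of the code above; it assumes each enzyme is a `[site, offset]` pair.

```python
import re

def cut_positions(seq, enzymes):
    """Return the sorted cut positions of all enzymes in one sequence."""
    cuts = set()
    for site, offset in enzymes:
        # re.escape keeps the site literal in case it contains regex metacharacters.
        for match in re.finditer(re.escape(site), seq):
            cuts.add(match.start() + offset)
    return sorted(cuts)

print(cut_positions("AATTCCGGTCGGGGCTCGGGGG", [["TC", 1], ["GC", 1]]))
# prints [4, 9, 14, 16]
```

Slicing the sequence between consecutive positions (with 0 and the sequence length as outer bounds) gives the five fragments of the first expected list.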

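6 answers:

Answer 0 (score: 1)

Assuming the idea is to split at each enzyme's recognition site, at the given index within the (multi-letter) site, so that the split in effect falls between two letters, no regex is needed.

You can do this by finding the occurrences, inserting a split marker at the right index, and then post-processing the result to actually split.

For example: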

def digestfragmentwithenzyme(seqs, enzymes):
    # preprocess enzymes once, then apply to each sequence
    replacements = []
    for enzyme in enzymes:
        replacements.append((enzyme[0], enzyme[0][0:enzyme[1]] + '|' + enzyme[0][enzyme[1]:]))
    result = []
    for seq in seqs:
        for r in replacements:
            seq = seq.replace(r[0], r[1])   # So AATTC becomes AATT|C
        result.append(seq.split('|'))       # So AATT|C becomes AATT, C
    return result

def test():
    seqs = ["AATTCCGGTCGGGGCTCGGGGG","AAAGCAAAATCAAAAAAGCAAAAAATC"]
    enzymes = [["TC", 1],["GC",1]]
    print(digestfragmentwithenzyme(seqs, enzymes))
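
One caveat worth checking against your data: `str.replace` only finds non-overlapping occurrences, so overlapping recognition sites are not all cut. The sketch below contrasts marker insertion with a lookahead regex that sees every occurrence. The enzyme `["AA", 1]` and both function names are made up for illustration, and whether overlapping sites should be cut at all is an assumption about the intended behavior.

```python
import re

def digest_replace(seq, site, offset, marker="|"):
    # Marker insertion as in the answer above; assumes marker never occurs in seq.
    return seq.replace(site, site[:offset] + marker + site[offset:]).split(marker)

def digest_overlapping(seq, site, offset):
    # A lookahead matches at every start position, including overlapping ones.
    cuts = [m.start() + offset
            for m in re.finditer("(?=" + re.escape(site) + ")", seq)]
    bounds = [0] + cuts + [len(seq)]
    return [seq[a:b] for a, b in zip(bounds, bounds[1:])]

print(digest_replace("AAAA", "AA", 1))      # ['A', 'AA', 'A']
print(digest_overlapping("AAAA", "AA", 1))  # ['A', 'A', 'A', 'A']
```

For the TC/GC example in the question, the two approaches give the same fragments, since those sites cannot overlap themselves.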

Answer 1 (score: 1)

Here is my solution:

Replace TC with T C and GC with G C (this is done according to the given index), then split on the space character:

def digest(seqs, enzymes):
    res = []
    for li in seqs:
        for en in enzymes: 
            li = li.replace(en[0],en[0][:en[1]]+" " + en[0][en[1]:])
        r = li.split()
        res.append(r)
    return res
seqs = ["AATTCCGGTCGGGGCTCGGGGG","AAAGCAAAATCAAAAAAGCAAAAAATC"]
enzymes = [["TC", 1],["GC",1]]
#enzymes = [["AAT", 2],["GC",1]]
print(seqs)
print(digest(seqs, enzymes))

The result is:

For ([["TC", 1],["GC",1]])

['AATTCCGGTCGGGGCTCGGGGG', 'AAAGCAAAATCAAAAAAGCAAAAAATC']
[['AATT', 'CCGGT', 'CGGGG', 'CT', 'CGGGGG'], ['AAAG', 'CAAAAT', 'CAAAAAAG', 'CAAAAAAT', 'C']]

For ([["AAT", 2],["GC",1]])

['AATTCCGGTCGGGGCTCGGGGG', 'AAAGCAAAATCAAAAAAGCAAAAAATC']
[['AA', 'TTCCGGTCGGGG', 'CTCGGGGG'], ['AAAG', 'CAAAA', 'TCAAAAAAG', 'CAAAAAA', 'TC']]
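
A small difference from splitting on an explicit marker such as `'|'`: `str.split()` with no argument discards empty strings, whereas `split('|')` keeps them. The sketch below shows this with a made-up enzyme `["TC", 0]` that cuts right at the start of the sequence; it is only an illustration and not part of the answers.

```python
seq = "TCAA"
site, offset = "TC", 0  # hypothetical enzyme that cuts before its own site

# Whitespace split: the empty leading fragment silently disappears.
with_space = seq.replace(site, site[:offset] + " " + site[offset:]).split()
# Explicit marker: the empty leading fragment is kept.
with_pipe = seq.replace(site, site[:offset] + "|" + site[offset:]).split("|")

print(with_space)  # ['TCAA']
print(with_pipe)   # ['', 'TCAA']
```

Which behavior is preferable depends on whether empty fragments mean anything in your case.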

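Answer 2 (score: 0)

This is something regular expressions should be used for. In this solution I find all occurrences of the enzyme strings and split using the corresponding indices.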

import re

def digestfragmentwithenzyme(seqs, enzymes):
    out = []
    dic = dict(enzymes) # dictionary of enzyme indices

    for seq in seqs:
        sub = []
        pos1 = 0

        enzstr = '|'.join(enz[0] for enz in enzymes) # "TC|GC" in this case
        for match in re.finditer('('+enzstr+')', seq):
            index = dic[match.group(0)]
            pos2 = match.start()+index
            sub.append(seq[pos1:pos2])
            pos1 = pos2
        sub.append(seq[pos1:])
        out.append(sub)
        # [['AATT', 'CCGGT', 'CGGGG', 'CT', 'CGGGGG'], ['AAAG', 'CAAAAT', 'CAAAAAAG', 'CAAAAAAT', 'C']]
    return out
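
A possible variation on the same idea, offered as a sketch rather than a drop-in replacement: compile the alternation once outside the loop, escape each site with `re.escape`, and try longer sites first so that a site which is a prefix of another does not shadow it. The name `digest_compiled` is hypothetical, and the longest-first ordering is an assumption about what you would want for such sites.

```python
import re

def digest_compiled(seqs, enzymes):
    offsets = dict(enzymes)  # site -> cut offset within the site
    # Longer sites first, so "TCG" would win over "TC" at the same position.
    sites = sorted(offsets, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(site) for site in sites))
    out = []
    for seq in seqs:
        parts, start = [], 0
        for match in pattern.finditer(seq):
            cut = match.start() + offsets[match.group(0)]
            parts.append(seq[start:cut])
            start = cut
        parts.append(seq[start:])  # remainder after the last cut
        out.append(parts)
    return out

seqs = ["AATTCCGGTCGGGGCTCGGGGG", "AAAGCAAAATCAAAAAAGCAAAAAATC"]
print(digest_compiled(seqs, [["TC", 1], ["GC", 1]]))
```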

Answer 3 (score: 0)

Using a regex search with positive lookbehind and lookahead:

import re


def digest_fragment_with_enzyme(sequences, enzymes):
    pattern = '|'.join('((?<={})(?={}))'.format(strs[:ind], strs[ind:]) for strs, ind in enzymes)
    print(pattern)  # prints ((?<=T)(?=C))|((?<=G)(?=C))
    for seq in sequences:
        indices = [0] + [m.start() for m in re.finditer(pattern, seq)] + [len(seq)]
        yield [seq[start: end] for start, end in zip(indices, indices[1:])]

seq = ["AATTCCGGTCGGGGCTCGGGGG", "AAAGCAAAATCAAAAAAGCAAAAAATC"]
enzymes = [["TC", 1], ["GC", 1]]
print(list(digest_fragment_with_enzyme(seq, enzymes)))

Output:

[['AATT', 'CCGGT', 'CGGGG', 'CT', 'CGGGGG'],
 ['AAAG', 'CAAAAT', 'CAAAAAAG', 'CAAAAAAT', 'C']]
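
On Python 3.7 or newer, `re.split` accepts patterns that match the empty string, so the same lookbehind/lookahead pattern can be handed to it directly. This is a sketch under that version assumption; `digest_split` is a hypothetical name, the groups are left non-capturing so that `re.split` does not add them to the result, and each cut index is assumed to lie strictly inside its site.

```python
import re

def digest_split(sequences, enzymes):
    # Alternation of zero-width positions "after the prefix, before the suffix".
    pattern = re.compile("|".join(
        "(?<={})(?={})".format(re.escape(site[:cut]), re.escape(site[cut:]))
        for site, cut in enzymes
    ))
    # Requires Python 3.7+, where re.split splits on empty matches.
    return [pattern.split(seq) for seq in sequences]

seqs = ["AATTCCGGTCGGGGCTCGGGGG", "AAAGCAAAATCAAAAAAGCAAAAAATC"]
print(digest_split(seqs, [["TC", 1], ["GC", 1]]))
```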

Answer 4 (score: 0)

The simplest answer I can think of:

input_list = ["AATTCCGGTCGGGGCTCGGGGG","AAAGCAAAATCAAAAAAGCAAAAAATC"]
enzymes = ['TC', 'GC']
output = []
for string in input_list:
    parts = []
    left = 0
    for right in range(1,len(string)):
        if string[right-1:right+1] in enzymes:
            parts.append(string[left:right])
            left = right
    parts.append(string[left:])
    output.append(parts)
print(output)
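
Note that the loop above relies on every enzyme being two letters long and cutting in the middle. A minimal sketch of how the same scan could honor the `[site, offset]` pairs from the question follows; `digest_scan` is a hypothetical name, and it uses `str.startswith` with a start index to test for a site at a given position.

```python
def digest_scan(seq, enzymes):
    parts, left = [], 0
    for pos in range(1, len(seq)):
        # Cut at pos if some site begins exactly `offset` characters earlier.
        if any(pos >= offset and seq.startswith(site, pos - offset)
               for site, offset in enzymes):
            parts.append(seq[left:pos])
            left = pos
    parts.append(seq[left:])  # remainder after the last cut
    return parts

print(digest_scan("AATTCCGGTCGGGGCTCGGGGG", [["AAT", 2], ["GC", 1]]))
# prints ['AA', 'TTCCGGTCGGGG', 'CTCGGGGG']
```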

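Answer 5 (score: 0)

Throwing my hat into the ring.

  • Use a dict for the patterns instead of a list of lists.
  • Join the patterns, as others have done, to avoid fancy regex.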

import re

sequences = ["AATTCCGGTCGGGGCTCGGGGG","AAAGCAAAATCAAAAAAGCAAAAAATC"]
patterns = { 'TC': 1, 'GC': 1 }

def intervals(patterns, text):
  pattern = '|'.join(patterns.keys())
  start = 0
  for match in re.finditer(pattern, text):
    index = match.start() + patterns.get(match.group())
    yield text[start:index]
    start = index
  yield text[start:]  # remainder; also covers a text with no match at all

print([list(intervals(patterns, s)) for s in sequences])

# [['AATT', 'CCGGT', 'CGGGG', 'CT', 'CGGGGG'], ['AAAG', 'CAAAAT', 'CAAAAAAG', 'CAAAAAAT', 'C']]