Combining strings, extracting substrings

Date: 2013-11-18 00:17:30

Tags: python string

(I am using Python)

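I am working with a large file of RNA sequences that I am trying to reformat for use in a clustering program. My file consists of two kinds of "lines": 1) the accession number of the bacterium, (period) the nucleotide at which the sequence starts, (period) the nucleotide at which it ends; 2) lines of the actual sequence itself (which spans several lines even though it is one continuous sequence):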

  

>A45315.1.1521
GACGAACGCUGGCGGCGUGCCUAAUACAUGCAAGUCGAGCGCAGGAAGCCGGCGGAUCCC
UUCGGGGUGAANCCGGUGGAAUGAGCGGCGGACGGGUGAGUAACACGUGGGCAACCUACC
UUGUAGACUGGGAUAACUCCGGGAAACCGGGGCUAAUACCGGAUGAUCAUUUGGAUCGCAU
GAUCCGAAUGUAAAAGUGGGGAUUUAUCCUCACACUGCAAGAUGGGCCCGCGGCGCA .....
>A93610.15.1301
CCACUGCUAUGGGGGUCCGACUAAGCCAUGCGAGUCAUGGGGUCCCUCUGGGACACCACC
GGCGGACGGCUCAGUAACACGUCGGUAACCUACCCUCGGGAGGGGGAUAACCCCGGGAAA
CUGGGGCUAAUCCCCCAUAGGCCUGAGGUACUGGAAGGUCCUCAGGCCGAAAGGGGCUU ...

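I need to create something that looks at the lines starting with > and goes to the number after the first period (so 1 and 15 above). Counting from that number (1 or 15 in the example above), it needs to extract the nucleotides (As, Cs, Gs, or Us) starting at 69 and going to 497 (note that for this example I took out a bunch of the nucleotides).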

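So, for my attempt, I thought it would make sense to turn the nucleotide sequence into one long string and then try to extract the nucleotides. But I can't seem to get the lines of the RNA sequence joined into one long string (see what I tried below). And once I have the big string, I don't know how to extract the right nucleotides. I need to write something like s = [x:497], where x is 69 minus the start number from the header line (the number after the first period).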
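For what it's worth, the index arithmetic described here can be sketched as follows. It assumes positions are 1-based and inclusive, and that the first character of the joined string is nucleotide number `record_start`; `window` and its parameter names are hypothetical and not from the original post.

```python
def window(sequence, record_start, first=69, last=497):
    """Return the nucleotides numbered first..last (inclusive).

    Assumes the first character of `sequence` is nucleotide number
    `record_start`, so nucleotide p sits at index p - record_start.
    """
    # clamp at 0 so a record that starts after `first` does not
    # produce a negative index (which would count from the end)
    begin = max(first - record_start, 0)
    # +1 because the end of a Python slice is exclusive
    end = max(last - record_start + 1, 0)
    return sequence[begin:end]


seq = "ACGUACGUAC"
print(window(seq, 1, 3, 5))     # numbering starts at 1
print(window(seq, 15, 16, 18))  # numbering starts at 15
```

Note that both ends of the slice shift by the start number, not only the lower one.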

 #!/usr/bin/env python
 #Make a program that takes SSURef_NR99 file of sequences, makes a new file of 
 #Accession numbers and size of 16S.
 import re
 infilename = 'SSUtestdata.txt'
 outfilename = 'SSUtestdata3.txt'

 #Here I'm trying to search for one of the nucleotides, an end of line character and another nucleotide, trying to make a long string.

 replace = re.compile(r'([A|C|G|U])(\n)([A|C|G|U])')

 #remove extra letters and spaces
 with open(infilename, 'r') as infile, open(outfilename, 'w') as outfile:
     for line in infile:
         line = replace.sub(r'\1\3', line)
         #Write to OutFile
         outfile.write(line)

Thanks for any ideas!

2 Answers:

Answer 0: (score: 2)

If I understand your problem, this should do it:

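with open('path/to/input') as infile:
  while True:
    try:
      # header line looks like ">A45315.1.1521": accession, start, end
      line = infile.readline()
      _, start, end = line.strip().split('.')
      start, end = int(start), int(end)
      # skip the first start-1 characters, then one more for each newline in them
      beg = infile.read(start-1)
      infile.read(beg.count('\n'))
      # read end-start characters, plus one more for each newline in them
      seq = infile.read(end-start)
      extra = infile.read(seq.count('\n'))
      seq = seq.replace('\n', '') + extra
      print(seq)  # works in both Python 2 and Python 3
    except ValueError:
      # raised at end of file (an empty line cannot be unpacked)
      # or when a line is not a header of the expected form
      break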
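The header parsing used in this loop can be isolated into a small helper. This is only a minimal sketch, assuming headers have the form `>accession.start.end` as in the question; `parse_header` is a hypothetical name, and `rsplit` is used so that a period inside the accession part would not break the unpacking.

```python
def parse_header(line):
    """Split a header such as '>A45315.1.1521' into (accession, start, end)."""
    # drop the newline, the leading '>' and any space that follows it
    text = line.strip().lstrip('>').strip()
    # split from the right: the last two fields are the start and end numbers
    accession, start, end = text.rsplit('.', 2)
    return accession, int(start), int(end)


print(parse_header(">A45315.1.1521\n"))
```

A line that does not match this shape raises `ValueError`, which a caller can use to tell headers from sequence lines.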

Answer 1: (score: 1)

Maybe something like this, though it is not as elegant as @inspectorG4dget's solution.

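nucStart = 69
nucStop = 497

def write_record(outfile, accession_line, start, nucleotides):
    # make a single string, then write out the accession information
    # and the nucleotides we want
    sequence = ''.join(nucleotides)
    # assumes positions are 1-based and inclusive, and that the first
    # nucleotide of the record is numbered `start`
    outfile.write("%s %s\n" % (accession_line,
                               sequence[nucStart-start:nucStop-start+1]))

with open(infilename) as infile, open(outfilename, 'w') as outfile:
    accession_line = None
    nucleotides = []
    for line in infile:
        line = line.strip()   # drop the newline so it does not end up in the sequence
        if line.startswith(">"):
            # process the previous record if there is one
            if accession_line is not None:
                write_record(outfile, accession_line, start, nucleotides)
                nucleotides = []   # clear it for the next run
            # this is the start of the next sequence
            accession_line = line
            start = int(line.split('.')[1])
        elif line:
            # this is a line containing a partial nucleotide sequence, so add it
            nucleotides.append(line)
    # the last record has no following ">" line, so write it out here
    if accession_line is not None:
        write_record(outfile, accession_line, start, nucleotides)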
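One detail of this pattern is that the last record has no following ">" line to trigger its output, so it has to be emitted after the loop ends. Here is a generator-based sketch of the same idea, written for Python 3; `iter_records` is a hypothetical name, and the example lines are shortened pieces of the sequences in the question.

```python
def iter_records(lines):
    """Yield (header, sequence) pairs from FASTA-style lines."""
    header = None
    parts = []
    for line in lines:
        line = line.strip()
        if line.startswith('>'):
            # a new header closes the record collected so far
            if header is not None:
                yield header, ''.join(parts)
            header = line[1:].strip()
            parts = []
        elif line and header is not None:
            # a partial sequence line belonging to the current record
            parts.append(line)
    # the final record has no following '>' line, so emit it here
    if header is not None:
        yield header, ''.join(parts)


example = [
    ">A45315.1.1521\n",
    "GACGAACG\n",
    "CUGGCGGC\n",
    ">A93610.15.1301\n",
    "CCACUGCU\n",
]
for header, sequence in iter_records(example):
    print(header, len(sequence))
```

Because the function accepts any iterable of lines, an open file object can be passed in directly, and only one record is held in memory at a time.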