Question

(I am using Python.)

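I am working with a large file of RNA sequences that I am trying to reformat for use in a clustering program. My file consists of two kinds of "lines":

1) The accession number of the bacterium, (period) the nucleotide at which the sequence starts, (period) the nucleotide at which it ends.
2) The lines of the actual sequence itself (spread over multiple lines, even though it is one continuous sequence):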

>A45315.1.1521
GACGAACGCUGGCGGCGUGCCUAAUACAUGCAAGUCGAGCGCAGGAAGCCGGCGGAUCCC
UUCGGGGUGAANCCGGUGGAAUGAGCGGCGGACGGGUGAGUAACACGUGGGCAACCUACC
UUGUAGACUGGGAUAACUCCGGGAAACCGGGGCUAAUACCGGAUGAUCAUUUGGAUCGCAU
GAUCCGAAUGUAAAAGUGGGGAUUUAUCCUCACACUGCAAGAUGGGCCCGCGGCGCA ... ..
>A93610.15.1301
CCACUGCUAUGGGGGUCCGACUAAGCCAUGCGAGUCAUGGGGUCCCUCUGGGACACCACC
GGCGGACGGCUCAGUAACACGUCGGUAACCUACCCUCGGGAGGGGGAUAACCCCGGGAAA
CUGGGGCUAAUCCCCCAUAGGCCUGAGGUACUGGAAGGUCCUCAGGCCGAAAGGGGCUU ...
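As a rough illustration of this header format, the three fields can be pulled apart by splitting on the last two periods. This is only a sketch: `parse_header` is a hypothetical helper name, and it assumes every header looks like `>ACCESSION.start.end`.

```python
def parse_header(line):
    """Split a header such as '>A45315.1.1521' into (accession, start, end)."""
    # Drop surrounding whitespace and the leading '>'.
    text = line.strip().lstrip('>').strip()
    # Split from the right so that only the last two periods are used.
    accession, start, end = text.rsplit('.', 2)
    return accession, int(start), int(end)


print(parse_header('>A45315.1.1521'))
print(parse_header('>A93610.15.1301'))
```

Splitting from the right is a defensive choice in case an accession ever contains a period of its own.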

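I need to create something that looks at the lines starting with > and goes to the number after the first period (so 1 and 15 above). Counting from that number (1 or 15 in the example above), it needs to extract the nucleotides (As, Cs, Gs or Us) starting at 69 and going to 497 (note that in this example I took out a bunch of nucleotides).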

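So for my attempt, I thought it made sense to turn the nucleotide sequence into one long string and then try to extract the nucleotides. But I can't seem to get the lines of the RNA sequence into one long string (see what I tried below). And once I have the big string, I don't know how to extract the right nucleotides. I need to write something like s = [x:497], where x is 69 - (the start number from the header, i.e. the one after the first period).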

#!/usr/bin/env python
# Make a program that takes SSURef_NR99 file of sequences, makes a new file of
# accession numbers and size of 16S.
import re
infilename = 'SSUtestdata.txt'
outfilename = 'SSUtestdata3.txt'

# Here I'm trying to search for one of the nucleotides, an end-of-line character
# and another nucleotide, trying to make a long string.

replace = re.compile(r'([A|C|G|U])(\n)([A|C|G|U])')

# remove extra letters and spaces
with open(infilename, 'r') as infile, open(outfilename, 'w') as outfile:
    for line in infile:
        line = replace.sub(r'\1\3', line)

        # Write to OutFile
        outfile.write(line)

Thanks for any ideas!
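A note on why a newline-joining regex finds nothing when it is applied inside a `for line in infile` loop: each line carries at most one newline, at its very end, so there is never a nucleotide on both sides of it. The sketch below applies a similar pattern to the whole text at once instead. It assumes the file fits in memory, uses a made-up two-record sample, and adds `N` to the character class because that ambiguity code appears in the sample data. Inside a character class, `|` is a literal character rather than "or", so it is left out here.

```python
import re

# A newline with a nucleotide on both sides; lookarounds leave the
# neighbouring letters in place so that only the newline is removed.
newline_between_bases = re.compile(r'(?<=[ACGUN])\n(?=[ACGUN])')

text = (">A45315.1.1521\nGACGAACG\nCUGGCGGC\nGUGCCUAA\n"
        ">A93610.15.1301\nCCACUGCU\nAUGGGGGU\n")

# Applied line by line, nothing changes: no line has a letter after its newline.
per_line = ''.join(newline_between_bases.sub('', line)
                   for line in text.splitlines(keepends=True))

# Applied to the whole text, each record collapses to a header plus one line.
joined = newline_between_bases.sub('', text)
print(joined)
```

The headers survive because each one ends in a digit and each new record begins with `>`, neither of which is in the character class.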

Answer 1

If I understand your problem correctly, this should do it:

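with open('path/to/input') as infile:
  while True:
    try:
      # header line looks like ">A45315.1.1521": accession, start, end
      line = infile.readline()
      _, start, end = line.strip().split('.')
      start, end = int(start), int(end)
      # skip the characters before the start position, plus one more
      # character for every newline that was skipped over
      beg = infile.read(start-1)
      infile.read(beg.count('\n'))
      # read the sequence, then top it up by the number of newlines it contained
      seq = infile.read(end-start)
      extra = infile.read(seq.count('\n'))
      seq = seq.replace('\n', '') + extra
      print(seq)
    except ValueError:
      # raised at end of file, or when the line read is not a
      # ">accession.start.end" header
      break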
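An alternative to counting characters with `read()` is to iterate over lines and group them into records, which sidesteps the newline bookkeeping. The following is a minimal sketch under the assumption that the input follows the header-then-sequence layout shown in the question; `read_records` is a hypothetical helper name, and the sample data is made up.

```python
import io


def read_records(lines):
    """Yield (header, sequence) pairs from an iterable of lines."""
    header, parts = None, []
    for line in lines:
        line = line.strip()
        if line.startswith('>'):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, ''.join(parts)
            header, parts = line, []
        elif line and header is not None:
            # A non-empty line after a header is part of the sequence.
            parts.append(line)
    # The last record has no following header, so emit it here.
    if header is not None:
        yield header, ''.join(parts)


sample = io.StringIO(">A45315.1.1521\nGACGAACG\nCUGGCGGC\n"
                     ">A93610.15.1301\nCCACUGCU\nAUGG\n")
for header, sequence in read_records(sample):
    print(header, len(sequence))
```

Because an open file is itself an iterable of lines, the same function can be handed the result of `open(...)` directly.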

Answer 2

Maybe something like this, though it is not as elegant as @inspectorG4dget's solution.

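infilename = 'SSUtestdata.txt'    # file names taken from the question
outfilename = 'SSUtestdata3.txt'
nucStart = 69
nucStop = 497

def write_record(outfile, accession_line, start, nucleotides):
    sequence = ''.join(nucleotides)  # make a single string
    # the first nucleotide is numbered `start`, so nucleotide n is at index n - start
    first = max(nucStart - start, 0)
    last = max(nucStop - start + 1, 0)
    # write out the accession information and the nucleotides we want
    outfile.write("%s %s\n" % (accession_line, sequence[first:last]))

with open(infilename) as infile, open(outfilename, 'w') as outfile:
    accession_line = None
    nucleotides = []
    for line in infile:
        line = line.strip()  # drop the newline so it does not shift the offsets
        if line.startswith(">"):
            # process the previous record, if there is one
            if accession_line is not None:
                write_record(outfile, accession_line, start, nucleotides)
            # this is the start of the next sequence
            accession_line = line
            start = int(line.split('.')[1])
            nucleotides = []   # clear it for the next run
        elif line:
            # this is a line containing a partial nucleotide sequence, so add it
            nucleotides.append(line)
    # the last record has no header after it, so write it out here
    if accession_line is not None:
        write_record(outfile, accession_line, start, nucleotides)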
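To make the index arithmetic concrete: if the first nucleotide of a record is numbered `start`, then nucleotide number n sits at string index n - start, and an inclusive range from 69 to 497 becomes the slice [69 - start : 497 - start + 1]. The sketch below checks this on a toy sequence. `extract_window` is a hypothetical helper, and reading "69 to 497" as inclusive at both ends is an assumption about what the question intends.

```python
def extract_window(sequence, start, first=69, last=497):
    """Return the nucleotides numbered first..last (inclusive).

    `start` is the number assigned to the first character of `sequence`.
    """
    # Clamp at 0 so a record that begins after `first` does not produce
    # a negative index, which Python would count from the end.
    lo = max(first - start, 0)
    hi = max(last - start + 1, 0)
    return sequence[lo:hi]


# Toy record whose first nucleotide is numbered 15, so it covers 15..24.
toy = "ACGUACGUAC"
print(extract_window(toy, start=15, first=17, last=20))  # GUAC
```

With the default bounds the window is 429 nucleotides long whenever the record covers positions 69 to 497, regardless of where its numbering starts.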

Combine strings, extract substring
