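(I am using Python.)
I'm working with a large file of RNA sequences that I'm trying to reformat for use in a clustering program. My file consists of two kinds of "lines": 1) the bacterium's accession number, (period) the nucleotide at which the sequence starts, (period) the nucleotide at which it ends; 2) lines holding the actual sequence itself (spanning multiple lines, even though it is one continuous sequence):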
>A45315.1.1521
GACGAACGCUGGCGGCGUGCCUAAUACAUGCAAGUCGAGCGCAGGAAGCCGGCGGAUCCC
UUCGGGGUGAANCCGGUGGAAUGAGCGGCGGACGGGUGAGUAACACGUGGGCAACCUACC
UUGUAGACUGGGAUAACUCCGGGAAACCGGGGCUAAUACCGGAUGAUCAUUUGGAUCGCAU
GAUCCGAAUGUAAAAGUGGGGAUUUAUCCUCACACUGCAAGAUGGGCCCGCGGCGCA ... ..
>A93610.15.1301
CCACUGCUAUGGGGGUCCGACUAAGCCAUGCGAGUCAUGGGGUCCCUCUGGGACACCACC
GGCGGACGGCUCAGUAACACGUCGGUAACCUACCCUCGGGAGGGGGAUAACCCCGGGAAA
CUGGGGCUAAUCCCCCAUAGGCCUGAGGUACUGGAAGGUCCUCAGGCCGAAAGGGGCUU ...
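I need to write something that looks at each line starting with > and goes to the number after the first period (1 and 15 in the records above). Counting from that number (1 or 15 in the example above), it needs to extract the nucleotides (As, Cs, Gs or Us) starting at 69 and going through 497 (note that I cut a lot of the nucleotides out of this example).
So, for my attempt, I figured it would make sense to turn each nucleotide sequence into one long string and then try to extract the nucleotides from it. But I can't seem to get the lines of an RNA sequence joined into one long string (see what I tried below). And once I have the long string, I don't know how to extract the right nucleotides. I need to write something like s = [x:497], where x is 69 minus the number after the first period in the header.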
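The index arithmetic described above can be sketched as follows. This is only an illustration, assuming the number after the first period is the 1-based coordinate of the record's first nucleotide and that both 69 and 497 are inclusive; the function name `extract_window` is hypothetical.

```python
def extract_window(sequence, start, first=69, last=497):
    """Return the nucleotides at coordinates first..last (inclusive).

    `start` is the coordinate of sequence[0], taken from the header line.
    Assumption: coordinates are 1-based and both ends are inclusive.
    """
    # Coordinate p sits at index p - start. Clamp at 0 so that a record
    # beginning after `first` does not produce a negative (wrap-around) index.
    lo = max(first - start, 0)
    hi = max(last - start + 1, 0)
    return sequence[lo:hi]


# Toy example: a 10-letter record whose first nucleotide is coordinate 15.
print(extract_window("ACGUACGUAC", 15, first=17, last=20))  # GUAC
```

With the real data the call would simply be `extract_window(sequence, start)`, using the default 69 and 497.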
#!/usr/bin/env python
# Make a program that takes SSURef_NR99 file of sequences, makes a new file of
# accession numbers and size of 16S.
import re

infilename = 'SSUtestdata.txt'
outfilename = 'SSUtestdata3.txt'

# Here I'm trying to search for one of the nucleotides, an end of line character
# and another nucleotide, trying to make a long string.
replace = re.compile(r'([A|C|G|U])(\n)([A|C|G|U])')

# remove extra letters and spaces
with open(infilename, 'r') as infile, open(outfilename, 'w') as outfile:
    for line in infile:
        line = replace.sub(r'\1\3', line)
        # Write to OutFile
        outfile.write(line)
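A likely reason this attempt leaves the sequences unjoined: `for line in infile` hands the pattern one line at a time, so each string has its only `\n` at the very end and there is never a nucleotide after it to match. A minimal sketch of one alternative, assuming the file is small enough to hold in memory; the short sample text is made up for illustration, and `N` is included in the character class because it appears in the data above.

```python
import re

# Made-up two-record sample in the same layout as the data above.
text = (
    ">A45315.1.1521\n"
    "GACGAACG\n"
    "CUGGCGGC\n"
    ">A93610.15.1301\n"
    "CCACUGCU\n"
    "AUGGGGGU\n"
)

# Remove a newline only when it sits between two nucleotide letters.
# Lookbehind/lookahead leave the letters themselves unconsumed, so
# consecutive line breaks are all handled. Header lines end in a digit,
# so the newline after a header is kept.
joined = re.sub(r'(?<=[ACGUN])\n(?=[ACGUN])', '', text)
print(joined)
```

Applied to the real file, `text` would come from `infile.read()` rather than from a literal.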
Thanks for any ideas!
Answer 0 (score: 2)
If I understand your question correctly, this should do it:
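with open('path/to/input') as infile:
    while True:
        try:
            # header line looks like >ACCESSION.START.END
            line = infile.readline()
            _, start, end = line.strip().split('.')
            start, end = int(start), int(end)
            # skip the first start-1 characters, plus one more per newline skipped
            beg = infile.read(start-1)
            infile.read(beg.count('\n'))
            # read end-start characters, then top up for the newlines among them
            seq = infile.read(end-start)
            extra = infile.read(seq.count('\n'))
            seq = seq.replace('\n', '') + extra
            print(seq)
        except ValueError:
            # raised at end of file, or when the line is not a header
            break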
Answer 1 (score: 1)
Maybe something like this, although it's not as elegant as @inspectorG4dget's solution.
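nucStart = 69
nucStop = 497

with open(infilename) as infile, open(outfilename, 'w') as outfile:
    accession_line = None
    nucleotides = []
    for line in infile:
        if line.startswith(">"):
            # process the previous record if there is one
            if accession_line is not None:
                sequence = ''.join(nucleotides)  # make a single string
                # write out the accession information and the nucleotides we want;
                # coordinate p in the header's numbering sits at index p - start
                outfile.write("%s %s\n" % (accession_line,
                                           sequence[nucStart-start:nucStop-start+1]))
            nucleotides = []  # clear it for the next record
            # this is the start of the next sequence
            accession_line = line.strip()
            start = int(line.split('.')[1])
        else:
            # a line containing part of the sequence: drop the newline and keep it
            nucleotides.append(line.strip())
    # the last record has no following '>' line, so write it out here
    if accession_line is not None:
        sequence = ''.join(nucleotides)
        outfile.write("%s %s\n" % (accession_line,
                                   sequence[nucStart-start:nucStop-start+1]))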
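The same idea can also be arranged as a pair of generators, which keeps the "last record" handling in one place. This is a sketch under stated assumptions, not a definitive implementation: headers are assumed to look like `>ACCESSION.START.END`, coordinates are assumed 1-based and inclusive, and the names `read_records` and `extract_region`, as well as the accession numbers in the toy input, are made up for illustration.

```python
import io


def read_records(lines):
    """Yield (header, start, sequence) for each '>' record in `lines`."""
    header, start, parts = None, None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, start, "".join(parts)
            header = line
            # Assumed header layout: >ACCESSION.START.END
            start = int(line.split(".")[1])
            parts = []
        elif header is not None:
            parts.append(line)
    if header is not None:
        # The final record has no following header to trigger the yield.
        yield header, start, "".join(parts)


def extract_region(lines, first=69, last=497):
    """Yield (header, nucleotides at coordinates first..last inclusive)."""
    for header, start, seq in read_records(lines):
        # Coordinate p sits at index p - start; clamp to avoid negative indices.
        lo = max(first - start, 0)
        hi = max(last - start + 1, 0)
        yield header, seq[lo:hi]


# Toy input with short sequences and a small window so the result is easy to check.
sample = io.StringIO(
    ">X00001.1.12\nACGUAC\nGUACGU\n"
    ">X00002.3.10\nGGCCAAUU\n"
)
results = list(extract_region(sample, first=4, last=7))
for header, region in results:
    print(header, region)
```

For the real file, an open file object can be passed in place of `sample`, with the default window of 69 to 497.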