I've just started using a new program that writes its output in an odd format, as shown below:
CRISPR 10 Range: 7784249 - 7784543
POSITION REPEAT SPACER
-------- -------------------------------- ---------------------------------
7784249 GTTTCAATCCACGCCCCCGCATGGGGGGCGAC GTTAAGATTTTCAGCCGAAGCATAAGACTGCTCA [ 32, 34 ]
7784315 GTTTCAATCCACGCCCCCGCATGGGGGGCGAC ATCAATAACAATACCTTGCTTTTCAGTTTCATT [ 32, 33 ]
7784380 GTTTCAATCCACGCCCCCGCATGGGGGGCGAC TATAACTTTCTCCTTCTATTGTTGATGTAACATA [ 32, 34 ]
7784446 GTTTCAATCCACGCCCCCGCATGGGGGGCGAC TTTTCATTTGCATCAAGTTCTTTTTCAAGGTCAA [ 32, 34 ]
7784512 GTTTCAATCCACGCCCCCG>CONTIG-97480
-------- -------------------------------- ---------------------------------
Repeats: 5 Average Length: 32 Average Length: 33
CRISPR 11 Range: 8822044 - 8822520
POSITION REPEAT SPACER
-------- ------------------------------------- ------------------------------------
8822044 GTGTCAATGCCCTATATCGGGCGCACTTCATTTCTAC TTTACCAATCTCGGCTCTTTACTCCCGCTGGGTGCATT [ 37, 38 ]
8822119 GTGTCAATGCCCTATATCGGGCGCACTTCATTTCTAC TTAAAGCAGATACAAAGAAGCCTTGTGAGGAATATT [ 37, 36 ]
8822192 GTGTCAATGCCCTATATCGGGCGCACTTCATTTCTAC TATACTTCAGAAGTGCTGAGTTCCAGAAGCTTTTT [ 37, 35 ]
8822264 GTGTCAATGCCCTATATCGGGCGCACTTCATTTCTAC AAATATATGATTAATAATAAGAATAATCAAATAGTA [ 37, 36 ]
8822337 GTGTCAATGCCCTATATCGGGCGCACTTCATTTCTAC TTTCGTGGTTCCATCTGCTTATGAAACATTATTGATCT [ 37, 38 ]
8822412 GTGTCAATGCCCTATATCGGGCGCACTTCATTTCTAC GGATGAGGCTGGTACATATACGTACCTGGTTCTTC [ 37, 35 ]
8822484 GTGTCAATGCCCTATATCGGGCGCACTTCAT>CONTI
-------- ------------------------------------- ------------------------------------
Repeats: 7 Average Length: 37 Average Length: 36
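I'd like to know how to select only the strings in the third column and print them to a new file. I'd also like to give each one a header for the next program to use. For example, the first row of 'CRISPR 11' would produce: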
>CRISPR_11_8822044_8822520_1
TTTACCAATCTCGGCTCTTTACTCCCGCTGGGTGCATT
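The name is made up of '>', then the CRISPR number, then the range values, and finally the spacer's position within its group; here it is 1 because it is the first one in this set.
I know how to write to a file, but not how to select the relevant parts of the file.
Any help would be great.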
Answer 0 (score: 0)
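The answer, as NewWorld's comment says, is just a bunch of regular expressions, splits, and list comprehensions. I'm sure there are cleaner and more concise ways to do this, but this is how I'd do it in your position.
This code runs through the input file line by line.
I saved your text to a file, example.txt, and then ran it.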
import re

def write_group(path, crispno, rangestart, rangeend, spacers):
    # Append one "header spacer" line per spacer, numbered from 1 within the group
    with open(path, 'a') as outputfile:
        for x, base in enumerate(spacers, start=1):
            outputfile.write('>CRISPR_' + crispno + '_' + rangestart + '_' + rangeend + '_' + str(x) + ' ' + base + '\n')

secondbases = []
with open('/example.txt', 'r') as contents:
    for line in contents:
        splitline = line.split()
        if line.startswith('CRISPR'):  # New CRISPR block: write out the previous block, then save the new CRISPR values
            if secondbases:
                write_group('/output.txt', crispno, rangestart, rangeend, secondbases)
                secondbases = []
            (crispno, rangestart, rangeend) = (splitline[1], splitline[3], splitline[5])
        elif re.match('[0-9]', line):  # Data line: copy the second base string (the spacer) to the list
            bases = [a for a in splitline if re.search('[GTCA]{5,}', a)]
            if len(bases) > 1:  # Truncated rows with no spacer are skipped
                secondbases.append(bases[1])

# Write the last block once the whole file has been read,
# whether or not the file ends with a newline
if secondbases:
    write_group('/output.txt', crispno, rangestart, rangeend, secondbases)
This produced a file containing:
>CRISPR_10_7784249_7784543_1 GTTAAGATTTTCAGCCGAAGCATAAGACTGCTCA
>CRISPR_10_7784249_7784543_2 ATCAATAACAATACCTTGCTTTTCAGTTTCATT
>CRISPR_10_7784249_7784543_3 TATAACTTTCTCCTTCTATTGTTGATGTAACATA
>CRISPR_10_7784249_7784543_4 TTTTCATTTGCATCAAGTTCTTTTTCAAGGTCAA
>CRISPR_11_8822044_8822520_1 TTTACCAATCTCGGCTCTTTACTCCCGCTGGGTGCATT
>CRISPR_11_8822044_8822520_2 TTAAAGCAGATACAAAGAAGCCTTGTGAGGAATATT
>CRISPR_11_8822044_8822520_3 TATACTTCAGAAGTGCTGAGTTCCAGAAGCTTTTT
>CRISPR_11_8822044_8822520_4 AAATATATGATTAATAATAAGAATAATCAAATAGTA
>CRISPR_11_8822044_8822520_5 TTTCGTGGTTCCATCTGCTTATGAAACATTATTGATCT
>CRISPR_11_8822044_8822520_6 GGATGAGGCTGGTACATATACGTACCTGGTTCTTC
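Because the file is opened in append mode ('a'), output.txt (or whatever name you change it to) is created if it doesn't already exist. Delete or empty it before rerunning, otherwise the new records are appended after the old ones.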
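As a further sketch, the same extraction can put each spacer on its own line beneath its header, which is the layout shown in the question. It assumes every block header looks like "CRISPR 10 Range: 7784249 - 7784543" and every complete data row is position, repeat, spacer, then the bracketed lengths, as in the sample. The names `HEADER`, `ROW`, `iter_spacers` and `write_fasta` are made up for this illustration and are not part of the code above.

```python
import re

# Assumed header layout, taken from the sample: "CRISPR 10 Range: 7784249 - 7784543"
HEADER = re.compile(r"\s*CRISPR\s+(\d+)\s+Range:\s+(\d+)\s+-\s+(\d+)")
# Assumed row layout: position, repeat, spacer, then the "[ 32, 34 ]" lengths
ROW = re.compile(r"\s*(\d+)\s+([ACGT]+)\s+([ACGT]+)\s+\[")


def iter_spacers(lines):
    """Yield (header, spacer) pairs; numbering restarts at 1 for each CRISPR block."""
    prefix = None
    count = 0
    for line in lines:
        header = HEADER.match(line)
        if header:
            prefix = ">CRISPR_{}_{}_{}".format(*header.groups())
            count = 0
            continue
        row = ROW.match(line)
        # Rows without a spacer (such as the truncated last row) do not match ROW
        if row and prefix is not None:
            count += 1
            yield "{}_{}".format(prefix, count), row.group(3)


def write_fasta(in_path, out_path):
    """Write each spacer as a header line followed by a sequence line."""
    # Mode "w" creates or overwrites out_path, so reruns do not accumulate records
    with open(in_path) as src, open(out_path, "w") as dst:
        for header, spacer in iter_spacers(src):
            dst.write(header + "\n" + spacer + "\n")
```

Calling `write_fasta("example.txt", "spacers.fasta")` would then write the records in that two-line form; the output file name here is only a placeholder.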