我有2个文件,一个是fasta
文件,另一个是fastq
文件。我想取fasta
,读取每一行并搜索fastq
文件中的每一行并打印顶行和底行。这就是我所拥有的
fasta文件
READ1
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAC
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAG
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAGA
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAGG
for seq in `cat sequences`;do grep -A 2 -B 1 $seq FAP.1.txt;done
@ DH1DQQN1:269:C1UKCACXX:1:1107:20386:6577 1:N:0:TTAGGC
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAC
+
CCCFFFFFHGHHHJIJHFDDDB173 @ 8815BDDB ###############
@ DH1DQQN1:269:C1UKCACXX:1:1114:5718:53821 1:N:0:TTAGGC
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
+ ;?@ DBD< @@ FFHHH<
@ DH1DQQN1:269:C1UKCACXX:1:1209:10703:35361 1:N:0:TTAGGC
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
+
@@@ FFFFFHGHHHGIJHFDDDDDBDD69 @ 6B-707537BDDDB75 @@ 85
@ DH1DQQN1:269:C1UKCACXX:1:1210:18926:75163 1:N:0:TTAGGC
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAG
@ CCFFFFFHHHHHJJJHFDDD @ 77BDDDDB077007 @乙###########
从这里我们可以看到AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
出现了两次,但我想只打印一次。我怎么能这样做?
最终输出文件
@ DH1DQQN1:269:C1UKCACXX:1:1107:20386:6577 1:N:0:TTAGGC
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAC
+
CCCFFFFFHGHHHJIJHFDDDB173 @ 8815BDDB ###############
@ DH1DQQN1:269:C1UKCACXX:1:1114:5718:53821 1:N:0:TTAGGC
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
+
; @ DBD< @@ FFHHH<
@ DH1DQQN1:269:C1UKCACXX:1:1210:18926:75163 1:N:0:TTAGGC
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAG
+
@ CCFFFFFHHHHHJJJHFDDD @ 77BDDDDB077007 @乙
答案 0 :(得分:0)
所以听起来你想要一个只有唯一序列的fastq?
这是一种非常低效的方法,但应该可行。它将您的fastq文件存储为列表,所以希望它不会太大。它只会抛出重复的序列,而不是质量分数或任何东西。
fastqFile = list(open(fastq))
out = []
output = open('output.fastq', 'at')
for lineNum, line in enumerate(fastqFile):
if lineNum < 4:
out.append(line)
output.write(line)
else:
if line not in out and lineNum % 4 != 3:
output.write(fastqFile[lineNum - 1])
output.write(line)
output.write(fastqFile[lineNum + 1])
output.write(fastqFile[lineNum + 2])
out.append(fastqFile[lineNum - 1])
out.append(line)
out.append(fastqFile[lineNum + 1])
out.append(fastqFile[lineNum + 2])
答案 1 :(得分:0)
我想我知道你想说的是什么,所以这是我的代码。它只会按照要求首次出现fasta序列。这可能不是最好的方法,但我是python的新手。
# open the file into a list
fasta = open('fasta1.fa', 'r').read().splitlines()
fastq = open('fastq1.fq', 'r').read().splitlines()
# get rid of headers
# if headers important, please indicate in example
fastaseq = [s for s in fasta if not any('>' in t for t in s)]
# get rid of whitespace
fastaseq = filter(None, fastaseq)
fastq = filter(None, fastq)
# new list
newfastq = []
# go through each item in your fasta list
# if it matches, get the line above and below
# put in the new list
for fa in fastaseq:
if fa in fastq:
ind = fastq.index(fa)
printblock = fastq[ind-1:ind+2]
elif fa not in fastq:
printblock = []
if printblock:
newfastq.append(printblock)
# print everything to file
with open('fastq2.fq', 'w') as f:
for block in newfastq:
for item in block:
f.writelines(item + '\n')
答案 2 :(得分:0)
除非您有充分理由自己这样做,否则请使用Biopython。
FASTA:
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAC
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAG
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAGA
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAGG
fastq(基于您的但不完全相同,因为您的输出格式错误):
@DH1DQQN1:269:C1UKCACXX:1:1107:20386:6577 1:N:0:TTAGGC
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAC
+
CCCFFFFFHGHHHJIJHFDDDB173@8815BDDB###############
@DH1DQQN1:269:C1UKCACXX:1:1114:5718:53821 1:N:0:TTAGGC
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
+
CCCFFFFFHGHHHJIJHFDDDB173@8815BDDB###############
@DH1DQQN1:269:C1UKCACXX:1:1209:10703:35361 1:N:0:TTAGGC
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
+
@@@FFFFFHGHHHGIJHFDDDDDBDD69@6B-707537BDDDB75@@85
@DH1DQQN1:269:C1UKCACXX:1:1210:18926:75163 1:N:0:TTAGGC
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAG
+
@CCFFFFFHHHHHJJJHFDDD@77BDDDDB077007@B###########
代码:
from Bio import SeqIO
with open("fasta") as fh:
fasta = fh.read().splitlines()
seen = set()
for record in SeqIO.parse(open('fastq'), 'fastq'):
seq = str(record.seq)
if seq in fasta and seq not in seen:
seen.add(seq)
print record.format('fastq')
编辑:以上按照fastq文件的顺序打印记录,而不是fasta文件。如果订单不重要,您应该使用该方法。否则,您可以将记录添加到字典中,其中键是FASTA文件中的索引,并最后将它们全部打印出来,对字典进行排序:
from Bio import SeqIO
import sys
with open("fasta") as fh:
fasta = fh.read().splitlines()
seen = set()
records = {}
for record in SeqIO.parse(open('fastq'), 'fastq'):
seq = str(record.seq)
if seq in fasta and seq not in seen:
seen.add(seq)
records[fasta.index(seq)] = record
for record in sorted(records):
sys.stdout.write(records[record].format('fastq'))
(此处我还使用sys.stdout.write
代替print
,以避免额外换行。)