Getting the records that differ between two fastq files

Time: 2015-06-19 16:14:44

Tags: python bioinformatics biopython

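I have two fastq files, F1.fastq and F2.fastq. F2.fastq is the smaller file and contains a subset of the reads in F1.fastq. I want the reads in F1.fastq that are not in F2.fastq. The following Python code does not seem to work. Can you suggest edits?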

needed_reads = []
reads_array = []
chosen_array = []

for x in Bio.SeqIO.parse("F1.fastq","fastq"):
    reads_array.append(x)

for y in Bio.SeqIO.parse("F2.fastq","fastq"):
    chosen_array.append(y)

for y in chosen_array:
    for x in reads_array:
        if str(x.seq) != str(y.seq):
            needed_reads.append(x)

output_handle = open("DIFF.fastq","w")
SeqIO.write(needed_reads,output_handle,"fastq")
output_handle.close()

1 answer:

Answer 0 (score: 2)

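You can do this with sets: convert list1 to a set, convert list2 to a set, and compute set(list1) - set(list2), which gives the items in list1 that are not in list2. For Biopython records, build the sets from a hashable key such as the read ID or the sequence string, because a set of SeqRecord objects does not compare the records by content.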
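As a small illustration of why the nested loop in the question over-collects and how a set avoids that, here is a sketch on plain strings. The names `all_reads` and `subset_reads` and their values are made up for the example and stand in for read IDs.

```python
# Toy stand-ins for read IDs; these values are invented for illustration.
all_reads = ["r1", "r2", "r3", "r4"]
subset_reads = ["r2", "r4"]

# The nested-loop pattern from the question: each read is appended once for
# every subset entry it differs from, so the result contains duplicates and
# still includes the reads that should have been excluded.
nested = []
for y in subset_reads:
    for x in all_reads:
        if x != y:
            nested.append(x)

# Set difference keeps exactly the items of the first set that are
# missing from the second.
difference = set(all_reads) - set(subset_reads)

# A set-membership filter selects the same items and preserves input order.
subset_lookup = set(subset_reads)
ordered = [x for x in all_reads if x not in subset_lookup]
```

The set difference loses the original order, so the membership filter is usually the more convenient form when the output file should keep the read order of F1.fastq.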

Example code -

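from Bio import SeqIO

# SeqRecord objects are not compared by content in a set, so the set holds
# read IDs instead. This assumes reads in F2.fastq keep the IDs they have in
# F1.fastq; use str(record.seq) as the key to compare by sequence instead.
chosen_ids = set()
for y in SeqIO.parse("F2.fastq", "fastq"):
    chosen_ids.add(y.id)

# Keep the reads from F1.fastq whose ID does not appear in F2.fastq
needed_reads = []
for x in SeqIO.parse("F1.fastq", "fastq"):
    if x.id not in chosen_ids:
        needed_reads.append(x)

with open("DIFF.fastq", "w") as output_handle:
    SeqIO.write(needed_reads, output_handle, "fastq")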
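If F1.fastq is too large to hold in memory, the same set-membership idea can be applied while streaming the file. The following is a minimal sketch in plain Python rather than Biopython. It assumes every record spans exactly four lines (no wrapped sequences) and that the read ID is the first whitespace-delimited token after the leading `@`. The helper names `read_id`, `fastq_records` and `write_difference` are hypothetical and not part of any library.

```python
from itertools import islice


def read_id(header_line):
    """Return the read ID: the first token after the leading '@'."""
    return header_line[1:].split()[0]


def fastq_records(handle):
    """Yield FASTQ records as lists of 4 lines (assumes unwrapped records)."""
    while True:
        record = list(islice(handle, 4))
        if len(record) < 4:
            return
        yield record


def write_difference(full_path, subset_path, out_path):
    """Write the records of full_path whose ID is absent from subset_path.

    Returns the number of records written.
    """
    # Only the IDs of the smaller file are kept in memory.
    with open(subset_path) as subset:
        chosen_ids = {read_id(rec[0]) for rec in fastq_records(subset)}

    kept = 0
    with open(full_path) as full, open(out_path, "w") as out:
        for rec in fastq_records(full):
            if read_id(rec[0]) not in chosen_ids:
                out.writelines(rec)
                kept += 1
    return kept
```

With the files from the question, a call would look like `write_difference("F1.fastq", "F2.fastq", "DIFF.fastq")`. Only the IDs from the smaller file are stored, and records from the larger file are written out one at a time in their original order.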