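I am using Python 2.6.6 and I am trying to remove the reads in file1 (a FASTQ file) that overlap with (i.e. are identical to) reads in file2. This is the code I am trying to implement: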
from Bio import SeqIO

ref_reads = SeqIO.index("file1.fastq", "fastq")
spk_reads = SeqIO.index("file2.fastq", "fastq")
for spk in spk_reads:
    if spk in ref_reads:
        del ref_reads[spk]
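However, I get an error related to the use of del:
AttributeError: _IndexedSeqFileDict instance has no attribute '__delitem__'
Is it possible to delete items with the current approach? How can I delete items from a dictionary created with SeqIO.index()?
I also tried the following: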
# import read data
ref_reads = SeqIO.index("main.fastq", "fastq")
spk_reads = SeqIO.index("over.fastq", "fastq")
# note that ref_reads.keys() doesn't return a list but a 'dictionary-keyiterator',
# so we turn it into a set to work with it
ref_keys = set(ref_reads.keys())
spk_keys = set(spk_reads.keys())
# loop to remove overlap reads
for spk in spk_keys:
    if spk in ref_keys:
        del ref_keys[spk]
# output data
output_handle = open(fname_out, "w")
SeqIO.write(ref_reads[ref_keys], output_handle, "fastq")
output_handle.close()
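Answer 0 (score: 1)
SeqIO.index() does not return a true dictionary, but a dictionary-like object, giving the SeqRecord objects as values:
Note that this pseudo dictionary will not support all the methods of a true Python dictionary, for example values() is not defined since this would require loading all of the records into memory at once.
This dictionary-like object is an _IndexedSeqFileDict instance. Its docstring mentions:
Note that this dictionary is essentially read only. You cannot add or change values, pop values, nor clear the dictionary.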
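To illustrate why del fails on such an object, here is a minimal sketch. It uses a hypothetical ReadOnlyIndex class as a stand-in for the object returned by SeqIO.index(); it is not Biopython's actual implementation, and the read IDs and sequences are made up. The class supports lookup, membership tests and iteration but defines no __delitem__, so deletion raises an error, while building a new filtered dictionary works.

```python
class ReadOnlyIndex(object):
    """Hypothetical stand-in for a read-only, dictionary-like index."""

    def __init__(self, records):
        self._records = dict(records)

    def __getitem__(self, key):
        return self._records[key]

    def __contains__(self, key):
        return key in self._records

    def __iter__(self):
        return iter(self._records)

    # Deliberately no __setitem__ or __delitem__: the object is read-only.


ref_index = ReadOnlyIndex({"read1": "ACGT", "read2": "TTGA", "read3": "GGCC"})

try:
    del ref_index["read2"]
    deletion_failed = False
except (TypeError, AttributeError):
    # Item deletion is not supported by this object.
    deletion_failed = True

# Instead of deleting, build a new plain dict without the unwanted keys.
unwanted = set(["read2"])
filtered = dict((key, ref_index[key]) for key in ref_index if key not in unwanted)
```

The exact exception type depends on the Python version and on how the class is defined, which is why the sketch catches both TypeError and AttributeError.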
Therefore, you need to use SeqIO.parse() and SeqIO.to_dict() to turn the FASTQ file into an in-memory Python dictionary:
from Bio import SeqIO

ref_reads = SeqIO.parse("file1.fastq", "fastq")
spk_reads = SeqIO.parse("file2.fastq", "fastq")
# load all reference reads into a real (mutable) dictionary keyed by read ID
ref_reads_dict = SeqIO.to_dict(ref_reads)
for spk in spk_reads:
    if spk.id in ref_reads_dict:
        del ref_reads_dict[spk.id]
If your files are too large to be loaded into memory with SeqIO.parse() and SeqIO.to_dict(), I would do something like this instead:
from Bio import SeqIO

ref_reads = SeqIO.index("file1.fastq", "fastq")
spk_reads = SeqIO.index("file2.fastq", "fastq")
# note that ref_reads.keys() doesn't return a list but a 'dictionary-keyiterator',
# so we turn it into a set to work with it
ref_keys = set(ref_reads.keys())
spk_keys = set(spk_reads.keys())
unique_ref_keys = ref_keys - spk_keys
# this step might take a long time if your files are large, and it loads
# every remaining record into memory
# (dict() with a generator is used because Python 2.6 has no dict comprehensions)
unique_ref_reads = dict((key, ref_reads[key]) for key in unique_ref_keys)
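One detail of the set-based approach is that a set has no defined order, so the remaining reads may come out in a different order than in file1.fastq. The sketch below uses plain lists of made-up read IDs as stand-ins for the keys of the two files (the data is hypothetical) and compares the set difference with an order-preserving filter.

```python
# Stand-ins for the read IDs of file1.fastq and file2.fastq (hypothetical data).
ref_ids = ["read1", "read2", "read3", "read4"]
spk_ids = ["read4", "read2", "read9"]

spk_keys = set(spk_ids)

# Set difference: correct membership, but the result has no defined order.
unique_ref_keys = set(ref_ids) - spk_keys

# Order-preserving variant: walk the reference IDs in their original order
# and keep those that are absent from the other file.
ordered_unique_keys = [key for key in ref_ids if key not in spk_keys]
```

If the order of the output does not matter, the plain set difference is enough.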
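Edit, in answer to your comment:
How can I solve the original problem of deleting items from SeqIO.index("file1.fastq", "fastq")?
As I stated above, SeqIO.index("file1.fastq", "fastq") returns a read-only _IndexedSeqFileDict object. So, by design, you cannot delete items from it.
The updated code below shows how to create a new FASTQ file with the overlapping reads removed.
If you really want a new SeqIO.index() object, you can read this new file in again with SeqIO.index().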
from Bio import SeqIO

fname_out = "file1_unique.fastq"  # hypothetical output file name

ref_reads = SeqIO.index("file1.fastq", "fastq")
spk_reads = SeqIO.index("file2.fastq", "fastq")
ref_keys = set(ref_reads.keys())
spk_keys = set(spk_reads.keys())
unique_ref_keys = ref_keys - spk_keys
# conserve memory by using a generator expression
unique_ref_records = (ref_reads[key] for key in unique_ref_keys)
# output new file with overlapping reads removed
with open(fname_out, "w") as output_handle:
    SeqIO.write(unique_ref_records, output_handle, "fastq")
# optionally, create a new SeqIO.index() object
unique_ref_reads = SeqIO.index(fname_out, "fastq")
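For comparison, the same filtering can be sketched with the standard library only, streaming file1 record by record. This is not part of Biopython: the helper names read_ids and filter_fastq are hypothetical, and the sketch assumes every FASTQ record occupies exactly four lines (no wrapped sequence or quality lines). As in the code above, reads are matched by ID, not by sequence.

```python
def read_ids(path):
    """Collect the read IDs (first word after '@') from a 4-line FASTQ file."""
    ids = set()
    with open(path) as handle:
        for line_number, line in enumerate(handle):
            # every fourth line, starting with the first, is a record header
            if line_number % 4 == 0 and line.strip():
                ids.add(line[1:].split()[0])
    return ids


def filter_fastq(ref_path, spk_path, out_path):
    """Write records from ref_path whose ID is not present in spk_path.

    Returns the number of records written.
    """
    spk_ids = read_ids(spk_path)
    kept = 0
    with open(ref_path) as ref_handle:
        with open(out_path, "w") as out_handle:
            while True:
                # read one 4-line record: header, sequence, '+', qualities
                record = [ref_handle.readline() for _ in range(4)]
                if not record[0].strip():
                    break  # end of file (or trailing blank line)
                if record[0][1:].split()[0] not in spk_ids:
                    out_handle.writelines(record)
                    kept += 1
    return kept
```

A call such as filter_fastq("file1.fastq", "file2.fastq", fname_out) would keep only the IDs of file2 in memory and preserve the record order of file1. Biopython's parser remains the safer choice if the files may contain multi-line records.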