Question

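I am using Python 2.6.6 and I am trying to remove reads from file1 (a FASTQ file) that overlap with (i.e. are identical to) reads in file2. Here is the code I am trying to implement: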

ref_reads = SeqIO.index("file1.fastq", "fastq")
spk_reads = SeqIO.index("file2.fastq", "fastq")

for spk in spk_reads:
    if spk in ref_reads:
        del ref_reads[spk]

However, I get an error related to the use of del:

AttributeError: _IndexedSeqFileDict instance has no attribute '__delitem__'

Is it possible to delete items using the present formulation? How can I delete items from a dictionary generated with SeqIO.index()?

I also tried the following:

# import read data
ref_reads = SeqIO.index("main.fastq", "fastq")
spk_reads = SeqIO.index("over.fastq", "fastq")

# note that ref_reads.keys() doesn't return a list but a 'dictionary-keyiterator',
# so we turn it into a set to work with it
ref_keys = set(ref_reads.keys())  
spk_keys = set(spk_reads.keys())

# loop to remove overlap reads
for spk in spk_keys:
    if spk in ref_keys:
        del ref_keys[spk]

# output data
output_handle = open(fname_out, "w")
SeqIO.write(ref_reads[ref_keys], output_handle, "fastq")
output_handle.close()

Answer 1

SeqIO.index() does not return a true dictionary, but a dictionary-like object that gives SeqRecord objects as values:

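Note that this pseudo dictionary will not support all the methods of a true Python dictionary, for example values() is not defined since this would require loading all of the records into memory at once.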

This dictionary-like object is an _IndexedSeqFileDict instance. Its docstring mentions:

Note that this dictionary is essentially read only. You cannot add or change values, pop values, nor clear the dictionary.
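To make the "essentially read only" behavior concrete, here is a minimal sketch built on a toy stand-in class. `ReadOnlyIndex` is a hypothetical name and is not Biopython's `_IndexedSeqFileDict`; the example assumes Python 3 and uses only the standard library. In Python 3 the failed deletion surfaces as a `TypeError` rather than the `AttributeError` shown in the question, but the cause is the same: no `__delitem__` is defined.

```python
from collections.abc import Mapping


class ReadOnlyIndex(Mapping):
    """Toy read-only mapping; a hypothetical stand-in for an indexed file."""

    def __init__(self, records):
        self._records = dict(records)

    def __getitem__(self, key):
        return self._records[key]

    def __iter__(self):
        return iter(self._records)

    def __len__(self):
        return len(self._records)


ref_reads = ReadOnlyIndex({"read1": "ACGT", "read2": "GGCC", "read3": "TTAA"})
spk_reads = ReadOnlyIndex({"read2": "GGCC"})

# Deleting in place is not supported: the class defines no __delitem__.
try:
    del ref_reads["read2"]
    deletion_failed = False
except TypeError:
    deletion_failed = True

# Instead, compute the keys to keep and build a new, ordinary dict from them.
unique_keys = set(ref_reads) - set(spk_reads)
unique_reads = dict((key, ref_reads[key]) for key in unique_keys)
```

The original mapping is left untouched, so the filtered result has to live in a new collection.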

Therefore, you will need to convert your fastq files to an in-memory Python dictionary, using SeqIO.parse() and SeqIO.to_dict():

from Bio import SeqIO

ref_reads = SeqIO.parse("file1.fastq", "fastq")
spk_reads = SeqIO.parse("file2.fastq", "fastq")

ref_reads_dict = SeqIO.to_dict(ref_reads)

for spk in spk_reads:
    if spk.id in ref_reads_dict:
        del ref_reads_dict[spk.id]

If your files are too large to be loaded into memory with SeqIO.parse() and SeqIO.to_dict(), then I would do something like this:

from Bio import SeqIO

ref_reads = SeqIO.index("file1.fastq", "fastq")
spk_reads = SeqIO.index("file2.fastq", "fastq")

# note that ref_reads.keys() doesn't return a list but a 'dictionary-keyiterator', 
# so we turn it into a set to work with it
ref_keys = set(ref_reads.keys())  
spk_keys = set(spk_reads.keys())

unique_ref_keys = ref_keys - spk_keys

# this step might take a long time if your files are large
# (dict() with a generator is used because dict comprehensions need Python 2.7+)
unique_ref_reads = dict((key, ref_reads[key]) for key in unique_ref_keys)

Edit, in answer to your comment:

How can I, again, solve the original problem of deleting items from SeqIO.index("file1.fastq", "fastq")?

As I stated above, SeqIO.index("file1.fastq", "fastq") returns a read-only _IndexedSeqFileDict object. So you cannot, by design, delete items from it.

The updated code below shows how to create a new fastq file with the overlapping reads removed.

If you really want a new SeqIO.index() object, you can read this new file in again with SeqIO.index().

from Bio import SeqIO

ref_reads = SeqIO.index("file1.fastq", "fastq")
spk_reads = SeqIO.index("file2.fastq", "fastq")

ref_keys = set(ref_reads.keys())  
spk_keys = set(spk_reads.keys())

unique_ref_keys = ref_keys - spk_keys

# conserve memory by using a generator expression
unique_ref_records = (ref_reads[key] for key in unique_ref_keys)

# output new file with overlapping reads removed
# (fname_out must be set to the path of the output file)
with open(fname_out, "w") as output_handle:
    SeqIO.write(unique_ref_records, output_handle, "fastq")

# optionally, create a new SeqIO.index() object 
unique_ref_reads = SeqIO.index(fname_out, "fastq")
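If the index object itself is not needed, the same filtering can be sketched with the standard library alone. This is an alternative technique, not what the code above does. It assumes Python 3, strictly four-line FASTQ records (no wrapped sequence lines), and that the read ID is the first whitespace-delimited token after the leading `@`. The names `iter_records` and `filter_fastq` and the demo file names are hypothetical.

```python
import os
import tempfile


def iter_records(handle):
    """Yield (read_id, lines) for each four-line FASTQ record in handle."""
    while True:
        lines = [handle.readline() for _ in range(4)]
        if not lines[0].strip():  # end of file (or a trailing blank line)
            return
        yield lines[0][1:].split()[0], lines


def filter_fastq(ref_path, spk_path, out_path):
    """Write the records of ref_path whose ID does not occur in spk_path.

    Returns the number of records written.
    """
    with open(spk_path) as spk:
        skip_ids = set(read_id for read_id, _ in iter_records(spk))

    written = 0
    with open(ref_path) as ref, open(out_path, "w") as out:
        for read_id, lines in iter_records(ref):
            if read_id not in skip_ids:
                out.writelines(lines)
                written += 1
    return written


# Small demonstration with throwaway files.
workdir = tempfile.mkdtemp()
ref_path = os.path.join(workdir, "ref.fastq")
spk_path = os.path.join(workdir, "spk.fastq")
out_path = os.path.join(workdir, "out.fastq")

with open(ref_path, "w") as handle:
    handle.write("@r1\nACGT\n+\nIIII\n@r2 extra\nGGCC\n+\nIIII\n@r3\nTTAA\n+\nIIII\n")
with open(spk_path, "w") as handle:
    handle.write("@r2\nGGCC\n+\nIIII\n")

written = filter_fastq(ref_path, spk_path, out_path)
```

Only the IDs of the second file are held in memory, and the output keeps the record order of the first file, whereas iterating over a set of keys does not guarantee any particular order.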
