Question

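I am currently working with a large number of BLASTn analyses, and I have the output in tabular format (option -outfmt 6) with more than 170 thousand lines. This link, https://textuploader.com/11krp, contains an example with just the information I need: query, subject, subject start, subject end and score (in that order).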
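For reference, here is a minimal sketch of pulling those five fields straight from a tabular BLAST file. It assumes the default -outfmt 6 column order and that "score" means the bit score; the `read_hits` helper and the sample line are invented for illustration.

```python
import csv
import io

# Default column order of BLAST tabular output (-outfmt 6).
COLUMNS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
           "qstart", "qend", "sstart", "send", "evalue", "bitscore"]

def read_hits(handle):
    """Yield (query, subject, s_start, s_end, score) from tab-separated BLAST output."""
    for row in csv.reader(handle, delimiter="\t"):
        if not row:  # skip blank lines
            continue
        rec = dict(zip(COLUMNS, row))
        yield (rec["qseqid"], rec["sseqid"],
               int(rec["sstart"]), int(rec["send"]), float(rec["bitscore"]))

# Invented sample line; only the two subject coordinates come from the post.
sample = io.StringIO(
    "q1\tsubj\t98.5\t751\t10\t1\t1\t751\t922442\t923192\t0.0\t1300\n"
)
hits = list(read_hits(sample))
print(hits)
```

With a real file, `open('results.tsv', newline='')` (a hypothetical file name) would take the place of the `io.StringIO` object.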

As we can see, different queries can match the same subject, at the same positions or at different ones.

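In the next steps I will use the start and end positions to extract those regions of the subject, but if I do the extraction with this information as it is, I will recover many redundant sequences.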

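As I see it, there are 4 cases of redundant matches:

1 - Same region of the subject = same s_start and same s_end; different scores.

e.g. lines 29, 33, 37 and 43

2 - Almost the same region of the subject (1) = different s_start, same s_end; different scores.

e.g. line 26 (s_start = 928719) and lines 18, 30, 34, 38 (s_start = 928718)

3 - Almost the same region of the subject (2) = same s_start, different s_end; different scores.

e.g. lines 18, 30, 34, 38 (s_end = 929459) and line 44 (s_end = 929456)

4 - Same region with different lengths = different s_start and s_end, but covering the same region of the subject; different scores.

e.g. line 17 (s_start = 922442, s_end = 923192) and lines 29, 33, 37, 43 (s_start = 922444, s_end = 923190)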
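One possible reading is that all four cases describe two hits on the same subject whose start and end each differ by only a few bases. The sketch below tests exactly that. The `same_region` name, the `tol=10` default and the "subj" placeholder are assumptions; the coordinates are the ones quoted in the examples above.

```python
def same_region(hit_a, hit_b, tol=10):
    """Return True if two (subject, s_start, s_end) hits cover nearly the same region."""
    subj_a, start_a, end_a = hit_a
    subj_b, start_b, end_b = hit_b
    if subj_a != subj_b:
        return False
    # Both ends must agree to within `tol` bases.
    return abs(start_a - start_b) <= tol and abs(end_a - end_b) <= tol

case1 = same_region(("subj", 922444, 923190), ("subj", 922444, 923190))  # same start, same end
case2 = same_region(("subj", 928719, 929459), ("subj", 928718, 929459))  # start differs
case3 = same_region(("subj", 928718, 929459), ("subj", 928718, 929456))  # end differs
case4 = same_region(("subj", 922442, 923192), ("subj", 922444, 923190))  # both differ
other = same_region(("subj", 922442, 923192), ("subj", 928718, 929459))  # distinct regions
print(case1, case2, case3, case4, other)  # True True True True False
```

Under this reading the four cases collapse into a single comparison, so the filtering step would not need a separate branch for each of them.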

So... I have a little experience with Python and wrote the following script:

import csv

subject_dict = {}    # dictionary to store the original information
subject_dict_2 = {}  # dictionary to store the filtered information

# step 1: read the file line by line and build a dictionary of subject information
with open('blast_test.csv', newline='') as csv_file:
    csv_reader = csv.reader(csv_file, delimiter=',')
    for row in csv_reader:
        print(row)
        # assign each column to a variable: query, subject, s_start, s_end, score
        query, subject_old, s_start, s_end, score = row[:5]
        # modified subject name: subject_start_end
        subject_new = subject_old + '_' + s_start + '_' + s_end
        # note: hits with the same subject, start and end share a key,
        # so only the last one read is kept
        # values are still strings; convert with int()/float() before comparing
        subject_dict[subject_new] = [subject_old, query, s_start, s_end, score]

# testing the dictionary
for k, v in subject_dict.items():
    print(k, ':', v)

# step 2: making comparisons (this is the part I do not know how to write)
for k, v in subject_dict.items():
    # TODO: skip hits that are redundant under the four cases above;
    # for now every hit is copied unchanged
    subject_dict_2[k] = v

# step 3: creating an output
with open('blast_test_filtred.csv', mode='w', newline='') as csv_file:
    writer = csv.writer(csv_file, delimiter=',')
    for subject, (subject_old, query, s_start, s_end, score) in subject_dict_2.items():
        writer.writerow([subject, s_start, s_end, score, query])

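My logic is:

1 - Create a dictionary with all the information, changing the subject name (to make the output easier for me to read);

2 - Remove the redundant information using the criteria of the four cases above;

3 - Write the new information to an output file.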

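To remove the redundant information, I thought of setting a threshold of 10 nucleotides upstream and downstream of each region (start and end), then comparing regions that share the original subject name (subject_old) and keeping the one with the best score (in a way that still recovers all the distinct regions).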
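The threshold idea could be sketched as a greedy filter: visit the hits from best to worst score and keep a hit only if no hit already kept on the same subject has both its start and its end within the threshold. This is one possible approach, not a definitive one. The `filter_redundant` name is hypothetical, the queries and scores in the example are invented, and only the coordinates come from the post.

```python
from collections import defaultdict

def filter_redundant(hits, tol=10):
    """Keep the best-scoring hit of each group of near-identical subject regions.

    hits: iterable of (query, subject, s_start, s_end, score) with numeric fields.
    Returns the kept hits, best score first.
    """
    kept_regions = defaultdict(list)  # subject -> [(low, high), ...] of kept hits
    kept = []
    for hit in sorted(hits, key=lambda h: h[4], reverse=True):
        query, subject, s_start, s_end, score = hit
        # BLAST reports minus-strand hits with s_start > s_end; this sketch
        # ignores strand and compares the lower and upper coordinates (an assumption).
        low, high = min(s_start, s_end), max(s_start, s_end)
        redundant = any(abs(low - k_low) <= tol and abs(high - k_high) <= tol
                        for k_low, k_high in kept_regions[subject])
        if not redundant:
            kept_regions[subject].append((low, high))
            kept.append(hit)
    return kept

# Invented queries and scores; coordinates taken from the examples in the post.
hits = [
    ("q1", "subj", 922442, 923192, 1300.0),
    ("q2", "subj", 922444, 923190, 1350.0),
    ("q3", "subj", 928718, 929459, 1200.0),
    ("q4", "subj", 928719, 929459, 1100.0),
    ("q5", "subj", 928718, 929456, 1250.0),
]
kept = filter_redundant(hits)
for hit in kept:
    print(hit)
```

Because "within 10 nucleotides" is not transitive, the result can depend on the order in which hits are visited. Sorting by score first means the best hit of each group is the one that survives. Each hit is compared only against hits already kept on the same subject, which should stay manageable for a file of this size unless one subject carries a very large number of distinct regions.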

Can anyone explain to me how to carry out the steps above?

Thanks.

Dealing with redundant information in BLAST output
