Question

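I have a BLAST outfmt 6 output file in the standard format. I want to loop through the file, select each hit, find its reciprocal hit, and decide which of the two is the best hit to store.

For example: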

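d = {}  # will hold the best hit of each reciprocal pair
# input_file is assumed to be an open handle on the BLAST outfmt 6 file
lines = [line.rstrip('\n') for line in input_file]
for line in lines:
    term = line.split('\t')
    qseqid = term[0]
    sseqid = term[1]
    hit = qseqid, sseqid
    recip_hit = sseqid, qseqid
    # second pass over the stored lines, since a file handle can only be consumed once
    for other in lines:
        if tuple(other.split('\t')[:2]) == recip_hit:
            pass  # compare both lines here and store the better one in d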

Example input (tab-separated):

Seq1    Seq2    80    1000   10    3   1    1000    100    1100    0.0    500
Seq2    Seq1    95    1000   10    3   100    1100    1    1000    1e-100    500

Can anyone offer any insight into how to tackle this problem efficiently?

Many thanks in advance.

Answer 1

You could approach the problem by finding these pairs and comparing the lines like this:

# create a dictionary to store pairs
line_dict = {}
# iterate over your file
with open("test.txt", "r") as infile:
    for line in infile:
        # strip only the line break, so a last line without one is not truncated
        line = line.rstrip("\n").split("\t")
        # ignore the line if there is not at least one value apart from the two sequence IDs
        if len(line) < 3:
            continue
        # identify the two sequences
        seq = tuple(line[0:2])
        # is the reverse pair already in the dictionary?
        if seq[::-1] in line_dict:
            # append the new line to the existing entry
            line_dict[seq[::-1]].append(line)
        else:
            # create a new entry
            line_dict[seq] = [line]

# remove entries for which no counterpart exists
pairs = {k: v for k, v in line_dict.items() if len(v) > 1}

# and do things with these pairs
for pair, hits in pairs.items():
    print(pair, "found in:")
    for item in hits:
        print(item)

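The advantage is that you only have to iterate over the file once, because you store all the data and discard an entry only if no matching reverse pair is found. The disadvantage is that this takes up memory, so for very large files this approach might not be feasible.

A similar approach, which also keeps all the data in working memory, uses pandas. This should be faster, because the sorting algorithms in pandas are optimized. Another advantage of pandas is that all the other values are already available in pandas columns, so further analysis is easier. I definitely prefer the pandas version, but I don't know whether it is installed on your system. To keep things simple, I assigned a and b to the columns that contain the sequences Seq1 and Seq2.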
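As a rough sketch of what "do things with these pairs" could look like, the snippet below groups hits by unordered ID pair and keeps the stronger row of each reciprocal pair. It assumes the default outfmt 6 column order (e-value in column 11, bit score in column 12). The helper names `reciprocal_pairs` and `best_row`, the third sample row, and the choice of bit score with e-value as a tie-breaker are illustrative assumptions, not something the question prescribes.

```python
import io

EVALUE = 10    # 11th column in the default outfmt 6 layout (assumption)
BITSCORE = 11  # 12th column in the default outfmt 6 layout (assumption)


def reciprocal_pairs(handle):
    """Group rows by unordered ID pair; keep pairs seen in both directions."""
    groups = {}
    for raw in handle:
        fields = raw.rstrip("\n").split("\t")
        if len(fields) < 12:
            continue  # skip blank or incomplete lines
        groups.setdefault(frozenset(fields[:2]), []).append(fields)
    # a pair is reciprocal only if both (A, B) and (B, A) occur
    return {
        key: rows
        for key, rows in groups.items()
        if len({(r[0], r[1]) for r in rows}) > 1
    }


def best_row(rows):
    """Pick the row with the highest bit score; lower e-value breaks ties."""
    return max(rows, key=lambda r: (float(r[BITSCORE]), -float(r[EVALUE])))


# The first two rows are the example from the question; the third is made up
# to show that an unpaired hit is dropped.
sample = io.StringIO(
    "\t".join(["Seq1", "Seq2", "80", "1000", "10", "3", "1", "1000", "100", "1100", "0.0", "500"]) + "\n"
    + "\t".join(["Seq2", "Seq1", "95", "1000", "10", "3", "100", "1100", "1", "1000", "1e-100", "500"]) + "\n"
    + "\t".join(["Seq3", "Seq1", "70", "900", "50", "5", "1", "900", "1", "900", "1e-50", "300"]) + "\n"
)

pairs = reciprocal_pairs(sample)
for key, rows in pairs.items():
    print(sorted(key), "best:", best_row(rows))
```

Using a `frozenset` as the key means both directions land in the same entry without checking the reversed tuple, and self-hits are never reported as pairs.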

import pandas as pd
# read the data into a dataframe
# BLAST outfmt 6 has no header line, so header=None; name the twelve columns a to l
df = pd.read_csv("test.txt", sep='\t', names=list("abcdefghijkl"), header=None)

# create a column that maps Seq1 - Seq2 and Seq2 - Seq1 to the same key "Seq1 Seq2"
# (assumes the IDs contain no spaces, so the joined key is unambiguous)
df["pairs"] = df.apply(lambda row: ' '.join(sorted([row["a"], row["b"]])), axis=1)
# remove rows with no matching pair and sort the dataframe by pair
only_pairs = df[df["pairs"].duplicated(keep=False)].sort_values(by="pairs")

print(only_pairs)
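Since the other values are already in columns, one possible follow-up is to reduce the table to reciprocal best hits. The sketch below uses one common definition (A's top hit is B and B's top hit is A, ranked by bit score); that definition, the descriptive column names from the default outfmt 6 layout, and the made-up third sample row are assumptions for illustration only.

```python
import io

import pandas as pd

# default outfmt 6 column names (assumed; the answer above uses single letters)
COLUMNS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
           "qstart", "qend", "sstart", "send", "evalue", "bitscore"]

# The first two rows are the example from the question; the third is made up.
sample = io.StringIO(
    "Seq1\tSeq2\t80\t1000\t10\t3\t1\t1000\t100\t1100\t0.0\t500\n"
    "Seq2\tSeq1\t95\t1000\t10\t3\t100\t1100\t1\t1000\t1e-100\t500\n"
    "Seq3\tSeq1\t70\t900\t50\t5\t1\t900\t1\t900\t1e-50\t300\n"
)
df = pd.read_csv(sample, sep="\t", names=COLUMNS, header=None)

# keep only the highest-scoring hit for every query sequence
best = (
    df.sort_values("bitscore", ascending=False, kind="mergesort")
      .drop_duplicates("qseqid")
)

# join each best hit A -> B with the best hit B -> A, if there is one
rbh = best.merge(
    best,
    left_on=["qseqid", "sseqid"],
    right_on=["sseqid", "qseqid"],
    suffixes=("", "_rev"),
)

# every pair now appears once per direction; keep one row per pair
rbh = rbh[rbh["qseqid"] < rbh["sseqid"]]

print(rbh[["qseqid", "sseqid", "bitscore", "bitscore_rev"]])
```

Columns with the `_rev` suffix hold the values of the reverse hit, so both directions can be compared side by side in a single row.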

Finding the best reciprocal hit in a single BLAST file using Python
