Question

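I have a file of peptide sequences, peptides.txt, that I want to match against my protein database, human_proteins.fasta. For each peptide I want to find the protein sequences that contain it and get the protein ID from the line above each match. Some peptides match more than one entry in the protein database.

Ultimately, I want to produce a table/data frame like this: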

Peptide	No. of matches	Sequence ID	Protein Sequence
AAAAA	2	ENST0001	AAAAABCFMED
AAAAA	2	ENST0002	AAAAAXXX

The first few lines of my hypothetical protein database, human_proteins.fasta, look like this:

>ENST0001
AAAAABCFMED
>ENST0002
AAAAAXXX
>ENST0003
MGRVSGLVPSR

peptides.txt looks like this:

AAAAA
LSSPATLNSR
HETLTSLNLEK
GGGGNFGPGPGSNFR
VSEQGLIEILK
DFLAGGIAAAISK

I am using the following command in bash:

while read -r line; do printf "$line"; grep -B 1 "$line" ../databases/human_proteins.fasta; done < peptides.txt

and I am able to get output like this:

>ENST0001
AAAAABCFMED
>ENST0002
AAAAAXXX

However, I can't work out how to turn this output into a table. Is there a good way to do this in unix/bash?

Answer 1

How about doing it in Python:

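import pandas as pd

seq = {}      # protein ID -> protein sequence
matches = {}  # protein ID -> number of peptides found in that protein
hits = []     # (peptide, protein ID) pairs, one per match

# Read the FASTA file, assuming each record is exactly two lines:
# a ">ID" header followed by the whole sequence on a single line.
with open("human_proteins.fasta") as f:
    while True:
        seq_id = f.readline().rstrip().lstrip(">")
        if not seq_id:
            break
        protein = f.readline().rstrip()
        seq[seq_id] = protein
        matches[seq_id] = 0

with open("peptides.txt") as f:
    for line in f:
        pep = line.strip()
        if not pep:
            continue  # an empty string would "match" every protein
        for seq_id, protein in seq.items():
            # Plain substring test: finds the peptide anywhere in the
            # protein, not only at its start.
            if pep in protein:
                hits.append((pep, seq_id))
                matches[seq_id] += 1

# One row per (peptide, protein) match.
result = [[pep, matches[seq_id], seq_id, seq[seq_id]] for pep, seq_id in hits]

df = pd.DataFrame(
    result,
    columns=["Peptide", "No. of matches", "Sequence ID", "Protein Sequence"],
)

print(df)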

Output for the provided files:

  Peptide  No. of matches Sequence ID Protein Sequence
0   AAAAA               1    ENST0001      AAAAABCFMED
1   AAAAA               1    ENST0002         AAAAAXXX

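Note that I assumed the No. of matches column is the count for the same Protein Sequence, i.e. how many peptides were found in that protein. If it should instead be the count for the same Peptide (in which case the values above would be 2), please let me know.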
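To make the difference between the two readings concrete, here is a minimal sketch that counts both ways. It is only an illustration: the records are hard-coded from the question's sample table rather than read from the files, and the names `proteins`, `peptides`, `per_protein` and `per_peptide` are made up for this example.

```python
from collections import Counter

# Sample records hard-coded from the question (illustration only).
proteins = {
    "ENST0001": "AAAAABCFMED",
    "ENST0002": "AAAAAXXX",
    "ENST0003": "MGRVSGLVPSR",
}
peptides = ["AAAAA", "LSSPATLNSR"]

# Every (peptide, protein ID) pair where the peptide occurs in the protein.
hits = [
    (pep, pid)
    for pep in peptides
    for pid, protein in proteins.items()
    if pep in protein
]

# Reading 1: how many peptides were found in each protein.
per_protein = Counter(pid for _, pid in hits)
# Reading 2: how many proteins each peptide was found in.
per_peptide = Counter(pep for pep, _ in hits)

print(dict(per_protein))
print(dict(per_peptide))
```

With this sample, each matched protein has a per-protein count of 1, while AAAAA has a per-peptide count of 2, which is the value shown in the question's desired table.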

[Edit]
If the No. of matches column refers to the number of fasta entries in which the peptide is found, here is an alternative:

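import pandas as pd

seq = {}      # protein ID -> protein sequence
matches = {}  # peptide -> number of proteins that contain it
hits = []     # (peptide, protein ID) pairs, one per match

# Read the FASTA file, assuming each record is exactly two lines:
# a ">ID" header followed by the whole sequence on a single line.
with open("human_proteins.fasta") as f:
    while True:
        seq_id = f.readline().rstrip().lstrip(">")
        if not seq_id:
            break
        protein = f.readline().rstrip()
        seq[seq_id] = protein

with open("peptides.txt") as f:
    for line in f:
        pep = line.strip()
        if not pep or pep in matches:
            continue  # skip blank lines and peptides already processed
        matches[pep] = 0
        for seq_id, protein in seq.items():
            # Plain substring test: finds the peptide anywhere in the
            # protein, not only at its start.
            if pep in protein:
                hits.append((pep, seq_id))
                matches[pep] += 1

# One row per (peptide, protein) match, with the per-peptide count.
result = [[pep, matches[pep], seq_id, seq[seq_id]] for pep, seq_id in hits]

df = pd.DataFrame(
    result,
    columns=["Peptide", "No. of matches", "Sequence ID", "Protein Sequence"],
)

print(df)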
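One caveat: the file-reading loop above expects every FASTA record to be exactly two lines. Real FASTA files often wrap a sequence over several lines and put a description after the ID. If your database looks like that, a reader along these lines could replace the first `with` block. This is a sketch under that assumption; `read_fasta` is a hypothetical helper name, not something from a library.

```python
import io


def read_fasta(handle):
    """Return {record ID: sequence}, joining sequences wrapped over several lines."""
    records = {}
    current_id = None
    for line in handle:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # Use the first whitespace-delimited token of the header as the ID.
            fields = line[1:].split()
            current_id = fields[0] if fields else ""
            records[current_id] = []
        elif current_id is not None:
            records[current_id].append(line)
    return {rid: "".join(parts) for rid, parts in records.items()}


# Small in-memory demo; the wrapped record and description are made up.
demo = io.StringIO(
    ">ENST0001 some description\nAAAAAB\nCFMED\n>ENST0002\nAAAAAXXX\n"
)
print(read_fasta(demo))
```

With a real file you would call it as `with open("human_proteins.fasta") as f: seq = read_fasta(f)` and leave the rest of the script unchanged.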
