Question

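I have a DataFrame of flavor peptides and their amino acid sequences, and I am trying to count the occurrences of each amino acid and store the counts in a new DataFrame. To start with, I am using a very small DataFrame of only 5 rows. In my actual DataFrame the sequences can be longer than 1 character; for example, if the string is 'RPFFLR', I want it to count 2 * F, 1 * L, 1 * P and 2 * R.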

This is the initial DataFrame:

      ID               Name Sequence
0      1  bitter amino acid        R
3      4  bitter amino acid        P
6      7  bitter amino acid        F
36   172  bitter amino acid        L
438  105  bitter amino acid        V

I have the following code:

def countAA(Bseq, ref):
    countF = [0]
    countL = [0]
    countP = [0]
    countR = [0]
    countV = [0]
    Bseq = Baadata.Sequence
    ref = 'ADEFGHIKLMPQRSTVWY'
    for i in Bseq:
        for c in ref:
            if ref[4] in Bseq:
                countF += 1
            elif ref[9] in Bseq:
                countL += 1
            elif ref[11] in Bseq:
                countP += 1
            elif ref[13] in Bseq:
                countR += 1
            elif ref[16] in Bseq:
                countV += 1
    return [countF, countL, countP, countR, countV]

Bseq = Baadata.Sequence
for i in Bseq:
    ref = 'ADEFGHIKLMPQRSTVWY'
    Baa = countAA(Bseq, ref)

Bdf = pd.DataFrame((Baa),
                   index=['F', 'L', 'P', 'R', 'V'],
                   columns=['Bitter']
                   )
print(Bdf)

For this small input, the expected output would be:

     Bitter
F    1
L    1
P    1
R    1
V    1

My code does not count the characters. What am I doing wrong?

Answer 1

Are you sure a DataFrame is the best structure for the output? If you only want to count each character in the Sequence column, you can do that easily with Counter:

from collections import Counter
Bdf = Counter("".join(Baadata.Sequence))

Example

import pandas as pd

Baadata = pd.DataFrame(["asd", "fdf", "s", "xxxxxxx"], columns=['Sequence'])
Counter("".join(Baadata.Sequence))

Output

Counter({'a': 1, 's': 2, 'd': 2, 'f': 2, 'x': 7})
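If the result does need to be a DataFrame shaped like the expected output in the question, the Counter can be converted afterwards. The following is only a minimal sketch: the helper name count_residues is hypothetical, the column label 'Bitter' is taken from the question, and the sample rows (the five single residues plus 'RPFFLR') are made up for illustration.

```python
from collections import Counter

import pandas as pd


def count_residues(sequences, label):
    """Count every character across all sequences (hypothetical helper).

    Returns a one-column DataFrame indexed by residue letter.
    """
    counts = Counter("".join(sequences))
    # A Counter is a dict subclass, so pandas can build a Series from it.
    return pd.Series(counts, name=label).sort_index().to_frame()


# Illustrative data: the five rows from the question plus 'RPFFLR'.
Baadata = pd.DataFrame({"Sequence": ["R", "P", "F", "L", "V", "RPFFLR"]})
Bdf = count_residues(Baadata["Sequence"], "Bitter")
print(Bdf)
```

With these sample sequences the counts come out as F 3, L 2, P 2, R 3 and V 1. Because the sequences are joined into one string before counting, multi-character peptides are handled the same way as single residues.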

Answer 2

Maybe this will work:

1) First group the data by 'Name' and 'Sequence' (I am assuming you only have a few distinct sequences)
df = df.groupby(['Name', 'Sequence']).count().reset_index()

2) Then pivot the table to get the result you want
df.pivot(index='Sequence', columns='Name', values='ID')
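Note that the two steps above count whole Sequence values, so they only give per-residue counts while every sequence is a single character. For longer peptides such as 'RPFFLR', one possible variation is to split each sequence into one row per residue before grouping. The sketch below assumes pandas 0.25 or later (for explode); the 'Residue' column, the extra row with ID 999 and the name 'bitter peptide' are made up for illustration.

```python
import pandas as pd

# Rows from the question, plus one hypothetical multi-character peptide.
df = pd.DataFrame({
    "ID": [1, 4, 7, 172, 105, 999],
    "Name": ["bitter amino acid"] * 5 + ["bitter peptide"],
    "Sequence": ["R", "P", "F", "L", "V", "RPFFLR"],
})

# Turn each sequence into a list of characters, then one row per character.
residues = df.assign(Residue=df["Sequence"].apply(list)).explode("Residue")

# Step 1: group by Name and Residue and count the rows in each group.
grouped = residues.groupby(["Name", "Residue"])["ID"].count().reset_index()

# Step 2: pivot so residues are rows and names are columns;
# combinations that never occur become 0 instead of NaN.
counts = (
    grouped.pivot(index="Residue", columns="Name", values="ID")
    .fillna(0)
    .astype(int)
)
print(counts)
```

Here the 'bitter peptide' column holds the counts for 'RPFFLR' (2 for F and R, 1 for L and P, 0 for V), while 'bitter amino acid' holds 1 for each of the five residues.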

Python: How to output function results to a DataFrame
