Question

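I am trying to print, for the user, the most common amino acid in a protein sequence (for example, if the user enters AHEHD, the most common AA is H, which appears 2 times).

Currently I use Counter and most_common(), which works in most cases: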

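from collections import Counter

sequence=input("\n" + "\033[1;34;40mHello, and Welcome! Please enter your sequence:").upper()
AA_count=Counter(sequence)  # maps each residue to its number of occurrences
AA_mostfrequent=AA_count.most_common(1)  # list holding one (residue, count) pair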

When printing:

for key, value in AA_mostfrequent:
    print("\n", key, "\033[1;35;40mis the most common amino acid in your sequence, appearing", value, "time(s)!", sep=" ")

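However, suppose I have a sequence in which several AAs occur with the same frequency (for example, ADEH or AAAAADEEEEE).

In that case the program picks just one of the tied AAs to print (for example, with ADEH it only says that D is the most common in my sequence, appearing 1 time).

I do not know in advance how many times each AA occurs in a given sequence. The rule is that I can supply any valid protein sequence, of any length, as long as the program states which amino acid is the most common.

***For finding the frequency of each AA: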

AA_total=len(sequence)
for key, value in sorted(AA_count.items()):
    print(key, value/AA_total, sep=":")

Answer 1

Once you have the count of the most common AA, you can loop over the counter and pick out every AA that has that count:

from collections import Counter

AA_count = Counter('GATTACAT')
AA_most_common = AA_count.most_common(1)  # [(residue, highest count)]
most_common = [AA for AA, ct in AA_count.items() if ct == AA_most_common[0][1]]

print(most_common)
# ['A', 'T']
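
The same idea can be wrapped in a small helper. This is only a sketch: the name `most_common_all` is made up here, it takes the highest count with `max()` rather than `most_common(1)`, and it assumes that an empty sequence should simply give an empty list.

```python
from collections import Counter


def most_common_all(sequence):
    """Return (highest_count, residues tied at that count).

    Hypothetical helper for illustration; not part of the original code.
    """
    counts = Counter(sequence.upper())
    if not counts:  # assumption: empty input yields no winners
        return 0, []
    top = max(counts.values())
    # Sort the tied residues so the result does not depend on input order
    tied = sorted(aa for aa, ct in counts.items() if ct == top)
    return top, tied


top, tied = most_common_all("AAAAADEEEEE")
print(", ".join(tied), "appear(s) most often:", top, "time(s)")
```

Returning the count together with the tied residues keeps the printing code separate from the counting, so the message format from the question can be reused unchanged.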

Answer 2

If you want a list of all acids that share the same, highest number of occurrences, you can use:

[(acid, cnt) for acid, cnt in AA_count.items() if cnt == AA_count.most_common(1)[0][1]]

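The result of this expression (which you would assign to AA_mostfrequent) has the same format you currently expect: a list of acid-count pairs.

Explanation: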

[(acid, cnt) for                  # 4) each acid-cnt pair that satisfies the condition ends up in the result
 acid, cnt in AA_count.items()    # 1) loop over all acid-count pairs
 if cnt ==                        # 2) keep only the pairs whose cnt part is equal to
     AA_count.most_common(1)      # 3) the (list of the) single most common acid-cnt pair
         [0]                      # 3.1) from this list, take the only element
             [1]]                 # 3.2) from this element (acid-cnt pair), take only the cnt part
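
One possible refinement, offered as a sketch rather than as part of the answer above: the comprehension calls `most_common(1)` again for every pair it checks, so the highest count can be computed once beforehand. The variable name `top_count` and the wording of the printed message are assumptions made for this example, and the sequence is assumed to be non-empty (on an empty Counter, `most_common(1)[0]` raises an IndexError).

```python
from collections import Counter

AA_count = Counter("AAAAADEEEEE")  # example sequence taken from the question

# Compute the highest count once, outside the comprehension
top_count = AA_count.most_common(1)[0][1]

# Same list of (acid, count) pairs as the one-line expression produces
AA_mostfrequent = [(acid, cnt) for acid, cnt in AA_count.items() if cnt == top_count]

# The printing loop from the question then reports every tied amino acid
for key, value in AA_mostfrequent:
    print(key, "is among the most common amino acids in your sequence, appearing", value, "time(s)!")
```

Because `AA_mostfrequent` keeps the list-of-pairs shape, the original `for key, value in AA_mostfrequent` loop needs no structural change.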

Print all amino acids that share the same frequency in a given sequence
