Question

I have a few thousand lines of data that look like this:

TTGGGG**TCTCCAT**  
TTCTTC**TCTCCAT**  
TTGGGG**TCTCCAT**  
TTCTTC**TCTCCAT**  
TATTAT**TCTCCAT**

I want to group and count the data to get output like this:

TTGGGG**TCTCCAT** - 2  
TTCTTC**TCTCCAT** - 2  
TATTAT**TCTCCAT** - 1

Because the 6 characters before the bold part are random, I don't know how to write this in Python.

Answer 1

from collections import Counter

# Count each distinct line, ignoring the trailing newline and surrounding whitespace.
with open('path/to/input') as infile:
    counts = Counter(line.strip() for line in infile)
for seq, count in counts.items():
    print(seq, '-', count)

The solution above uses collections.Counter. If you would rather not use that helper from the standard library, you can get the same result like this:

counts = {}
with open('path/to/input') as infile:
    for line in infile:
        seq = line.strip()
        if seq not in counts:
            counts[seq] = 0
        counts[seq] += 1
for seq, count in counts.items():
    print(seq, '-', count)
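
The counting step can also be wrapped in a small helper that accepts any iterable of lines, which makes it easy to try on an in-memory list before pointing it at a file. This is only a minimal sketch: the name `count_sequences` and the decision to skip blank lines are assumptions, not part of the answer above, and the sample data omits the `**` markers on the assumption that they are just formatting in the post.

```python
from collections import Counter


def count_sequences(lines):
    """Count how often each non-empty, whitespace-stripped line occurs."""
    stripped = (line.strip() for line in lines)
    return Counter(seq for seq in stripped if seq)


# Hypothetical sample input, including a trailing blank line.
sample = [
    "TTGGGGTCTCCAT\n",
    "TTCTTCTCTCCAT\n",
    "TTGGGGTCTCCAT\n",
    "TTCTTCTCTCCAT\n",
    "TATTATTCTCCAT\n",
    "\n",
]
counts = count_sequences(sample)

# most_common() yields (sequence, count) pairs, highest count first.
for seq, count in counts.most_common():
    print(seq, "-", count)
```

Because an open file is itself an iterable of lines, `count_sequences(infile)` should work unchanged inside a `with open(...) as infile:` block.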

Answer 2

First approach:

Example:

>>> [1, 2, 3, 4, 1, 4, 1].count(1)
3

So in your case:

>>> ['TTGGGG**TCTCCAT**','TTCTTC**TCTCCAT**','TTGGGG**TCTCCAT**','TTCTTC**TCTCCAT**','TATTAT**TCTCCAT**'].count('TTGGGG**TCTCCAT**')
2
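
A single `list.count` call only reports one value. To get a count for every distinct sequence with this approach, one option is to call it once per unique value, as in the sketch below. The variable names are illustrative, and the data omits the `**` markers on the assumption that they are only formatting.

```python
sequences = [
    "TTGGGGTCTCCAT",
    "TTCTTCTCTCCAT",
    "TTGGGGTCTCCAT",
    "TTCTTCTCTCCAT",
    "TATTATTCTCCAT",
]

# dict.fromkeys drops duplicates while keeping first-seen order.
unique = list(dict.fromkeys(sequences))

# One list.count call per distinct sequence.
counts = {seq: sequences.count(seq) for seq in unique}

for seq, n in counts.items():
    print(seq, "-", n)
```

Each `count` call rescans the whole list, so the work grows with the number of distinct sequences times the list length. That is probably acceptable for a few thousand lines, but a single-pass tally such as `collections.Counter` scales better.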

Second approach:

>>> from collections import Counter
>>> z = ['TTGGGG**TCTCCAT**','TTCTTC**TCTCCAT**','TTGGGG**TCTCCAT**','TTCTTC**TCTCCAT**','TATTAT**TCTCCAT**']
>>> Counter(z)
Counter({'TTGGGG**TCTCCAT**': 2, 'TTCTTC**TCTCCAT**': 2, 'TATTAT**TCTCCAT**': 1})
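
If the goal is to tally only the variable part of each sequence, the same `Counter` can be keyed on a slice instead of the whole string. The sketch below assumes that the variable part is exactly the first 6 characters, as the question describes, and that the `**` markers are not present in the real data. `PREFIX_LEN` and `prefix_counts` are illustrative names.

```python
from collections import Counter

PREFIX_LEN = 6  # assumed length of the variable part, taken from the question

z = [
    "TTGGGGTCTCCAT",
    "TTCTTCTCTCCAT",
    "TTGGGGTCTCCAT",
    "TTCTTCTCTCCAT",
    "TATTATTCTCCAT",
]

# Key each sequence by its first PREFIX_LEN characters only.
prefix_counts = Counter(seq[:PREFIX_LEN] for seq in z)

for prefix, n in prefix_counts.items():
    print(prefix, "-", n)
```

When every line ends in the same fixed suffix, this gives the same counts as keying on the full line, so it only matters if the suffix can differ between lines.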

How do I group and count random strings?