I have several thousand lines of data that look like this:
TTGGGG**TCTCCAT**
TTCTTC**TCTCCAT**
TTGGGG**TCTCCAT**
TTCTTC**TCTCCAT**
TATTAT**TCTCCAT**
I want to group identical lines and count them, to get output like this:
TTGGGG**TCTCCAT** - 2
TTCTTC**TCTCCAT** - 2
TATTAT**TCTCCAT** - 1
Because the six characters before the bold part are random, I don't know how to write this in Python.
Answer 0 (score: 0)
from collections import Counter

# Count how many times each distinct line occurs in the file.
with open('path/to/input') as infile:
    counts = Counter(line.strip() for line in infile)

for seq, count in counts.items():
    print(seq, '-', count)
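The solution above uses collections.Counter.
Alternatively, if you would rather not use that helper from the standard library, a plain dictionary gives the same result: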
counts = {}
with open('path/to/input') as infile:
    for line in infile:
        seq = line.strip()
        # Start a new tally the first time a sequence is seen.
        if seq not in counts:
            counts[seq] = 0
        counts[seq] += 1

for seq, count in counts.items():
    print(seq, '-', count)
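If only the first six characters vary and the rest of every line is the same motif, the count can also be keyed on that prefix alone. This is a minimal sketch under that assumption, using an in-memory list rather than a file; `count_prefixes` and `prefix_len` are hypothetical names that do not appear in the question.

```python
from collections import Counter


def count_prefixes(lines, prefix_len=6):
    """Count lines by their first `prefix_len` characters (hypothetical helper)."""
    # Skip blank lines so a trailing newline does not produce an empty key.
    return Counter(line.strip()[:prefix_len] for line in lines if line.strip())


# Sample data copied from the question, including the ** markers.
sample = [
    "TTGGGG**TCTCCAT**",
    "TTCTTC**TCTCCAT**",
    "TTGGGG**TCTCCAT**",
    "TTCTTC**TCTCCAT**",
    "TATTAT**TCTCCAT**",
]

prefix_counts = count_prefixes(sample)
# most_common() lists the most frequent prefixes first.
for prefix, count in prefix_counts.most_common():
    print(prefix, "-", count)
```

If the suffix can differ between lines, counting whole lines as in the code above is the safer choice.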
Answer 1 (score: 0)
First approach:
Example:
>>> [1, 2, 3, 4, 1, 4, 1].count(1)
3
So in your case:
>>> ['TTGGGG**TCTCCAT**', 'TTCTTC**TCTCCAT**', 'TTGGGG**TCTCCAT**', 'TTCTTC**TCTCCAT**', 'TATTAT**TCTCCAT**'].count('TTGGGG**TCTCCAT**')
2
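list.count reports the tally for one value at a time. To get the full table with this approach, it can be called once per distinct value. A minimal sketch follows; the variable names are illustrative only.

```python
sequences = [
    "TTGGGG**TCTCCAT**",
    "TTCTTC**TCTCCAT**",
    "TTGGGG**TCTCCAT**",
    "TTCTTC**TCTCCAT**",
    "TATTAT**TCTCCAT**",
]

# dict.fromkeys drops duplicates while keeping first-seen order.
counts = {seq: sequences.count(seq) for seq in dict.fromkeys(sequences)}

for seq, count in counts.items():
    print(seq, "-", count)
```

Each call to count rescans the whole list, so this is fine for a few thousand lines but gets slow when there are many distinct sequences.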
Second approach:
>>> from collections import Counter
>>> z = ['TTGGGG**TCTCCAT**', 'TTCTTC**TCTCCAT**', 'TTGGGG**TCTCCAT**', 'TTCTTC**TCTCCAT**', 'TATTAT**TCTCCAT**']
>>> Counter(z)
Counter({'TTGGGG**TCTCCAT**': 2, 'TTCTTC**TCTCCAT**': 2, 'TATTAT**TCTCCAT**': 1})