I have data like this:
contig34706 sp|A1IVM0|A1IVM0_TRIDB 96
contig118453 sp|A1IVM0|A1IVM0_TRIDB 98
contig12943 tr|A7XPA0|A7XPA0_TRIDB 96
contig92741 tr|A7XPA0|A7XPA0_TRIDB 96
contig92741 tr|A8QU19|A8QU19_TRIDB 94
contig523 tr|A9U8G7|A9U8G7_TRIDB 94
contig14487 tr|A9U8G7|A9U8G7_TRIDB 95
contig80716 tr|A9U8G7|A9U8G7_TRIDB 93
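I want to know how many contigs and proteins there are in the file, without counting repeated elements. So I want to compare the elements at index [1] of each row with one another, then count and print them, skipping the repeated ones. The same goes for the elements at index [0].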
import re
count = 0
lines = open("file.txt", "r").readlines()
for line in lines:
    new_list=re.split(r'\t+',line.strip())
    contig=new_list[0]
    protien=new_list[1]
    for element in contig:
        if element != element:
            count += 1
        else:
Well, I don't know how to finish this, or whether this is even the right approach...
The output I want:
sp|A1IVM0|A1IVM0_TRIDB 96
tr|A7XPA0|A7XPA0_TRIDB 96
tr|A8QU19|A8QU19_TRIDB 94
tr|A9U8G7|A9U8G7_TRIDB 94
Answer 0 (score: 2)
track = set()  # proteins that have already been printed
with open("file.txt") as f:
    for line in f:
        new_list = line.split()
        if new_list[1] not in track:
            print(new_list[1], new_list[2])
            track.add(new_list[1])
If the protein at index [1] of a row is new, print it and add it to `track` so that later duplicates are skipped.
Output:
sp|A1IVM0|A1IVM0_TRIDB 96
tr|A7XPA0|A7XPA0_TRIDB 96
tr|A8QU19|A8QU19_TRIDB 94
tr|A9U8G7|A9U8G7_TRIDB 94
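The same first-occurrence filter can be wrapped in a function so it can be tried without a file on disk. This is only a minimal sketch in Python 3: the name `first_hits` and the in-memory `sample` list are illustrative assumptions, not part of the answer above.

```python
def first_hits(lines):
    """Return (protein, score) pairs for the first row in which each protein appears."""
    seen = {}  # dicts keep insertion order in Python 3.7+
    for line in lines:
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or malformed rows
        # setdefault stores the score only the first time a protein is seen
        seen.setdefault(fields[1], fields[2])
    return list(seen.items())


# Sample rows copied from the question
sample = [
    "contig34706 sp|A1IVM0|A1IVM0_TRIDB 96",
    "contig118453 sp|A1IVM0|A1IVM0_TRIDB 98",
    "contig12943 tr|A7XPA0|A7XPA0_TRIDB 96",
    "contig92741 tr|A7XPA0|A7XPA0_TRIDB 96",
    "contig92741 tr|A8QU19|A8QU19_TRIDB 94",
    "contig523 tr|A9U8G7|A9U8G7_TRIDB 94",
    "contig14487 tr|A9U8G7|A9U8G7_TRIDB 95",
    "contig80716 tr|A9U8G7|A9U8G7_TRIDB 93",
]

for protein, score in first_hits(sample):
    print(protein, score)
```

Because the score is stored only on first sight, a protein that reappears later in an unsorted file is still reported once, with the score from its first row.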
Answer 1 (score: 1)
dc, dp = {}, {}  # the dict keys act as the sets of unique contigs / proteins
with open('file.txt') as f:
    for line in f:
        v = line.split()
        dc[v[0]] = dp[v[1]] = 1
print(len(dc), len(dp))
for k in dc:
    print(k)
for k in dp:
    print(k)
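Since the dictionary values above are never used, the same idea can probably be written more directly with sets. A small sketch, assuming whitespace-separated columns as in the question; `unique_ids` and `sample` are hypothetical names introduced here for illustration.

```python
def unique_ids(lines):
    """Collect the distinct contig IDs (column 0) and protein IDs (column 1)."""
    contigs, proteins = set(), set()
    for line in lines:
        fields = line.split()
        if len(fields) >= 2:  # ignore blank or truncated rows
            contigs.add(fields[0])
            proteins.add(fields[1])
    return contigs, proteins


# Sample rows copied from the question
sample = """contig34706 sp|A1IVM0|A1IVM0_TRIDB 96
contig118453 sp|A1IVM0|A1IVM0_TRIDB 98
contig12943 tr|A7XPA0|A7XPA0_TRIDB 96
contig92741 tr|A7XPA0|A7XPA0_TRIDB 96
contig92741 tr|A8QU19|A8QU19_TRIDB 94
contig523 tr|A9U8G7|A9U8G7_TRIDB 94
contig14487 tr|A9U8G7|A9U8G7_TRIDB 95
contig80716 tr|A9U8G7|A9U8G7_TRIDB 93""".splitlines()

contigs, proteins = unique_ids(sample)
print(len(contigs), len(proteins))
```

With a real file, passing the open file object instead of `sample` should work the same way, because the function only iterates over lines.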
Answer 2 (score: 1)
"I want to know how many contigs and proteins there are in the file"
Here is one way to do it:
from collections import defaultdict
count_contig = defaultdict(int)   # contig ID -> number of rows it appears in
count_protein = defaultdict(int)  # protein ID -> number of rows it appears in
with open('file.txt') as f:
    for line in f:
        line = line.split()
        count_contig[line[0]] += 1
        count_protein[line[1]] += 1
print('Number of unique contigs:', len(count_contig))
print('Number of unique proteins:', len(count_protein))
Output:
Number of unique contigs: 7
Number of unique proteins: 6
You can access the actual number of occurrences of each contig/protein like this:
count_contig['contig92741'] # returns 2
count_contig['unknown_contig'] # returns 0, thanks to defaultdict
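One caveat with the lookup above: indexing a `defaultdict` with a missing key also inserts that key, so the number of "unique" entries grows by one afterwards. The sketch below shows the effect and two lookups that leave the dictionary untouched; the variable names and the single count are made up for illustration.

```python
from collections import defaultdict

counts = defaultdict(int)
counts["contig92741"] += 2

before = len(counts)
missing = counts["unknown_contig"]  # returns 0, but also inserts the key
after = len(counts)

# Side-effect-free alternatives: .get() and the `in` operator
safe = defaultdict(int, {"contig92741": 2})
value = safe.get("unknown_contig", 0)
present = "unknown_contig" in safe

print(before, after, value, present, len(safe))  # prints: 1 2 0 False 1
```

If the length of the dictionary is used as the number of unique IDs, it seems safer to query with `.get()` or `in` once counting is done.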
To list the unique contigs/proteins, just access the dictionary's keys:
print('Unique contigs are:', list(count_contig.keys()))
print('Unique proteins are:', list(count_protein.keys()))
Output (the order of the keys may differ):
Unique contigs are: ['contig12943', 'contig523', 'contig80716', 'contig118453', 'contig14487', 'contig34706', 'contig92741']
Unique proteins are: ['tr|A9U8G7|A9U8G7_TRIDB', 'tr|A7XPA0|A7XPA0_TRIDB', 'tr|A8QU19|A8QU19_TRIDB', 'sp|A1IVM0|A1IVM0_TRIDB', 'sp|A5A8T8|A5A8T8_TRIDB', 'tr|A8QTZ7|A8QTZ7_TRIDB']
Dictionaries are wonderful; you should try to learn more about them.
Answer 3 (score: 1)
You could always go with a dict that returns the count of each:
contigs = """contig34706 sp|A1IVM0|A1IVM0_TRIDB 96
contig118453 sp|A1IVM0|A1IVM0_TRIDB 98
contig12943 tr|A7XPA0|A7XPA0_TRIDB 96
contig92741 tr|A7XPA0|A7XPA0_TRIDB 96
contig92741 tr|A8QU19|A8QU19_TRIDB 94
contig523 tr|A9U8G7|A9U8G7_TRIDB 94
contig14487 tr|A9U8G7|A9U8G7_TRIDB 95
contig80716 tr|A9U8G7|A9U8G7_TRIDB 93"""
from collections import Counter
# Index 1 of each row is the protein ID, so despite the variable name this counts proteins
contigs = [c.split()[1] for c in contigs.split("\n")]
contig_cnts = Counter(contigs)
Or, if you don't care about the counts, even a set:
contig_set = set(contigs)
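To cover both columns at once, the `Counter` approach could be extended roughly as follows. This is a sketch rather than the answerer's code: the names `data`, `rows`, `contig_counts` and `protein_counts` are assumptions introduced here.

```python
from collections import Counter

# Sample rows copied from the question
data = """contig34706 sp|A1IVM0|A1IVM0_TRIDB 96
contig118453 sp|A1IVM0|A1IVM0_TRIDB 98
contig12943 tr|A7XPA0|A7XPA0_TRIDB 96
contig92741 tr|A7XPA0|A7XPA0_TRIDB 96
contig92741 tr|A8QU19|A8QU19_TRIDB 94
contig523 tr|A9U8G7|A9U8G7_TRIDB 94
contig14487 tr|A9U8G7|A9U8G7_TRIDB 95
contig80716 tr|A9U8G7|A9U8G7_TRIDB 93"""

# Split every non-empty line into its whitespace-separated columns
rows = [line.split() for line in data.splitlines() if line.strip()]

contig_counts = Counter(row[0] for row in rows)   # column 0: contig IDs
protein_counts = Counter(row[1] for row in rows)  # column 1: protein IDs

print(len(contig_counts), len(protein_counts))  # number of distinct IDs per column
print(protein_counts.most_common(1))            # the most frequent protein and its count
```

`len()` of a `Counter` gives the number of distinct keys, so it answers the "how many" question directly, while `most_common()` shows which IDs are repeated most often.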