Comparing elements in a line

Time: 2014-02-03 13:20:50

Tags: python

I have data like this:

contig34706   sp|A1IVM0|A1IVM0_TRIDB  96
contig118453  sp|A1IVM0|A1IVM0_TRIDB  98
contig12943   tr|A7XPA0|A7XPA0_TRIDB  96    
contig92741   tr|A7XPA0|A7XPA0_TRIDB  96    
contig92741   tr|A8QU19|A8QU19_TRIDB  94
contig523     tr|A9U8G7|A9U8G7_TRIDB  94    
contig14487   tr|A9U8G7|A9U8G7_TRIDB  95
contig80716   tr|A9U8G7|A9U8G7_TRIDB  93

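I want to know how many contigs and proteins are in the file, without counting duplicates. So I want to compare the elements in column [1] with each other, then count and print each element only once, skipping the repeats. The same goes for column [0].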

import re
count = 0
lines = open("file.txt", "r").readlines()
for line in lines:
    new_list=re.split(r'\t+',line.strip())
    contig=new_list[0]
    protien=new_list[1]
    for element in contig:
        if element != element:
            count += 1
        else:
Well, I don't know how to finish this, or whether it is even the right approach.

The output I want:

 sp|A1IVM0|A1IVM0_TRIDB  96
 tr|A7XPA0|A7XPA0_TRIDB  96        
 tr|A8QU19|A8QU19_TRIDB  94
 tr|A9U8G7|A9U8G7_TRIDB  94    

4 Answers:

Answer 0 (score: 2)

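track = set()  # proteins that have already been printed
lines = open("file.txt", "r").readlines()
for line in lines:
    new_list = line.split()
    if new_list[1] not in track:
        print new_list[1], " ", new_list[2]
        track.add(new_list[1])

If the protein in column [1] has not been seen yet, print it together with its score and add it to the set that tracks duplicates.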

Output:

sp|A1IVM0|A1IVM0_TRIDB   96
tr|A7XPA0|A7XPA0_TRIDB   96
tr|A8QU19|A8QU19_TRIDB   94
tr|A9U8G7|A9U8G7_TRIDB   94
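
The same first-occurrence idea can be sketched as a small function. This is only an illustration: the name `first_hits` is made up here, and the rows from the question are inlined as `sample` so that the sketch runs without `file.txt`.

```python
# Sample rows copied from the question; inlined so the sketch runs without file.txt.
sample = """contig34706   sp|A1IVM0|A1IVM0_TRIDB  96
contig118453  sp|A1IVM0|A1IVM0_TRIDB  98
contig12943   tr|A7XPA0|A7XPA0_TRIDB  96
contig92741   tr|A7XPA0|A7XPA0_TRIDB  96
contig92741   tr|A8QU19|A8QU19_TRIDB  94
contig523     tr|A9U8G7|A9U8G7_TRIDB  94
contig14487   tr|A9U8G7|A9U8G7_TRIDB  95
contig80716   tr|A9U8G7|A9U8G7_TRIDB  93"""


def first_hits(lines):
    """Return (protein, score) for the first row in which each protein appears."""
    seen = set()   # proteins already reported
    hits = []
    for line in lines:
        fields = line.split()
        if len(fields) < 3:   # skip blank or malformed rows
            continue
        protein, score = fields[1], fields[2]
        if protein not in seen:
            seen.add(protein)
            hits.append((protein, score))
    return hits


hits = first_hits(sample.splitlines())
for protein, score in hits:
    print("%s  %s" % (protein, score))
```

A set is used for `seen` because membership tests on it stay fast as the file grows. The list `hits` keeps the rows in the order they first appear.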

Answer 1 (score: 1)

dc, dp = {}, {}
with open('file.txt') as f:
    for line in f:
        v = line.split()
        dc[v[0]] = dp[v[1]] = 1 
print len(dc), len(dp)
for k in dc: print k
for k in dp: print k

Answer 2 (score: 1)

  

"I want to know how many contigs and proteins are in the file"

Here is one way to do it:

from collections import defaultdict
count_contig = defaultdict(int)
count_protein = defaultdict(int)
with open('file.txt') as f:
    for line in f:
        line = line.split()
        count_contig[line[0]] += 1
        count_protein[line[1]] += 1
print 'Number of unique contigs:', len(count_contig)
print 'Number of unique proteins:', len(count_protein)

Output:

Number of unique contigs: 7
Number of unique proteins: 6

You can look up the actual number of occurrences of each contig/protein like this:

count_contig['contig92741'] # returns 2
count_contig['unknown_contig'] # returns 0, thanks to defaultdict
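
Building on those counts, here is a minimal sketch that lists only the contigs appearing more than once. It assumes the rows from the question, inlined as `sample` so it runs without `file.txt`; the name `repeated` is illustrative.

```python
from collections import defaultdict

# Sample rows copied from the question; inlined so the sketch runs without file.txt.
sample = """contig34706   sp|A1IVM0|A1IVM0_TRIDB  96
contig118453  sp|A1IVM0|A1IVM0_TRIDB  98
contig12943   tr|A7XPA0|A7XPA0_TRIDB  96
contig92741   tr|A7XPA0|A7XPA0_TRIDB  96
contig92741   tr|A8QU19|A8QU19_TRIDB  94
contig523     tr|A9U8G7|A9U8G7_TRIDB  94
contig14487   tr|A9U8G7|A9U8G7_TRIDB  95
contig80716   tr|A9U8G7|A9U8G7_TRIDB  93"""

count_contig = defaultdict(int)
for line in sample.splitlines():
    fields = line.split()
    count_contig[fields[0]] += 1   # column [0] holds the contig name

# Keep only the contigs whose count is greater than one.
repeated = sorted(name for name, n in count_contig.items() if n > 1)
print(repeated)
```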

To list the unique contigs/proteins themselves, just access the dictionary's keys:

print 'Unique contigs are:', count_contig.keys()
print 'Unique proteins are:', count_protein.keys()

Output:

Unique contigs are: ['contig12943', 'contig523', 'contig80716', 'contig118453', 'contig14487', 'contig34706', 'contig92741']
Unique proteins are: ['tr|A9U8G7|A9U8G7_TRIDB', 'tr|A7XPA0|A7XPA0_TRIDB', 'tr|A8QU19|A8QU19_TRIDB', 'sp|A1IVM0|A1IVM0_TRIDB', 'sp|A5A8T8|A5A8T8_TRIDB', 'tr|A8QTZ7|A8QTZ7_TRIDB']

Dictionaries are great; you should try to learn more about them.

Answer 3 (score: 1)

You can always get a dict holding the count of each one:

contigs = """contig34706   sp|A1IVM0|A1IVM0_TRIDB  96
contig118453  sp|A1IVM0|A1IVM0_TRIDB  98
contig12943   tr|A7XPA0|A7XPA0_TRIDB  96    
contig92741   tr|A7XPA0|A7XPA0_TRIDB  96    
contig92741   tr|A8QU19|A8QU19_TRIDB  94
contig523     tr|A9U8G7|A9U8G7_TRIDB  94    
contig14487   tr|A9U8G7|A9U8G7_TRIDB  95
contig80716   tr|A9U8G7|A9U8G7_TRIDB  93"""

from collections import Counter

contigs = [c.split()[1] for c in contigs.split("\n")]
contig_cnts = Counter(contigs)

Or, if you don't care about the counts, even just a set:

contig_set = set(contigs)
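
Applied to both columns of the sample data, the `Counter` and `set` approaches might look like the sketch below. The names `rows`, `unique_contigs`, `unique_proteins`, and `protein_counts` are illustrative and not part of the original answer.

```python
from collections import Counter

# Sample rows copied from the question.
sample = """contig34706   sp|A1IVM0|A1IVM0_TRIDB  96
contig118453  sp|A1IVM0|A1IVM0_TRIDB  98
contig12943   tr|A7XPA0|A7XPA0_TRIDB  96
contig92741   tr|A7XPA0|A7XPA0_TRIDB  96
contig92741   tr|A8QU19|A8QU19_TRIDB  94
contig523     tr|A9U8G7|A9U8G7_TRIDB  94
contig14487   tr|A9U8G7|A9U8G7_TRIDB  95
contig80716   tr|A9U8G7|A9U8G7_TRIDB  93"""

rows = [line.split() for line in sample.splitlines()]

# Sets drop duplicates, so their sizes are the unique counts.
unique_contigs = set(row[0] for row in rows)
unique_proteins = set(row[1] for row in rows)

# Counter additionally records how often each protein occurs.
protein_counts = Counter(row[1] for row in rows)

print("unique contigs: %d" % len(unique_contigs))
print("unique proteins: %d" % len(unique_proteins))
```

On these eight rows this gives 7 unique contigs and 4 unique proteins. A set does not preserve the order of the file, so use it when only the membership or the count matters.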