I have an input file such as:
COG1:aomo|At1g01190|aomo|At1g01280|aomo|At1g11600|homo|Hs10834998|homo|Hs13699816
COG2:aomo|At1g04160|somo|YAL029c|somo|YOR326w|homo|Hs10835119|aomo|At1g10260
COG3:somo|YAR009c|somo|YJL113w|aomo|At1g10260|aomo|At1g11265
From this, I want a simple count written to an output file, such as:
aomo | homo | somo
COG1 3 | 2 | 0
COG2 2 | 1 | 2
COG3 2 | 0 | 2
To do this, I use:
import re
l=[]
dict={}
with open("groups.txt","r") as f:
    for line in f:
        items=line.split(":")
        key=items[0]
        if key not in dict:
            dict[key]={}
        string=items[1]
        words=re.findall("\S+\|\S+",string)
        for w in words:
            tmp=w.split("|")
            if tmp[0] not in l:
                l.append(tmp[0])
            if tmp[0] in dict[key]:
                dict[key][tmp[0]]=1+dict[key][tmp[0]]
            else:
                dict[key][tmp[0]]=1
for i in sorted(l):
    print(i,end=" ")
print("")
for k in sorted(dict.keys()):
    print(k,end=" ")
    for i in sorted(l):
        if i in dict[k]:
            print(dict[k][i],end=" ")
        else:
            print("0", end=" ")
    print("")
It works fine, but when I change the input file to:
COG1:aomo_At1g01190|aomo_At1g01280|aomo_At1g11600|homo_Hs10834998|homo_Hs13699816
COG2:aomo_At1g04160|somo_YAL029c|somo_YOR326w|homo_Hs10835119
COG3:somo_YAR009c|somo_YJL113w|aomo_At1g10260|aomo_At1g11265
and change the code to:
words=re.findall("\S+\_\S+",string)
for w in words:
    tmp=w.split("_")
I get the following error:
  File "my_program.py", line 10, in <module>
    string=items[1]
IndexError: list index out of range
Answer 0 (score: 1)
You don't need the powerful re module to do this.
template = '{0:4} {1:4} | {2:4} | {3:4}'
columns = ['aomo', 'homo', 'somo']
with open('groups.txt') as f:
    # Header row: blank key column followed by the column labels
    print(template.format(' ', *columns))
    for line in f:
        key, value = line.split(':')
        # Count how often each label occurs in the rest of the line
        counts = [value.count(column_label) for column_label in columns]
        print(template.format(key.strip(), *counts))
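Because str.count only looks for the label text, the same counting step should not depend on whether the label and identifier are joined by "|" or "_". The following is a minimal sketch of that idea, not part of the original answer: count_labels is a hypothetical helper, and the two sample lines are copied from the question. One caveat: substring counting would overcount if a label ever appeared inside an identifier, which does not happen in the sample data.

```python
COLUMNS = ['aomo', 'homo', 'somo']  # labels taken from the question


def count_labels(line, columns=COLUMNS):
    """Return (key, counts) for one 'KEY:...' line.

    Hypothetical helper: partition() never raises, so a line
    without ':' simply yields an empty value and all-zero counts.
    """
    key, _, value = line.partition(':')
    return key.strip(), [value.count(label) for label in columns]


# First line of each input format shown in the question
pipe_line = "COG1:aomo|At1g01190|aomo|At1g01280|aomo|At1g11600|homo|Hs10834998|homo|Hs13699816"
underscore_line = "COG1:aomo_At1g01190|aomo_At1g01280|aomo_At1g11600|homo_Hs10834998|homo_Hs13699816"

print(count_labels(pipe_line))
print(count_labels(underscore_line))
```

Both calls should report three aomo, two homo and zero somo entries for COG1.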
Answer 1 (score: 0)
Here is a simple way to do it:
>>> my_string = "COG1: aomo|At1g01190 aomo|At1g01280 aomo|At1g11600 homo|Hs10834998 homo|Hs13699816 "
>>> a,b = my_string.split(":") # will split strings on ":"
>>> a
'COG1'
>>> b
' aomo|At1g01190 aomo|At1g01280 aomo|At1g11600 homo|Hs10834998 homo|Hs13699816 '
>>> import re
>>> from collections import Counter
>>> my_count = Counter(re.findall("aomo|homo|somo",b)) # findall returns every match; Counter maps each matched label to its count
>>> my_count
Counter({'aomo': 3, 'homo': 2})
>>> "{} {} {} {}".format(a,my_count.get('aomo',0),my_count.get('homo',0),my_count.get('somo',0))
'COG1 3 2 0'
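The interactive steps above could be applied to every line of the input along these lines. This is only a sketch: count_groups and format_rows are hypothetical names, the label list is assumed to be fixed and known in advance, and the sample lines are copied from the question rather than read from groups.txt.

```python
import re
from collections import Counter

LABELS = ["aomo", "homo", "somo"]        # assumed fixed set of labels
LABEL_RE = re.compile("|".join(LABELS))  # same pattern as in the session above


def count_groups(lines):
    """Map each group key to a Counter of label occurrences."""
    table = {}
    for line in lines:
        key, sep, value = line.partition(":")
        if not sep:  # no ':' on this line (for example a blank line): skip it
            continue
        table[key.strip()] = Counter(LABEL_RE.findall(value))
    return table


def format_rows(table):
    """Render one 'KEY a | h | s' row per group, sorted by key."""
    rows = []
    for key in sorted(table):
        counts = [str(table[key].get(label, 0)) for label in LABELS]
        rows.append("{} {}".format(key, " | ".join(counts)))
    return rows


sample = [
    "COG1:aomo|At1g01190|aomo|At1g01280|aomo|At1g11600|homo|Hs10834998|homo|Hs13699816",
    "COG2:aomo|At1g04160|somo|YAL029c|somo|YOR326w|homo|Hs10835119|aomo|At1g10260",
    "COG3:somo|YAR009c|somo|YJL113w|aomo|At1g10260|aomo|At1g11265",
]

rows = format_rows(count_groups(sample))
for row in rows:
    print(row)
```

To produce an output file instead of printing, the rows could be written with an ordinary open(..., "w") handle.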
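Answer 2 (score: 0)
There may be some blank lines in the second file. Splitting such a line gives a list of length 1 (for example ['\n'] for a line that contains only a newline), so accessing index 1 of that list raises IndexError.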
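A small sketch of that explanation, assuming the second file really does contain a blank line (the question does not confirm this). split_group_line is a hypothetical helper that returns None for any line without ":", so the caller can skip it instead of indexing past the end of the list.

```python
def split_group_line(line):
    """Return (key, value) for a 'KEY:...' line, or None if there is no ':'."""
    line = line.strip()
    if ":" not in line:
        return None  # blank or malformed line: nothing to index into
    key, value = line.split(":", 1)
    return key, value


# What the original loop sees for a blank line: a one-element list,
# so items[1] would raise IndexError.
blank_items = "\n".split(":")
print(blank_items, len(blank_items))

# Hypothetical file content with a blank line between two records
lines = [
    "COG2:aomo_At1g04160|somo_YAL029c|somo_YOR326w|homo_Hs10835119\n",
    "\n",
    "COG3:somo_YAR009c|somo_YJL113w|aomo_At1g10260|aomo_At1g11265\n",
]
parsed = [pair for pair in map(split_group_line, lines) if pair is not None]
print([key for key, _ in parsed])
```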