I have a huge multi-FASTA sequence file, such as:
>Seq_1_0035_0035
ATTGGAT
>Seq_2_0042_0035
ATTGAGGA
>EOGWX56TR_0035_0042 (busco)
ATGGAGAT
>EOGWX56TR_0042_0042 (busco)
ATGGATGG
>Seq6_035_0042
ATGGGAATAG
>EOG55FTG_0035_0042 (busco)
AATGGATA
>EOG5GFFTA_0042_0042 (busco)
ATGGAGATA
>Seq56_0035_0042
ATGGAGATAT
>EOGATTT_0035_0042 (busco)
AAATGAGATA
>EOGATTT_0042_0042 (busco)
ATGGAAT
>EOGATTA_0042_0042 (busco)
ATAGGAGAT
I want to count how many BUSCO genes there are in my file (their names all start with >EOG). For that, I have a script:
from Bio import SeqIO

# Count every record in the file (start at 0 so the total is not off by one)
count = 0
for record in SeqIO.parse("concatenate_with_busco_names_0035_0042_aa.fa", "fasta"):
    count += 1
print(count)

# Collect the distinct BUSCO names (headers starting with >EOG)
set_of_labels = set()
with open("concatenate_with_busco_names_0035_0042_aa.fa") as f:
    for line in f:
        if line.startswith('>EOG'):
            label = line[4:].split('_')[0]
            set_of_labels.add(label)
print("Total number of Busco genes: " + str(len(set_of_labels)))
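But I would also like to know how many genes there are for each corresponding pair. Let me explain.
As you can see, every seq ID ends with two numbers, in the form _number_number.
These numbers are special: the first _number corresponds to the species the sequence belongs to, and the second _number corresponds to a specific number.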
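To make the ID layout concrete, here is a minimal sketch of how one header could be split into its three parts. The function name `parse_header` is hypothetical (it is not part of the original script), and the sketch assumes that every ID ends in exactly two underscore-separated numbers and that any trailing annotation such as "(busco)" is separated from the ID by whitespace.

```python
def parse_header(line):
    """Split a FASTA header into (name, species, specific)."""
    # Drop the leading '>' and anything after the first whitespace,
    # e.g. the trailing "(busco)" annotation.
    seq_id = line[1:].split()[0]
    # Split from the right so underscores inside the name are preserved.
    name, species, specific = seq_id.rsplit("_", 2)
    return name, species, specific


print(parse_header(">EOGWX56TR_0035_0042 (busco)"))  # ('EOGWX56TR', '0035', '0042')
print(parse_header(">Seq_1_0035_0035"))              # ('Seq_1', '0035', '0035')
```

Splitting from the right with `rsplit("_", 2)` is a deliberate choice here: names such as Seq_1 contain an underscore themselves, so a plain `split("_")` would return four parts for them.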
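So I would like to know whether it is possible to count how many different BUSCO genes I have among the sequences whose first number is _0035 and among those whose first number is _0042, and also the count for each of these seq ID endings:
_0035_0042
_0035_0035
_0042_0042
_0042_0035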
In the example above, it would be:
Total busco: 5 (I count only once if the >busco is present even if _number are different)
Total busco for the specie _0035 (_0035_0042 and _0035_0035) : 3
Total busco for the specie _0042 (_0042_0042 and _0042_0035) : 4
Total busco for the specific specie _0035_0042 : 3
Total busco for the specific specie _0042_0035 : 0
Total busco for the specific specie _0042_0042 : 4
Total busco for the specific specie _0035_0035 : 0
I hope that is clear. My script already does the first part (Total busco:); I only need to compute the remaining six counts.
Here is the real data.
Answer 0 (score: 1)
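In addition to the busco counter, you can use several more counters to get the individual counts per species and per specific species, for example: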
import collections

busco = collections.defaultdict(int)  # busco counter
species = collections.defaultdict(int)  # species counter
specific_species = collections.defaultdict(int)  # specific species counter
with open("concatenate_with_busco_names_0035_0042_aa.fa", "r") as f:
    for line in f:
        if line[:4] == ">EOG":
            # drop the trailing "(busco)" and the ">EOG" prefix, then split the ID
            entry = line.split()[0][4:].split('_')
            busco[entry[0]] += 1
            species[entry[1]] += 1
            specific_species[entry[1] + "_" + entry[2]] += 1

print("Total busco: {}".format(len(busco)))
for specie, total in species.items():
    print("Total busco for the specie {}: {}".format(specie, total))
for specie, total in specific_species.items():
    print("Total busco for the specific specie {}: {}".format(specie, total))
Which should produce:
Total busco: 5
Total busco for the specie 0035: 3
Total busco for the specie 0042: 4
Total busco for the specific specie 0035_0042: 3
Total busco for the specific specie 0042_0042: 4
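(Specific) species that never occur in the file will not show up. If you do want them printed, you can build every pair from the keys of the species counter and print the corresponding values (which default to 0):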
import itertools

print("Total busco: {}".format(len(busco)))
for specie, total in species.items():
    print("Total busco for the specie {}: {}".format(specie, total))
# every (species, specific) combination, including those never seen in the file
for specie in itertools.product(species, species):
    s = "_".join(specie)
    print("Total busco for the specific specie {}: {}".format(s, specific_species[s]))
Which yields:
Total busco: 5
Total busco for the specie 0035: 3
Total busco for the specie 0042: 4
Total busco for the specific specie 0035_0035: 0
Total busco for the specific specie 0035_0042: 3
Total busco for the specific specie 0042_0035: 0
Total busco for the specific specie 0042_0042: 4
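UPDATE: If you are after the unique count of buscos, you need to invert the bookkeeping: index on the specie / specific specie and collect the busco names in a set as the value. Then all you need is the length of each set, like: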
import collections
import itertools

busco = set()
species = collections.defaultdict(set)
specific_species = collections.defaultdict(set)
with open("concatenate_with_busco_names_0035_0042_aa.fa", "r") as f:
    for line in f:
        if line[:4] == ">EOG":
            entry = line.split()[0][4:].split('_')
            busco.add(entry[0])
            # a set keeps each busco name only once per key
            species[entry[1]].add(entry[0])
            specific_species[entry[1] + "_" + entry[2]].add(entry[0])

print("Total busco: {}".format(len(busco)))
for specie, buscos in species.items():
    print("Total busco for the specie {}: {}".format(specie, len(buscos)))
for specie in itertools.product(species, species):
    s = "_".join(specie)
    print("Total busco for the specific specie {}: {}".format(s, len(specific_species[s])))
Which, for your full data, prints:
Total busco: 421
Total busco for the specie 0035: 402
Total busco for the specie 0042: 397
Total busco for the specific specie 0035_0035: 392
Total busco for the specific specie 0035_0042: 262
Total busco for the specific specie 0042_0035: 305
Total busco for the specific specie 0042_0042: 383
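To illustrate why the two approaches can give different numbers, here is a small sketch that runs both kinds of bookkeeping on the same input. The three headers are made up for illustration (they are not from the real data); the first two carry the same busco name for species 0035, which is exactly the case where an occurrence count and a unique count diverge.

```python
import collections

# Hypothetical headers: the same BUSCO gene appears twice for species 0035.
headers = [
    ">EOGAAAA_0035_0035 (busco)",
    ">EOGAAAA_0035_0042 (busco)",
    ">EOGBBBB_0042_0042 (busco)",
]

occurrences = collections.defaultdict(int)   # counts every matching header
unique_genes = collections.defaultdict(set)  # collects distinct gene names

for header in headers:
    name, species, _specific = header.split()[0][4:].split("_")
    occurrences[species] += 1
    unique_genes[species].add(name)

for species in occurrences:
    print(species, occurrences[species], len(unique_genes[species]))
# 0035 2 1
# 0042 1 1
```

In the small example from the question the two counts happen to agree, because no busco name is repeated within a species; on larger files they generally will not.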
Answer 1 (score: 1)
This is trivial with the Counter class from the Python standard library:
from collections import Counter
from io import StringIO
label_counter = Counter()
specy_counter = Counter()
specific_specy_counter = Counter()
# replace this with an open() on your real file
finput = StringIO(""">Seq_1_0035_0035
ATTGGAT
>Seq_2_0042_0035
ATTGAGGA
>EOGWX56TR_0035_0042 (busco)
ATGGAGAT
>EOGWX56TR_0042_0042 (busco)
ATGGATGG
>Seq6_035_0042
ATGGGAATAG
>EOG55FTG_0035_0042 (busco)
AATGGATA
>EOG5GFFTA_0042_0042 (busco)
ATGGAGATA
>Seq56_0035_0042
ATGGAGATAT
>EOGATTT_0035_0042 (busco)
AAATGAGATA
>EOGATTT_0042_0042 (busco)
ATGGAAT
>EOGATTA_0042_0042 (busco)
ATAGGAGAT""")
for line in finput:
    try:
        if line.startswith('>EOG'):
            # strip the ">EOG" prefix and the " (busco)" suffix, then split the ID
            label, specy, specific = line[4:].replace(" (busco)", "").strip().split('_')
            label_counter[label] += 1
            specy_counter[specy] += 1
            specific_specy_counter[(specy, specific)] += 1
    except ValueError:
        # raised when the ID does not split into exactly three parts
        print("Invalid line:", line)

print("Total busco:", len(label_counter))
for specy, count in specy_counter.items():
    print("Total busco for the specie {} : {}".format(specy, count))
for (specy, specific), count in specific_specy_counter.items():
    print("Total busco for the specific specy {}_{} : {}".format(specy, specific, count))
Note that species or specifics with a count of 0 will not appear.
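If the zero-count combinations are wanted as well, one possible approach is to iterate over every pair explicitly and look each one up. This is only a sketch: the counter below is filled in by hand with the counts from the small example, and the names `specific_counter` and `species_ids` are assumptions made for illustration. It relies on the documented behaviour that a Counter returns 0 for a missing key without storing it.

```python
import itertools
from collections import Counter

# Counts from the small example, keyed by (species, specific).
specific_counter = Counter({("0035", "0042"): 3, ("0042", "0042"): 4})
species_ids = ["0035", "0042"]

for specy, specific in itertools.product(species_ids, repeat=2):
    # A missing key reads as 0 and is not added to the Counter.
    count = specific_counter[(specy, specific)]
    print("Total busco for the specific specie {}_{} : {}".format(specy, specific, count))

print(len(specific_counter))  # still 2: the lookups added no keys
```

Unlike a `defaultdict(int)`, which inserts a key the first time it is read, the Counter stays unchanged after these lookups, so later iteration over it still shows only the combinations that were actually seen.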