I have a list named self.__sequences that holds some DNA sequences. Here is part of that list:
['AAAACATCAGTATCCATCAGGATCAGTTTGGAAAGGGAGAGGCAATTTTTCCTAAACATGTGTTCAAATGGTCTGAGACAGACGTTAAAATGAAAAGGGG\n', 'TTAGAAACTATGGGATTATTCACTCCCTAGGTACTGAGAATGGAAACTTTCTTTGCCTTAATCGTTGACATCCCCTCTTTTAGGTTCTTGCTTCCTAACA\n', 'CTGAGTAAATCATATACTCAATGATTTTTTTATGTGTGTGCATGTGTGCTGTTGATATTCTTCAGTACCAAAACCCATCATCTTATTTGCATAGGGAAGT\n', 'CTGCCAGCACGCTGTCACCTCTCAATAACAGTGAGTGTAATGGCCATACTCTTGATTTGGTTTTTGCCTTATGAATCAGTGGCTAAAAATATTATTTAAT\n', 'ACTTATATTATGTTGACACTCAAAAATTTCAGAATTTGGAGTATTTTGAATTTCAGATTTTCTGATTAGGGATGTACCTGTACTTTTTTTTTTTTTTTTT\n', 'TTTGTTCTTTTTGTAATGGGGCCAGATGTCACTCATTCCACATGTAGTATCCAGATTGAAATGAAATGAGGTAGAACTGACCCAGGCTGGACAAGGAAGG\n', 'AAGAGGTAAAGGAAACAGACTGATGGCTGGAGAATTTGACAACGTATAAGAGAATCTGAGAATTCTTTTGAAAAATACTCAAATTTCCAGCCAAGATAGA\n', 'ACACTTGAGCATTAAGAGGAAACACCAAGGAAACAGATTTTAGGTCAAGAAAAAGAAGAGCTCTCTCATGTCAGAGCAGCCTAGAGCAGGAAAGTGCTGT\n', 'ACATCTATGCCCACCACACCTNGGTATGCANTGATGCTCATGAGATGGGAGGTGGCTACAGATTGCTCCATATAGAAATGTTACCTAGCATGTTAAAGAT\n']
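I want to compute the GC content of each DNA sequence and return a dictionary that maps each sequence to its GC content. For example, something like: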
{(AAAACATCAGTATCCATCAGGATCAGTTTGGAAAGGGAGAGGCAATTTTTCCTAAACATGTGTTCAAATGGTCTGAGACAGACGTTAAAATGAAAAGGGG:0.5), (TTAGAAACTATGGGATTATTCACTCCCTAGGTACTGAGAATGGAAACTTTCTTTGCCTTAATCGTTGACATCCCCTCTTTTAGGTTCTTGCTTCCTAACA:0.33)}
gc content= (Count(G) + Count(C)) / (Count(A) + Count(T) + Count(G) + Count(C))
I wrote the code below, but it gives me nothing!
def get_gc_content(self):
    for i in range (len(self.__sequence)):
        if seq[i] in self.__sequence:
            return (seq.count('G')+seq.count('C'))/float(seq.count('G')+seq.count('C')+seq.count('T')+seq.count('A'))
Can someone help me improve my code?
Answer 0 (score: 1)
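Assuming you are analyzing DNA (not RNA, etc.) and you strip() newlines and whitespace from each sequence, seq.count('A') + seq.count('G') + seq.count('C') + seq.count('T') always equals len(seq), as long as the sequence contains no other symbols (the last sequence in the question contains N, for example).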
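A quick way to check that identity is sketched below. The helper name `acgt_total` is hypothetical and the strings are made-up fragments, not data from the question.

```python
def acgt_total(seq):
    """Count only the four unambiguous DNA letters."""
    return sum(seq.count(base) for base in "ACGT")

raw = "AAAACATCAG\n"   # made-up fragment with a trailing newline
clean = raw.strip()    # strip() removes the newline

print(acgt_total(clean) == len(clean))  # True
print(acgt_total(raw) == len(raw))      # False: len() also counts the newline

with_n = "ACCTNGGTAT"  # made-up fragment containing an ambiguous base N
print(acgt_total(with_n) == len(with_n))  # False: N is not one of A, C, G, T
```

If sequences may contain N or other symbols, dividing by the A+C+G+T total (as in the question's formula) and dividing by len(seq) give different results, so it is worth deciding which one you want.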
Note that seq.some_method_name operates on the whole sequence. You do not need a for loop that iterates over the elements of the sequence at all.
i in self.__sequence is always False (you take an integer and check whether it is in a sequence of the four possible letters), so it does nothing.
The first return inside the loop will break out of the loop.
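Both points can be seen in a small sketch. The function name `first_gc_only` and the toy sequences are hypothetical; the function only mimics the structure of the question's code, with the return placed inside the loop.

```python
def first_gc_only(sequences):
    """Mimics the question's structure: `return` sits inside the loop."""
    for seq in sequences:
        seq = seq.strip()
        # Returning here ends the function during the first iteration,
        # so the remaining sequences are never examined.
        return (seq.count("G") + seq.count("C")) / float(len(seq))

toy = ["GGCC\n", "AATT\n"]  # made-up sequences

print(0 in toy)            # False: an integer index is never equal to a string
print(first_gc_only(toy))  # 1.0; "AATT" is never reached
```

To get one value per sequence, the results have to be collected inside the loop and returned once, after the loop has finished.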
Here is a piece of code that seems to work:
def getContentOf(target_list, seq):
    # add a 1 for each nucleotide in target_list
    target_count = sum(1 for x in seq if x in target_list)
    return float(target_count) / len(seq)
The results look reasonable:
>>> getContentOf(['G', 'C'], 'AGCT')
0.5
>>> getContentOf(['G', 'C'], 'AGCTATAT')
0.25
>>> _
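So what you need is {seq: getContentOf(['G', 'C'], seq)} for each sequence.
BTW, the sequences you provided in your post seem to have a different G + C content than your example states.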
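Applied to a whole list, that idea might look like the sketch below. The variable names `sequences` and `gc_by_seq` are hypothetical, getContentOf is repeated so the snippet runs on its own, and the two toy strings are the ones from the interactive session with a newline appended.

```python
def getContentOf(target_list, seq):
    # add a 1 for each nucleotide in target_list
    target_count = sum(1 for x in seq if x in target_list)
    return float(target_count) / len(seq)

sequences = ["AGCT\n", "AGCTATAT\n"]  # toy input with trailing newlines

# Strip each sequence once, then use the clean string as both key and input.
gc_by_seq = {
    clean: getContentOf(["G", "C"], clean)
    for clean in (raw.strip() for raw in sequences)
}
print(gc_by_seq)  # {'AGCT': 0.5, 'AGCTATAT': 0.25}
```

Note that a line that is empty after strip() would raise ZeroDivisionError here, so blank lines should be filtered out first if the input can contain them.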
Answer 1 (score: 0)
How about:
self.myDict = {}

def create_dna_dict(self):
    for i in seq:
        if i in self.__sequence:
            # count within the current sequence i, not within the whole collection
            self.myDict[i] = (i.count('G') + i.count('C')) / float(i.count('G') + i.count('C') + i.count('T') + i.count('A'))
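But a few things:
Shouldn't seq be self.seq?
__sequence is a very odd variable name. It seems unconventional.
I'm pretty sure your dict is malformed. It contains tuples and the keys are not quoted strings: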
{(AAAACATCAGTATCCATCAGGATCAGTTTGGAAAGGGAGAGGCAATTTTTCCTAAACATGTGTTCAAATGGTCTGAGACAGACGTTAAAATGAAAAGGGG:0.5), (TTAGAAACTATGGGATTATTCACTCCCTAGGTACTGAGAATGGAAACTTTCTTTGCCTTAATCGTTGACATCCCCTCTTTTAGGTTCTTGCTTCCTAACA:0.33)}
It should look like this, with the parentheses removed and the keys written as strings:
{"AAAACATCAGTATCCATCAGGATCAGTTTGGAAAGGGAGAGGCAATTTTTCCTAAACATGTGTTCAAATGGTCTGAGACAGACGTTAAAATGAAAAGGGG":0.5, "TTAGAAACTATGGGATTATTCACTCCCTAGGTACTGAGAATGGAAACTTTCTTTGCCTTAATCGTTGACATCCCCTCTTTTAGGTTCTTGCTTCCTAACA":0.33}
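Putting the pieces together, a method that returns such a dict could be sketched as follows. The class name `DnaReads`, its constructor, and the toy input are assumptions made for illustration; only the attribute name __sequence and the method name get_gc_content come from the question.

```python
class DnaReads(object):
    """Hypothetical container; only the attribute name comes from the question."""

    def __init__(self, sequences):
        self.__sequence = sequences

    def get_gc_content(self):
        result = {}
        for raw in self.__sequence:
            seq = raw.strip()  # drop the trailing newline
            gc = seq.count("G") + seq.count("C")
            total = gc + seq.count("A") + seq.count("T")
            if total:  # skip sequences with no A/C/G/T to avoid dividing by zero
                result[seq] = gc / float(total)
        return result  # returned once, after every sequence has been processed

reads = DnaReads(["GGCC\n", "AGCTATAT\n", "ACNT\n"])  # made-up sequences
print(reads.get_gc_content())
# {'GGCC': 1.0, 'AGCTATAT': 0.25, 'ACNT': 0.3333333333333333}
```

The denominator follows the formula in the question (A + T + G + C), so an ambiguous base such as N is left out of the total rather than counted as a non-GC base.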