Question

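I need to count the occurrences of a motif (including overlapping ones) in sequences. The motif is passed on the first line of standard input, and the sequences on the following lines. Sequence names start with >, and anything after the space is a comment about the sequence that should be ignored. The program input looks like this: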

  AT
  >seq1 Comment......
  AGGTATA
  TGGCGCC
  >seq2 Comment.....
  GGCCGGCGC

The output should be:

   seq1: 2
   seq2: 0

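I decided to save the first line as the motif, strip the comment from each sequence name, merge the lines of each sequence into a single line, and store the sequence names (keys) and sequences (values) in a dictionary. I also wrote a function, motif_count, that I want to call on the dictionary values, saving the results in a separate dictionary for the final output. Can I do it this way, or is there a better way?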

#!/usr/bin/env python3

import sys

sequence = sys.stdin.readlines()
motif = sequence[0]
d = {}
temp_genename = None
temp_sequence = None

def motif_count(m, s):
    count = 0
    next_pos = -1
    while True:
        next_pos = s.find(m, next_pos + 1)
        if next_pos < 0:
            break
        count += 1
    return count

if sequence[1][0] != '>':
    print("ERROR")
    exit(1)

for line in sequence[1:]:
    if line[0] == '>':
        temp_genename = line.split(' ')[0].strip()
        temp_sequence = ""
    else:
        temp_sequence += line.strip()
        d[temp_genename] = temp_sequence

for value in d:
    motif_count(motif, value)
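
The last step described above (calling the counting function on every dictionary value and keeping the results in a separate dictionary) could be sketched roughly as follows. This is only an illustration under assumptions: the `sequences` dictionary is hand-written sample data in the shape the parsing loop is meant to produce, and the name `counts` is made up here. Two details matter: a line read with `readlines()` keeps its trailing newline, so the motif needs `strip()`, and iterating over a dictionary directly yields its keys, so `items()` is used to reach the sequences.

```python
def motif_count(motif, seq):
    """Count occurrences of motif in seq, including overlapping ones."""
    count = 0
    pos = seq.find(motif)
    while pos != -1:
        count += 1
        # Advance by one character only, so overlapping matches are found.
        pos = seq.find(motif, pos + 1)
    return count


# Hypothetical sample data: names mapped to already merged sequences.
sequences = {"seq1": "AGGTATATGGCGCC", "seq2": "GGCCGGCGC"}

# A motif read from stdin would still carry its newline; strip() removes it.
motif = "AT\n".strip()

# items() gives (name, sequence) pairs; a bare loop over the dict gives keys.
counts = {name: motif_count(motif, seq) for name, seq in sequences.items()}

for name, n in counts.items():
    print(f"{name}: {n}")
```

Keeping the counts in their own dictionary leaves the parsed sequences untouched, which helps if more than one motif has to be counted later.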

Answer 1

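You can simplify the code with a dictionary and string expressions that extract the keys you need for processing. Assuming your sequence entries are consistent and similar to the ones you provided, you can split off the redundant "This sequence is" comment text, then keep only the uppercase letters, and finally count the occurrences of the motif. This can be done as follows: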

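def motif_count(motif, key):
    # replace the stored sequence with its motif count
    # (str.count counts non-overlapping occurrences only)
    d[key] = d[key].count(motif)

sequence = """AT
>seq1 This sequence is from bacterial genome
AGGTATA
TGGCGCC
>seq2 This sequence is rich is CG
GGCCGGCGC""".split('\n')

d = {}
# print error if format is wrong
if sequence[1][0] != '>':
    print("ERROR")
else:
    # one string per record: name, comment and sequence run together
    seq = "".join(sequence).split('>')[1:]
    func = lambda line: line.split(' This sequence is ')
    # keep only uppercase letters; uppercase letters in a comment
    # (such as "CG" for seq2) end up in the sequence as well
    d = dict((func(line)[0], ''.join([c for c in func(line)[1] if c.isupper()]))
             for line in seq)

    motif = sequence[0]
    # replace seq with its count
    for key in d:
        motif_count(motif, key)

    # print output
    print(d)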

Output:

{'seq1': 2, 'seq2': 0}
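
Two caveats apply to the code above: `str.count` counts non-overlapping occurrences only, and splitting on ' This sequence is ' ties the parsing to one particular comment wording. A possible sketch that avoids both assumptions is shown below. The helper names `parse_fasta` and `count_overlapping` are invented for this example, and the regex lookahead is a swapped-in technique, not part of the answer's approach.

```python
import re


def parse_fasta(lines):
    """Map each sequence name to its concatenated sequence lines."""
    sequences = {}
    name = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # The first word after '>' is the name; the rest is a comment.
            parts = line[1:].split()
            name = parts[0] if parts else ""
            sequences[name] = ""
        elif name is None:
            raise ValueError("sequence data before the first '>' header")
        else:
            sequences[name] += line
    return sequences


def count_overlapping(motif, seq):
    """Count motif occurrences, overlaps included, via a zero-width lookahead."""
    return len(re.findall(f"(?={re.escape(motif)})", seq))


text = """AT
>seq1 Comment......
AGGTATA
TGGCGCC
>seq2 Comment.....
GGCCGGCGC"""

lines = text.splitlines()
motif = lines[0].strip()
counts = {name: count_overlapping(motif, seq)
          for name, seq in parse_fasta(lines[1:]).items()}

for name, n in counts.items():
    print(f"{name}: {n}")
```

For input from standard input, `lines` could be taken from `sys.stdin.read().splitlines()` instead of the sample string. The difference from `str.count` only shows when matches overlap: for the motif AA in AAAA, `count_overlapping` finds 3 occurrences while `"AAAA".count("AA")` gives 2.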

How do I count occurrences of a substring in dictionary values?
