How do I count occurrences of a substring in dictionary values?

Time: 2019-05-08 07:40:02

Tags: python string dictionary

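I need to count the occurrences of a motif (including overlapping ones) in sequences. The motif is passed on the first line of standard input and the sequences on the following lines. Sequence names start with >, and anything after the first space is a comment about the sequence that must be ignored. The program input looks like this: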

  AT
  >seq1 Comment......
  AGGTATA
  TGGCGCC
  >seq2 Comment.....
  GGCCGGCGC

The output should be:

   seq1: 2
   seq2: 0
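
To make the "including overlaps" requirement concrete, here is a minimal sketch contrasting Python's built-in str.count, which counts only non-overlapping matches, with a find-based loop that restarts one character after each hit. The helper name count_overlapping is an assumption made for illustration and is not part of the original question.

```python
def count_overlapping(motif, sequence):
    """Count occurrences of motif in sequence, allowing overlaps."""
    if not motif:
        # An empty motif has no meaningful count; treat it as zero.
        return 0
    count = 0
    pos = sequence.find(motif)
    while pos != -1:
        count += 1
        # Restart one character after the hit, not after the whole match.
        pos = sequence.find(motif, pos + 1)
    return count


print("AAAA".count("AA"))                # 2: non-overlapping matches only
print(count_overlapping("AA", "AAAA"))   # 3: overlapping matches included
```

For the sample input above, the two lines of seq1 join into AGGTATATGGCGCC, which contains AT twice; one of those matches spans the line break, which is why the lines have to be joined before counting.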

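I decided to save the first line as the motif, strip the comments from the sequence names, join the lines of each sequence into one string, and store the sequence names (keys) and sequences (values) in a dictionary. I also wrote a function, motif_count, which I want to call on the dictionary values, saving the results in a separate dictionary for the final output. Can I do it this way, or is there a better way?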

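#!/usr/bin/env python3

import sys

sequence = sys.stdin.readlines()
# Strip the trailing newline, otherwise the motif can never match.
motif = sequence[0].strip()
d = {}
temp_genename = None

def motif_count(m, s):
    """Count occurrences of m in s, including overlapping ones."""
    count = 0
    next_pos = -1
    while True:
        # Resume one character past the previous hit so overlaps are found.
        next_pos = s.find(m, next_pos + 1)
        if next_pos < 0:
            break
        count += 1
    return count

if sequence[1][0] != '>':
    print("ERROR")
    sys.exit(1)

for line in sequence[1:]:
    if line[0] == '>':
        # Keep only the name: drop the leading '>' and the comment.
        temp_genename = line[1:].split(' ')[0].strip()
        d[temp_genename] = ""
    else:
        # Join the lines of the current sequence into one string.
        d[temp_genename] += line.strip()

# Store the counts in a separate dictionary, then print them.
counts = {name: motif_count(motif, seq) for name, seq in d.items()}
for name, n in counts.items():
    print("{}: {}".format(name, n))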

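1 Answer:

Answer 0 (score: 0)

You can simplify the code by using a dictionary and string expressions to pull out the keys and values you need. Assuming your sequence records are consistent and resemble the ones you provided, you can split off the redundant This sequence is text, then keep only the uppercase letters, and finally count the occurrences of the motif. This can be done as follows: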

def motif_count(motif, key):
    d[key] = d[key].count(motif)

sequence = """AT
>seq1 This sequence is from bacterial genome
AGGTATA
TGGCGCC
>seq2 This sequence is rich is CG
GGCCGGCGC""".split('\n')

d = {}
# print error if format is wrong
if sequence[1][0] != '>':
    print("ERROR")

else: 
    seq  = "".join(sequence).split('>')[1:]
    func = lambda line: line.split(' This sequence is ')
    d    = dict((func(line)[0], ''.join([c for c in func(line)[1] if c.isupper()]))
                 for line in seq)

    motif = sequence[0]
    # replace seq with its count
    for key in d:
       motif_count(motif, key)

    # print output
    print(d)

Output:

{'seq1': 2, 'seq2': 0}
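
Two caveats about the answer's code are worth keeping in mind: str.count only counts non-overlapping matches, and the uppercase filter also keeps capital letters that appear in a comment (the CG at the end of the seq2 comment ends up in the sequence). The following is a rough sketch of a variant that avoids both, not the answerer's method. It assumes every header line has a name right after >, that the motif is non-empty, and it swaps in a regular-expression lookahead for the counting; the names parse_records and count_overlapping are hypothetical.

```python
import re


def parse_records(lines):
    """Map each sequence name to its joined sequence; comments are dropped."""
    records = {}
    name = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Assumption: the name is the first token after '>'.
            name = line[1:].split()[0]
            records[name] = ""
        elif name is not None:
            # Append this line to the sequence of the current record.
            records[name] += line
    return records


def count_overlapping(motif, sequence):
    # A zero-width lookahead matches at every start position where the
    # motif occurs, so overlapping occurrences are all counted.
    return len(re.findall("(?=" + re.escape(motif) + ")", sequence))


text = """AT
>seq1 This sequence is from bacterial genome
AGGTATA
TGGCGCC
>seq2 This sequence is rich is CG
GGCCGGCGC"""

lines = text.split("\n")
motif = lines[0].strip()
counts = {name: count_overlapping(motif, seq)
          for name, seq in parse_records(lines[1:]).items()}

for name, n in counts.items():
    print("{}: {}".format(name, n))
```

On this sample the sketch prints seq1: 2 and seq2: 0, matching the expected output. To read from standard input instead, lines could be taken from sys.stdin.readlines() in place of the inline text.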