Question

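I need to write a function that takes a filename (a FASTA file) as its argument, reads the sequences, and prints the sequence ID and molecular weight of each unambiguous sequence. The FASTA file contains both ambiguous and unambiguous sequences.

So far I have these two pieces of code working separately. I don't know how to skip the ambiguous sequences in the FASTA file and compute the molecular weight only for the unambiguous ones. As expected, it raises an error if I try, because I only entered weights for A, C, G and T, not for the ambiguity codes. Can anyone help me figure out how to skip those sequences? Thanks!

Also, I don't understand how to combine them into a single function. I tried putting two for loops in one function, but it always fails. I think I have to change the parameter of calc_mol_weight to match the seq_records above, but I don't see how to make them compatible.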

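from Bio import SeqIO
from Bio.Alphabet import generic_dna  # needs Biopython < 1.78; Bio.Alphabet was removed in 1.78

seq_records = SeqIO.parse('short.fasta', 'fasta', alphabet=generic_dna)
seq_record_list = list(seq_records)
for seq_rec in seq_record_list:
    print(f'{seq_rec.id}')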

def calc_mol_weight(sequence):   
    mol_weight = 0.0
    nucleotide_weights = {'A':331.2218, 'T':322.2085, 'C':307.1971, 'G':347.2212}
    for nucl in sequence:
        mol_weight += nucleotide_weights[nucl]
    return mol_weight

For reference, short.fasta:

>seq_7009 random sequence
DGRGGGWAVCVAACGTTGAT
>seq_418 random sequence
GAGCTGVTATST
>seq_9143_unamb random sequence
ACCGTTAAGCCTTAG
>seq_2888 random sequence
RVCCWDGARATAGBCGC
>seq_1101 random sequence
CSAATGYGATNBTA
>seq_107 random sequence
WGDGHGCDCTYANGTTWCA
>seq_6946 random sequence
TCVMBRAGRSGTCCAWA
>seq_6162 random sequence
YWBGCKTGCCAAGCGCDG
>seq_504 random sequence
ADDTAACCCTCTTKA
>seq_3535 random sequence
KKGTACACCAG
>seq_4077 random sequence
SRWSCRTTRVAGDCC
> seq_1626_unamb random sequence
GGATATTACCTA

Answer 1

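Two ways to skip the ambiguous sequences come to mind.

You can use if nucl in nucleotide_weights, which checks whether the character is a key in the dictionary. It evaluates to True if the character is present, in which case you can process this nucl, and to False if it is not. So you can do this: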

if nucl not in nucleotide_weights:
    break

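If the sequence is ambiguous, this breaks out of the loop over the current sequence. Since break only stops the loop, the function also needs some way to signal that the sequence was skipped (for example by returning None), so that no partial weight gets printed.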
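The membership check above could be folded into the asker's calc_mol_weight roughly as follows. This is only a sketch: it swaps the break for an early return None so the caller can tell a skipped sequence from a real weight, and the module-level name NUCLEOTIDE_WEIGHTS is introduced here for illustration.

```python
# Weights copied from the question; NUCLEOTIDE_WEIGHTS is a name chosen for this sketch
NUCLEOTIDE_WEIGHTS = {'A': 331.2218, 'T': 322.2085, 'C': 307.1971, 'G': 347.2212}


def calc_mol_weight(sequence):
    """Return the summed weight, or None if any letter is not A, C, G or T."""
    mol_weight = 0.0
    for nucl in sequence:
        if nucl not in NUCLEOTIDE_WEIGHTS:
            # Ambiguity code found: give up on this sequence entirely
            return None
        mol_weight += NUCLEOTIDE_WEIGHTS[nucl]
    return mol_weight


# One ambiguous and one unambiguous sequence from short.fasta
for seq in ('GAGCTGVTATST', 'ACCGTTAAGCCTTAG'):
    weight = calc_mol_weight(seq)
    if weight is not None:
        print(seq, weight)
```

Returning None rather than a partial sum keeps the "skip" decision in one place: the caller only has to test the result before printing.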

The other option is a try/except block.

It basically looks like this:

try:
    mol_weight += nucleotide_weights[nucl]
except KeyError:  # raised when nucl is not a key in the dictionary
    break

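You can name a specific exception after the except keyword so that the block does not swallow unrelated errors (you don't want that). What you need to know is this: if an exception is raised (for example ValueError, IndexError, KeyError, TypeError, ...), the code in the matching except block runs. Here a missing dictionary key raises KeyError, so the break statement exits the loop and the ambiguous sequence is ignored. :)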
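For completeness, here is one way the try/except variant might look as a whole function. It is a sketch under a few assumptions: the function name calc_mol_weight_try and the ambiguous flag are made up for this example, and only KeyError is caught so that other errors still surface.

```python
def calc_mol_weight_try(sequence):
    """Sum nucleotide weights; return None when a letter has no known weight."""
    nucleotide_weights = {'A': 331.2218, 'T': 322.2085, 'C': 307.1971, 'G': 347.2212}
    mol_weight = 0.0
    ambiguous = False  # hypothetical flag recording why the loop ended
    for nucl in sequence:
        try:
            mol_weight += nucleotide_weights[nucl]
        except KeyError:
            # The letter is not A, C, G or T, so stop looking at this sequence
            ambiguous = True
            break
    return None if ambiguous else mol_weight


# One ambiguous and one unambiguous sequence from short.fasta
print(calc_mol_weight_try('KKGTACACCAG'))
print(calc_mol_weight_try('GGATATTACCTA'))
```

The flag is what makes the break useful: without it, the function would return the partial sum accumulated before the first ambiguous letter.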

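As for your need for a single function and the double loop, I think the error comes from the first part: you are looking up data you never registered in the dictionary. If that is not the cause, please post the code you tried for the double loop together with the traceback. :)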
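As a rough illustration of how the two pieces from the question might be combined, the sketch below keeps calc_mol_weight working on plain strings and adds a thin wrapper that walks over records. The names unambiguous_weights, print_weights_from_fasta and Record are hypothetical. It assumes each record exposes .id and .seq, as Biopython's SeqRecord does, and it passes str(rec.seq) so the weight function never has to know about records.

```python
from collections import namedtuple

NUCLEOTIDE_WEIGHTS = {'A': 331.2218, 'T': 322.2085, 'C': 307.1971, 'G': 347.2212}


def calc_mol_weight(sequence):
    """Return the summed weight of a string, or None if it is ambiguous."""
    total = 0.0
    for nucl in sequence:
        if nucl not in NUCLEOTIDE_WEIGHTS:
            return None
        total += NUCLEOTIDE_WEIGHTS[nucl]
    return total


def unambiguous_weights(records):
    """Yield (id, weight) for every record made only of A, C, G and T."""
    for rec in records:
        weight = calc_mol_weight(str(rec.seq))
        if weight is not None:
            yield rec.id, weight


def print_weights_from_fasta(filename):
    """Print ID and weight for each unambiguous sequence in a FASTA file."""
    # Imported here so the rest of the sketch runs without Biopython installed
    from Bio import SeqIO
    for seq_id, weight in unambiguous_weights(SeqIO.parse(filename, 'fasta')):
        print(f'{seq_id}\t{weight:.4f}')


# Stand-in records (not Biopython objects) built from two entries of short.fasta
Record = namedtuple('Record', ['id', 'seq'])
demo = [Record('seq_418', 'GAGCTGVTATST'),
        Record('seq_9143_unamb', 'ACCGTTAAGCCTTAG')]
results = list(unambiguous_weights(demo))
print(results)
```

With Biopython installed, calling print_weights_from_fasta('short.fasta') would be the single entry point the question asks for; a single loop over the records suffices because the inner loop over nucleotides lives inside calc_mol_weight.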

Answer 2

I hope this helps, @tfabiant. To run this code, type in the terminal: python script.py fastafile.fasta

import sys

def unambiguous(sequences):
    # Weights of the four unambiguous DNA nucleotides
    nucleotide_weights = {'A':331.2218, 'T':322.2085, 'C':307.1971, 'G':347.2212}
    for seqname, sequence in sequences.items():
        weight = 0.0
        for nucleotide in sequence:
            if nucleotide not in nucleotide_weights:
                # Any other letter marks the whole sequence as ambiguous
                weight = None
                break
            weight += nucleotide_weights[nucleotide]
        if weight is not None:
            print("%s\t\tWEIGHT %s" % (seqname, weight))

def readfasta(path):
    # Build a dictionary that maps each header line to its sequence
    seqs = {}
    name = ""  # header of the record currently being read
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                name = line
                seqs[name] = ""
            elif line:
                seqs[name] += line.upper()
    return seqs

# To run this code, in the terminal: python readDNA.py fastafile.fasta
# OR insert these functions in your code
if __name__ == "__main__":
    unambiguous(readfasta(sys.argv[1]))

Python: print the molecular weight of unambiguous DNA sequences while skipping ambiguous ones
