Given my fna.gz genome file as input, I want to return the nth base pair. In theory, it would look like this:
allele = genome[14325]
print(allele)
#: G
This is the code I have right now:
from Bio import SeqIO
import gzip
from Bio.Alphabet import generic_dna

input_file = r"C:\Users\blake\PycharmProjects\Transcendence3.0\DNA\GCF_000001405.38_GRCh38.p12_genomic.fna.gz"
output_file = r"C:\Users\blake\PycharmProjects\Transcendence3.0\DNA\Probabilities"

with gzip.open(input_file, "rt") as handle:
    for record in SeqIO.parse(input_file, "fasta", generic_dna):
        fasta_sequences = SeqIO.parse(open(input_file), 'fasta')
        print("seq parsed")
        with open(output_file) as out_file:
            for fasta in fasta_sequences:
                name, sequence = fasta.id, str(fasta.seq)
                new_allele = tell_basepair(sequence)
                write_fasta(out_file)

def tell_basepair(n, seq):
    bp = seq[n-1]
    return bp
But it doesn't work, and I get this error:
UnicodeDecodeError: 'charmap' codec can't decode byte 0x8f in position 386: character maps to <undefined>
Answer 0 (score: 0)
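You can pass the handle returned by gzip.open to SeqIO.parse, rather than the filename:
import gzip
from Bio import SeqIO
with gzip.open("practicezip.fasta.gz", "rt") as handle:
    for record in SeqIO.parse(handle, "fasta"):
        pass  # your code
Alternatively, if the file is already decompressed, you can parse it by filename:
from Bio import SeqIO
from Bio.Alphabet import generic_dna  # removed in Biopython 1.78; on newer versions drop this import and the third argument below
filename = "yourfastafilename"
for record in SeqIO.parse(filename, "fasta", generic_dna):
    pass  # your code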
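As for why the error appears: the question passes the filename to SeqIO.parse (and later calls open(input_file)), so the compressed bytes are most likely being decoded as text with the Windows default codec, which has no character for byte 0x8f. The sketch below illustrates the difference using only the standard library; the temporary file name and its contents are made up for the demonstration and are not the genome from the question.

```python
import gzip
import os
import tempfile

# Hypothetical demo file, created only for this illustration.
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "demo.fna.gz")

# Write a tiny gzipped FASTA file in text mode.
with gzip.open(path, "wt") as handle:
    handle.write(">chr_demo\nACGTACGT\n")

# Plain open() sees the compressed bytes, which begin with the gzip magic number.
with open(path, "rb") as raw:
    magic = raw.read(2)

# gzip.open(..., "rt") decompresses and decodes, so we get readable text.
with gzip.open(path, "rt") as handle:
    first_line = handle.readline().rstrip("\n")

print(magic)
print(first_line)
```

Reading those compressed bytes in text mode is what a codec such as cp1252 can fail on, whereas the handle from gzip.open yields ordinary FASTA lines.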
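Apart from the UnicodeDecodeError, you may also want to define the function some_function(sequence) before calling it; otherwise Python will not know what to do when it is called. Note that tell_basepair as written takes two arguments, the position n and the sequence, whereas the question calls it with the sequence only. For example: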
def tell_basepair(n, seq):
    # n is 1-based, so the n-th base sits at index n - 1
    bp = seq[n-1]
    return bp
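Putting the pieces together, here is a minimal sketch of looking up the n-th base of one record in a gzipped FASTA file. It deliberately uses only the standard library rather than Biopython, so that it runs without extra installs; the function name nth_base, its record_id parameter, and the demo file are all assumptions made for illustration. It streams line by line instead of building the whole sequence string, which may matter for a genome-sized file.

```python
import gzip
import os
import tempfile


def nth_base(path, n, record_id=None):
    """Return the n-th base (1-based) of a record in a gzipped FASTA file.

    If record_id is None, the first record in the file is used.
    Raises KeyError if the record is not found and IndexError if it is
    shorter than n bases.
    """
    if n < 1:
        raise ValueError("n must be >= 1")
    in_target = False
    seen = 0  # bases of the target record consumed so far
    with gzip.open(path, "rt") as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if in_target:
                    break  # target record ended before reaching position n
                parts = line[1:].split(None, 1)
                current_id = parts[0] if parts else ""
                in_target = record_id is None or current_id == record_id
                continue
            if not in_target:
                continue
            if seen + len(line) >= n:
                return line[n - seen - 1]
            seen += len(line)
    if not in_target:
        raise KeyError(record_id)
    raise IndexError(f"record has fewer than {n} bases")


# Hypothetical two-record demo file.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.fna.gz")
with gzip.open(demo_path, "wt") as out:
    out.write(">chr_demo1 first record\nACGT\nTTGA\n>chr_demo2\nGGCC\n")

print(nth_base(demo_path, 7))               # 7th base of the first record
print(nth_base(demo_path, 3, "chr_demo2"))  # 3rd base of chr_demo2
```

With Biopython, the equivalent lookup on a record obtained from SeqIO.parse(handle, "fasta") would be str(record.seq)[n - 1], at the cost of holding that record's full sequence in memory.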