Question

I am writing a short Python program to read a FASTA file, which usually has the following format:

>gi|253795547|ref|NC_012960.1| Candidatus Hodgkinia cicadicola Dsem chromosome, 52 lines
GACGGCTTGTTTGCGTGCGACGAGTTTAGGATTGCTCTTTTGCTAAGCTTGGGGGTTGCGCCCAAAGTGA
TTAGATTTTCCGACAGCGTACGGCGCGCGCTGCTGAACGTGGCCACTGAGCTTACACCTCATTTCAGCGC
TCGCTTGCTGGCGAAGCTGGCAGCAGCTTGTTAATGCTAGTGTTGGGCTCGCCGAAAGCTGGCAGGTCGA

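I have already written another program that reads the first line (the header) of this FASTA file, and now I want this second program to start reading and printing from the sequence.

How would I do that?

So far, I have this: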

FASTA = open("test.txt", "r")

def readSeq(FASTA):
    """returns the DNA sequence of a FASTA file"""
    for line in FASTA:
        line = line.strip()
        print(line)


readSeq(FASTA)

Thanks, guys

-Noob

Answer 1

def readSeq(FASTA):
    """returns the DNA sequence of a FASTA file"""
    _unused = next(FASTA)  # skip the header record (built-in next() works on Python 2.6+ and 3)
    for line in FASTA:
        line = line.strip()
        print(line)

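Read the docs on file.next() to see why you should be wary of mixing file.readline() with for line in file: (in Python 2 the file iterator uses a hidden read-ahead buffer, so combining the two can skip data).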
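One caveat with skipping the header this way: calling next() on an empty file raises StopIteration. The following is a minimal sketch rather than the answerer's code. The helper name read_seq_lines is hypothetical, and an in-memory io.StringIO stands in for the real file. Passing a default to next() makes an empty input simply return an empty list.

```python
import io

def read_seq_lines(handle):
    """Return the sequence lines of a single-record FASTA handle, header skipped."""
    # Discard the header; the None default avoids StopIteration on empty input.
    next(handle, None)
    return [line.strip() for line in handle]

# io.StringIO stands in for open("test.txt") so the sketch is self-contained.
demo = io.StringIO(">demo header\nGACGGC\nTTAGAT\n")
for seq_line in read_seq_lines(demo):
    print(seq_line)
```

Returning a list instead of printing makes the function easier to reuse, at the cost of holding all lines in memory.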

Answer 2

You should show your script. To read from the second line, use something like this:

f = open("file")
f.readline()  # read and discard the first line (the header)
for line in f:
    print(line)
f.close()
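The same readline() idea can be written with a with block, so the file is closed even if an error occurs partway through. This is only a sketch under assumptions: the function name lines_after_header is made up for illustration, and a throwaway temporary file stands in for the real FASTA file.

```python
import os
import tempfile

def lines_after_header(path):
    """Return every line after the first, with trailing newlines removed."""
    with open(path) as f:  # closed automatically when the block exits
        f.readline()       # read and discard the header line
        return [line.rstrip("\n") for line in f]

# Write a tiny throwaway FASTA file so the example is runnable as-is.
with tempfile.NamedTemporaryFile("w", suffix=".fasta", delete=False) as tmp:
    tmp.write(">demo header\nGACGGC\nTTAGAT\n")

for line in lines_after_header(tmp.name):
    print(line)

os.remove(tmp.name)  # clean up the temporary file
```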

Answer 3

You might be interested in checking how BioPython handles FASTA files (source).

def FastaIterator(handle, alphabet = single_letter_alphabet, title2ids = None):
    """Generator function to iterate over Fasta records (as SeqRecord objects).

    handle - input file
    alphabet - optional alphabet
    title2ids - A function that, when given the title of the FASTA
    file (without the beginning >), will return the id, name and
    description (in that order) for the record as a tuple of strings.

    If this is not given, then the entire title line will be used
    as the description, and the first word as the id and name.

    Note that use of title2ids matches that of Bio.Fasta.SequenceParser
    but the defaults are slightly different.
    """
    #Skip any text before the first record (e.g. blank lines, comments)
    while True:
        line = handle.readline()
        if line == "" : return #Premature end of file, or just empty?
        if line[0] == ">":
            break

    while True:
        if line[0]!=">":
            raise ValueError("Records in Fasta files should start with '>' character")
        if title2ids:
            id, name, descr = title2ids(line[1:].rstrip())
        else:
            descr = line[1:].rstrip()
            id = descr.split()[0]
            name = id

        lines = []
        line = handle.readline()
        while True:
            if not line : break
            if line[0] == ">": break
            #Remove trailing whitespace, and any internal spaces
            #(and any embedded \r which are possible in mangled files
            #when not opened in universal read lines mode)
            lines.append(line.rstrip().replace(" ","").replace("\r",""))
            line = handle.readline()

        #Return the record and then continue...
        yield SeqRecord(Seq("".join(lines), alphabet),
                         id = id, name = name, description = descr)

        if not line : return #StopIteration
    assert False, "Should not reach this line"
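For readers who only want the core idea without the BioPython dependency, here is a much-simplified, standard-library-only sketch of the same pattern. It is not BioPython's implementation: the name iter_fasta is hypothetical, it yields plain (title, sequence) tuples instead of SeqRecord objects, and it does no validation.

```python
import io

def iter_fasta(handle):
    """Yield (title, sequence) tuples from a FASTA handle."""
    title, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if title is not None:
                yield title, "".join(chunks)
            title, chunks = line[1:], []
        elif title is not None and line:
            # Sequence line; text before the first header is ignored.
            chunks.append(line.replace(" ", ""))
    if title is not None:
        yield title, "".join(chunks)  # emit the final record

# io.StringIO stands in for an open file handle.
demo = io.StringIO(">seq1 first\nGACG\nTTAG\n>seq2 second\nTCGC\n")
for title, seq in iter_fasta(demo):
    print(title, seq)
```

Because it is a generator, only one record is held in memory at a time, which is the same design choice the BioPython code above makes.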

Answer 4

Nice to meet another bioinformatician :)

Include an if clause inside the for loop, above the line.strip() call:

def readSeq(FASTA):
    for line in FASTA:
        if line.startswith('>'):
            continue
        line = line.strip()
        print(line)
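If the goal is to return the sequence, as the docstring in the question suggests, rather than print it, the same filter can collect the lines and join them. This is a sketch under assumptions: read_seq is a renamed, hypothetical variant of the function above, and io.StringIO stands in for the file. For a multi-record file it would concatenate every record into one string.

```python
import io

def read_seq(handle):
    """Return the DNA sequence of a FASTA handle as one string, headers skipped."""
    parts = []
    for line in handle:
        if line.startswith(">"):
            continue  # header line, not part of the sequence
        parts.append(line.strip())
    return "".join(parts)

# io.StringIO stands in for open("test.txt").
demo = io.StringIO(">demo header\nGACGGC\nTTAGAT\n")
print(read_seq(demo))
```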

Answer 5

The Pythonic and simple way to do this is slice notation.

>>> f = open('filename')
>>> lines = f.readlines()
>>> lines[1:]
['TTAGATTTTCCGACAGCGTACGGCGCGCGCTGCTGAACGTGGCCACTGAGCTTACACCTCATTTCAGCGC\n', 'TCGCTTGCTGGCGAAGCTGGCAGCAGCTTGTTAATGCTAGTGTTGGGCTCGCCGAAAGCTGGCAGGTCGA']

That says "give me all elements of lines, from the second (index 1) to the end."

Other general uses of slice notation:

s[i:j]  slice of s from i to j
s[i:j:k]    slice of s from i to j with step k (k can be negative to go backward)

Either i or j can be omitted (meaning the beginning or the end), and j can be negative to count elements from the end.

s[:-1]     All but the last element.
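To make these forms concrete, here is a small illustration on a made-up list of FASTA-like lines. The list contents are invented for the example.

```python
# A hypothetical list standing in for the result of f.readlines().
lines = [">header", "GACG", "TTAG", "TCGC"]

print(lines[1:])    # ['GACG', 'TTAG', 'TCGC']             everything after the header
print(lines[:-1])   # ['>header', 'GACG', 'TTAG']          all but the last element
print(lines[1:3])   # ['GACG', 'TTAG']                     index 1 up to, not including, 3
print(lines[::2])   # ['>header', 'TTAG']                  every second element
print(lines[::-1])  # ['TCGC', 'TTAG', 'GACG', '>header']  reversed with a negative step
```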

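Edit in response to gnibbler's comment:

If the file is truly massive, you can use iterator slicing to get the same effect while making sure the whole thing is never held in memory.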

import itertools
f = open("filename")
#start at the second line, don't stop, stride by one
for line in itertools.islice(f, 1, None, 1):
    print(line)

"islicing" doesn't have the nice syntax or extra features of regular slicing, but it is a good approach to remember.
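As a self-contained variant of the islice approach, the sketch below wraps it in a small helper and runs it on an in-memory handle. The helper name skip_first and its parameter n are hypothetical, and io.StringIO stands in for the open file.

```python
import io
import itertools

def skip_first(handle, n=1):
    """Lazily iterate over a handle, skipping its first n lines."""
    # islice(iterable, start, stop): start at index n and never stop early.
    return itertools.islice(handle, n, None)

# io.StringIO stands in for open("filename").
demo = io.StringIO(">demo header\nGACGGC\nTTAGAT\n")
for line in skip_first(demo):
    print(line.strip())
```

Nothing is read until the loop asks for the next line, so memory use stays flat regardless of file size.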

How do I start from the second line using readline()?
