Question

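I'm having a hard time solving this problem or finding any good hints about it. I'm trying to loop over one file, modify each line slightly, and then loop over a different file. If a line in the second file starts with the modified line from the first file, the line that follows it in the second file should be written to a third file.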

with open('ids.txt', 'rU') as f:
        with open('seqres.txt', 'rU') as g:
                for id in f:
                        id=id.lower()[0:4]+'_'+id[4]
                        with open(id + '.fasta', 'w') as h:
                                for line in g:
                                        if line.startswith('>'+ id):
                                                h.write(g.next())

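All the right files show up, but they are empty. And yes, I'm sure the if has true cases. :-)
"seqres.txt" contains lines with ID numbers in a certain format, each followed by a line of data. "ids.txt" has the ID numbers of interest in a different format. I want each line of data that has an interesting ID number to end up in its own file.

Many thanks to anyone with a bit of advice!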

Here is the minimal solution, but do look at the answers below:

with open('ids.txt', 'rU') as f:
        fl = f.readlines()
        with open('seqres.txt', 'rU') as g:
                gl = g.readlines()
                for id in fl:
                        id=id.lower()[0:4]+'_'+id[4]
                        with open(id + '.fasta', 'w') as h:
                                for line in xrange(len(gl)):
                                        if gl[line].startswith('>'+ id):
                                                h.write(gl[line+1])

~~Now, I wonder if there is a way to make it faster?~~ See Tim's and Brian's answers.

Answer 1

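Here's a mostly flattened implementation. Depending on how many hits you expect for each ID, and how many entries there are in "seqres", you could redesign this.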

# Extract the IDs in the desired format and cache them
ids = [ x.lower()[0:4]+'_'+x[4] for x in open('ids.txt','rU')]
ids = set(ids)

# Create iterator for seqres.txt file and pull the first value
iseqres = iter(open('seqres.txt','rU'))
lineA = iseqres.next()

# iterate through the rest of seqres, staggering
for lineB in iseqres:
  lineID = lineA[1:7]
  if lineID in ids:
    with open("%s.fasta" % lineID, 'a') as h:
      h.write(lineB)
  lineA = lineB

Answer 2

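The problem is that you only loop through file g once. After you've read through it the first time, the file position is left at the end of the file, so any further reads fail with EOF. You would need to reopen g each time round the loop.

However, that would be very inefficient: you would be re-reading the same file once for every line in f. It will be orders of magnitude faster to read all of g into an array at the start and work with that, as long as it fits in memory.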
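A small sketch of the behaviour described above. It assumes Python 3 and uses an in-memory io.StringIO with made-up contents as a stand-in for seqres.txt, so it runs without any files on disk:

```python
import io

# Hypothetical stand-in for seqres.txt: header lines followed by data lines.
g = io.StringIO(">1abc_A\nSEQ1\n>2xyz_B\nSEQ2\n")

first_pass = [line for line in g]    # consumes the whole stream
second_pass = [line for line in g]   # already at EOF, so nothing is yielded

g.seek(0)                            # rewinding makes the lines available again
after_seek = [line for line in g]

# Reading once into a list lets you loop over the lines as often as needed.
g.seek(0)
cached = g.readlines()
passes = [len(list(cached)) for _ in range(3)]
```

The second pass comes back empty for the same reason the .fasta files in the question stay empty after the first id.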

Answer 3

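For speed, you really want to avoid looping over the same file multiple times. Doing so means you have ended up with an O(N * M) algorithm when you could be using an O(N + M) one.

To achieve this, read your list of ids into a fast lookup structure such as a set. Since there are only 4600 of them, holding them in memory shouldn't be any problem.

The new solution also reads the list into memory. That is probably not a huge problem with just a few hundred thousand lines, but it wastes more memory than you need, since you can do the whole thing in a single pass, reading only the smaller ids.txt file into memory. You can simply set a flag when the previous line was an interesting one, which signals that the next line should be written out.

Here's a reworked version: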

with open('ids.txt', 'rU') as f:
    # Collect all ids in a set, prefixed with '>' to match the seqres.txt header lines.
    interesting_ids = set('>' + line.lower()[0:4] + "_" + line[4] for line in f)

found_id = None
with open('seqres.txt', 'rU') as g:
    for line in g:
        if found_id is not None:
            # Drop the leading '>' so the file is named <id>.fasta, as in the question.
            with open(found_id[1:] + '.fasta', 'w') as h:
                h.write(line)

        id = line[:7]
        if id in interesting_ids:
            found_id = id
        else:
            found_id = None
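For readers on Python 3, the single-pass flag idea can be sketched as below. This is only an illustration under assumptions: the function name collect_sequences and the sample data are hypothetical, and the results are collected in a dict instead of being written to files so that the logic is easy to check.

```python
import io
from collections import defaultdict

def collect_sequences(id_lines, seqres_lines):
    """Map each interesting ID to the data lines that follow its header."""
    # Same key format as in the question: '1ABCA' becomes '1abc_A'.
    wanted = {line.lower()[0:4] + '_' + line[4]
              for line in id_lines if len(line) > 4}
    hits = defaultdict(list)
    found_id = None
    for line in seqres_lines:
        if found_id is not None:
            # The previous line was an interesting header: keep this line.
            hits[found_id].append(line)
        # Header lines look like '>1abc_A ...'; characters 1..6 hold the ID.
        key = line[1:7] if line.startswith('>') else None
        found_id = key if key in wanted else None
    return dict(hits)

# Made-up sample data standing in for ids.txt and seqres.txt.
ids = io.StringIO("1ABCA\n2XYZB\n")
seqres = io.StringIO(">1abc_A mol:protein\nMKV\n"
                     ">9zzz_C mol:protein\nGGG\n"
                     ">2xyz_B mol:protein\nTTT\n")
result = collect_sequences(ids, seqres)
```

Collecting a list per ID also keeps every hit when an ID appears more than once; each entry can then be written to its own .fasta file in a short loop.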

Answer 4

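I think there is still progress to be made on the code you posted as your final version. You can make the result a bit less nested and avoid a couple of silly things.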

from contextlib import nested
from itertools import tee, izip

# Stole pairwise recipe from the itertools documentation
def pairwise(iterable):
    "s -> (s0,s1), (s1,s2), (s2, s3), ..."
    a, b = tee(iterable)
    next(b, None)
    return izip(a, b)

with nested(open('ids.txt', 'rU'), open('seqres.txt', 'rU')) as (f, g):
    for id in f:
        id = id.lower()[0:4] + '_' + id[4]
        with open(id + '.fasta', 'w') as h:
            g.seek(0) # start at the beginning of g each time
            for line, next_line in pairwise(g):
                if line.startswith('>' + id):
                    h.write(next_line)

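This is an improvement over the final code you posted in that:

- It does not needlessly read whole files into memory; it simply iterates over the file objects. (This may or may not be the best choice for g, really. It certainly scales better.)
- It does not have the crash condition of gl[line+1] when we are already on the last line of gl.
  - Depending on what g actually looks like, there may be something more applicable than pairwise.
- It isn't as deeply nested.
- It conforms to PEP8 for things like spaces around operators and indentation depth.

This algorithm is O(n * m), where n and m are the number of lines in f and g. If the length of f is unbounded, you can use a set of its ids to reduce the algorithm to O(n) (linear in the number of lines in g).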
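The set-based variant mentioned in the last paragraph might look roughly like this in Python 3, where zip replaces izip. The generator name matching_pairs and the sample data are assumptions made for illustration only:

```python
import io
from itertools import tee

def pairwise(iterable):
    "s -> (s0, s1), (s1, s2), (s2, s3), ..."
    a, b = tee(iterable)
    next(b, None)
    return zip(a, b)

def matching_pairs(id_lines, seqres_lines):
    """Yield (id, data_line) for every header whose ID is in the wanted set."""
    # Build the lookup set once, prefixed with '>' like the header lines.
    wanted = {'>' + line.lower()[0:4] + '_' + line[4]
              for line in id_lines if len(line) > 4}
    # One pass over seqres; each set lookup takes constant time on average.
    for header, data in pairwise(seqres_lines):
        if header[:7] in wanted:
            yield header[1:7], data

# Made-up sample data standing in for ids.txt and seqres.txt.
ids = io.StringIO("1ABCA\n2XYZB\n")
seqres = io.StringIO(">1abc_A\nMKV\n>9zzz_C\nGGG\n>2xyz_B\nTTT\n")
pairs = list(matching_pairs(ids, seqres))
```

Here g is read exactly once, so no seek(0) is needed and the work no longer grows with the number of ids.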

Answer 5

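After the first line of ids.txt has been processed, the file seqres.txt is exhausted. There is something wrong with the nesting of your loops. Also, you are modifying the iterator inside the for line in g loop. Not a good idea.

If you really want to append the line that follows the line whose ID matches, then something like this might be better: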

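with open('ids.txt', 'rU') as f:
    ids = f.readlines()
with open('seqres.txt', 'rU') as g:
    seqres = g.readlines()

for id in ids:
    id = id.lower()[0:4] + '_' + id[4]
    with open(id + '.fasta', 'a') as h:
        # Walk the cached lines by index so the following line can be looked up;
        # the last line is skipped because nothing follows it.
        for i, line in enumerate(seqres[:-1]):
            if line.startswith('>' + id):
                h.write(seqres[i + 1])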
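To make the remark about advancing an iterator inside its own for loop concrete, here is a short Python 3 sketch with made-up lines. It shows that a line taken with next() inside the loop body is never seen by the loop itself, and that a plain list is not an iterator in the first place:

```python
lines = [">1abc_A\n", "MKV\n", ">2xyz_B\n", "TTT\n"]

# A list is iterable but not an iterator, so it has no next method of its own.
list_is_iterator = hasattr(lines, '__next__')

it = iter(lines)
seen_by_loop = []
taken_inside = []
for line in it:
    seen_by_loop.append(line)
    if line.startswith('>'):
        # Advances the same iterator the for loop is using,
        # so the loop skips over the line taken here.
        taken_inside.append(next(it, None))
```

This can work when done deliberately on an explicit iterator, but it is easy to misread, which is why looking up the following line by index or by pairing lines tends to be clearer.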

Loop over a file and write the next line if a condition is met
