Question

The huge file looks like this.

@delimiter...xxxxxxx     1st line
atgccccccccccccccc...    2nd line
+                        3rd line
agtrc!%%^*()_+!...       4th line

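This four-line pattern repeats throughout the file. The delimiter may appear in the 1st line of a group. If it does, I want to write that group of 4 lines to the output.

Here is my code.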

# `delimiter` is assumed to be defined earlier
with open("hugefile") as fin, open("hugefile_out", "w") as fout:
    for line in fin:
        if delimiter in line:
            # names cannot start with a digit, so 1st_line becomes line1, etc.
            line1 = line
            line2 = fin.next()
            line3 = fin.next()
            line4 = fin.next()
            fout.write(line1 + line2 + line3 + line4)

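It usually takes 4 to 5 hours to finish this job. (I removed one function.) Is there a way to make it faster? (I am using pypy.) The input file is 1~100 GB, so the repeated code seems unnecessary.

Maybe something like this?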

           fout.write(line + fin.next() + fin.next() + fin.next())

Thanks for your help!

Answer 1

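I suggest the following approach:

Use a flag to indicate that you have seen the delimiter and are currently outputting lines
Use an index to track how many lines you have output
Once the index is > 4, stop outputting lines and reset the flag to false (or, if you only want to find one set, you can break out of the loop entirely)

So the code would look something like this: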

# `delimiter` is assumed to be defined earlier
sawDelim = False
idx = 1
# the output file must be opened in write mode
with open("hugefile") as fin, open("hugefile_out", "w") as fout:
    for line in fin:
        if delimiter in line:
            sawDelim = True

        if sawDelim:
            fout.write(line)
            idx += 1

        # now that we've printed out 4 lines, reset and keep looking
        # (or could also bail if you want to only find one set)
        if idx > 4:
            idx = 1
            sawDelim = False
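
As a possible variation, here is a minimal sketch that reads the file one 4-line group at a time and tests only the first line of each group. It assumes the file consists strictly of consecutive 4-line groups, as the question describes. The function name `copy_matching_records` and its parameters are made up for illustration and are not part of the original code. I have not benchmarked it, so whether it is any faster on your data would need to be measured.

```python
from itertools import islice


def copy_matching_records(src_path, dst_path, delimiter, lines_per_record=4):
    """Copy every record whose first line contains `delimiter`.

    Assumption: the input is made of consecutive records of
    `lines_per_record` lines each. Returns the number of records written.
    """
    written = 0
    with open(src_path) as fin, open(dst_path, "w") as fout:
        while True:
            # Pull the next group of lines; an empty list means end of file.
            record = list(islice(fin, lines_per_record))
            if not record:
                break
            # Only the first line of the group is checked for the delimiter.
            if delimiter in record[0]:
                fout.writelines(record)
                written += 1
    return written
```

One difference from the flag-based loop: because only the first line of each group is tested, a delimiter string that happens to occur in lines 2 to 4 does not start a new output. A truncated final group (fewer than 4 lines) is still written if its first line matches.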

Python 2.7: writing 4 lines from a huge file
