Question

I have a CSV file with several million rows, and I want to start iterating from row 10,000,000. My current code reads the file from the top and skips rows until it reaches that point.

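This works, but it takes several seconds before the rows of interest come up. Presumably all the unwanted rows are being loaded into Python unnecessarily, which slows things down. Is there a way to start the iteration at a given row, i.e. without reading in the start of the data?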

Answer 1

You can use islice:

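import csv
from itertools import islice

# newline='' is what the csv module recommends for files passed to csv.reader
with open(csv_file, encoding='UTF-8', newline='') as f:
    r = csv.reader(f)
    # Skip the first 10,000,000 parsed rows, then yield the rest
    for row in islice(r, 10000000, None):
        process_row(row)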

It still iterates over all the rows, but does so much more efficiently.
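To see which rows islice actually hands back, here is a minimal sketch on a tiny in-memory file. The sample data and the rows_from helper are made up for illustration; only csv.reader and islice come from the answer.

```python
import csv
import io
from itertools import islice

# Hypothetical stand-in for a large CSV file: five short rows.
sample = io.StringIO("a,1\nb,2\nc,3\nd,4\ne,5\n")

def rows_from(f, start):
    """Return an iterator over parsed rows, skipping the first `start` rows."""
    return islice(csv.reader(f), start, None)

rows = list(rows_from(sample, 3))
print(rows)  # [['d', '4'], ['e', '5']]
```

The skipped rows are still read and parsed by csv.reader; islice just discards them without a Python-level loop.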

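You could also use the consume recipe, which calls functions that consume iterators at C speed. Call it on the file object before passing it to csv.reader, so the reader never needlessly parses the skipped lines: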

import collections
import csv
from itertools import islice

def consume(iterator, n):
    """Advance the iterator n steps ahead. If n is None, consume entirely."""
    # Use functions that consume iterators at C speed.
    if n is None:
        # feed the entire iterator into a zero-length deque
        collections.deque(iterator, maxlen=0)
    else:
        # advance to the empty slice starting at position n
        next(islice(iterator, n, n), None)


with open(csv_file, encoding='UTF-8') as f:
    # Skip the first 9,999,999 raw lines without parsing them as CSV
    consume(f, 9999999)
    r = csv.reader(f)
    for row in r:
        process_row(row)
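A small sketch of where the reader picks up after the file object has been advanced. The skip_lines helper and the sample data are hypothetical; the helper uses the same islice trick as consume for a fixed n.

```python
import csv
import io
from itertools import islice

def skip_lines(iterator, n):
    """Advance `iterator` by n items, discarding them (same idea as consume)."""
    next(islice(iterator, n, n), None)

# Hypothetical stand-in for a large CSV file.
f = io.StringIO("row1,a\nrow2,b\nrow3,c\nrow4,d\n")
skip_lines(f, 2)

# The reader only ever sees the lines that were not skipped.
remaining = list(csv.reader(f))
print(remaining)  # [['row3', 'c'], ['row4', 'd']]
```

Skipping n lines means parsing resumes at line n + 1, which is worth checking against the row you actually want to start from.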

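As Shadowranger commented, if the file can contain embedded newlines, you have to consume the reader instead and open the file with newline="". If that is not the case, consume the file object, because the performance difference will be considerable, especially if you have a lot of columns.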
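The embedded-newline caveat can be illustrated with a short sketch. The data below is invented for the example: one quoted field spans two physical lines, so skipping raw lines and skipping parsed records land in different places. With a real file you would also open it with newline="".

```python
import csv
import io
from itertools import islice

# Hypothetical data: the quoted field in the second record contains a newline,
# so 4 records occupy 5 physical lines.
data = 'id,note\n1,"first\nsecond"\n2,plain\n3,last\n'

# Skip 2 records through the reader: quoting is respected.
by_record = list(islice(csv.reader(io.StringIO(data)), 2, None))

# Skip 2 physical lines on the file object: this stops inside the quoted field.
f = io.StringIO(data)
next(islice(f, 2, 2), None)
by_line = list(csv.reader(f))

print(by_record)  # [['2', 'plain'], ['3', 'last']]
print(by_line)    # begins with the leftover tail of the split record
```

Skipping on the file object is only safe when every record is known to sit on exactly one line.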

Iterate from a certain row of a CSV file in Python
