Question

I need to process a large text file (4 GB). The data looks like this:

12 23 34
22 78 98
76 56 77

I need to read each line and do some work based on it. Currently I am doing:

sample = 'filename.txt'

with open(sample) as f:
    for line in f:
        line = line.split()
        line = [int(i) for i in line]
        a = line[0]
        b = line[1]
        c = line[2]
        do_someprocess()

This takes a long time to run. Is there a better way to do this in Python?

Answer 1

If do_someprocess() takes a long time compared with reading the lines, and you have spare CPU cores, you could use the multiprocessing module.
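
A minimal sketch of that idea, assuming do_someprocess takes the three integers and returns a result (the question does not show its signature, so the body below is a hypothetical stand-in). The helper names process_chunk and read_chunks are made up for illustration. Lines are sent to the workers in batches so that the cost of passing data between processes is spread over many lines:

```python
import multiprocessing as mp
import sys
from itertools import islice


def do_someprocess(a, b, c):
    # Hypothetical stand-in for the real per-line work.
    return a + b + c


def process_chunk(lines):
    # Runs in a worker process: parse and process one batch of lines.
    results = []
    for line in lines:
        a, b, c = map(int, line.split())
        results.append(do_someprocess(a, b, c))
    return results


def read_chunks(path, chunk_size=100_000):
    # Yield lists of up to chunk_size lines, read lazily from the file.
    with open(path) as f:
        while True:
            chunk = list(islice(f, chunk_size))
            if not chunk:
                return
            yield chunk


def main(path):
    total = 0
    with mp.Pool() as pool:  # by default, one worker per available CPU
        # imap returns the per-chunk results in the original order.
        for results in pool.imap(process_chunk, read_chunks(path)):
            total += sum(results)
    print(total)


if __name__ == "__main__":
    # Usage: python process_file.py filename.txt
    if len(sys.argv) > 1:
        main(sys.argv[1])
```

Whether this helps depends on how expensive do_someprocess really is. If the per-line work is cheap, the cost of shipping lines to the workers can outweigh the gain.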

Try PyPy if possible. For some compute-intensive tasks it can be tens of times faster than CPython.

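If the file contains many repeated values, mapping them through a dict can be faster than calling int() each time, because it saves the time spent creating new int objects.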
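
One way this could look, as a sketch: the name parse_line and the cache layout are illustrative, not taken from the answer. Each distinct token is converted once and looked up afterwards:

```python
def parse_line(line, cache):
    """Convert the tokens of one line to ints, reusing earlier conversions."""
    values = []
    for token in line.split():
        value = cache.get(token)
        if value is None:
            # First time this token is seen: convert it and remember the result.
            value = cache[token] = int(token)
        values.append(value)
    return values


cache = {}
a, b, c = parse_line("12 23 34", cache)
print(a, b, c)
```

The cache grows with the number of distinct tokens, so this only pays off when values repeat often. It is worth timing on a sample of the real file before committing to it.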

The first step is to do what @nathancahill suggested in the comments. Then focus your effort on the part where the most can be gained.
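
Whatever the comment proposed, one common way to find the part with the largest potential gain is to profile a run on a small sample. The sketch below uses the standard cProfile and pstats modules; do_someprocess, process_lines, and the sample data are placeholders, not the OP's code:

```python
import cProfile
import io
import pstats


def do_someprocess(a, b, c):
    # Placeholder for the real per-line work.
    return a * b + c


def process_lines(lines):
    total = 0
    for line in lines:
        a, b, c = [int(s) for s in line.split()]
        total += do_someprocess(a, b, c)
    return total


# A small in-memory sample standing in for the first lines of the real file.
sample_lines = ["12 23 34\n", "22 78 98\n", "76 56 77\n"] * 10_000

profiler = cProfile.Profile()
profiler.enable()
total = process_lines(sample_lines)
profiler.disable()

# Print the ten entries with the highest cumulative time.
report = io.StringIO()
pstats.Stats(profiler, stream=report).sort_stats("cumulative").print_stats(10)
print(report.getvalue())
```

If most of the time is attributed to do_someprocess, faster parsing will not help much, and the reverse also holds.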

Answer 2

split() returns a list. You then access its first, second, and third elements with:

line = [int(i) for i in line]
a = line[0]
b = line[1]
c = line[2]

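You can write a,b,c = line.split() directly; a will then hold line[0], b will hold line[1], and c will hold line[2]. This should save you some time. Note that a, b, and c are strings in this form, so convert them with int() if you need numbers.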

with open(sample) as f:
    for line in f:
        a,b,c = line.split()
        do_someprocess()

An example:

with open("sample.txt","r") as f:
    for line in f:
        a,b,c = line.split()
        print(a, b, c)

Contents of the .txt file:

12 34 45
78 67 45

Output:

12 34 45
78 67 45

Edit: I thought I would elaborate. I used the timeit module to compare how long each version takes to run. Please let me know if I did something wrong here. Below is the OP's way of writing the code.

import timeit

v = """with open("sample.txt","r") as f:
    for line in f:
        line = line.split()
        line = [int(i) for i in line]
        a = line[0]
        b = line[1]
        c = line[2]"""
print(timeit.timeit(stmt=v, number=100000))

Output:

8.94879606286   ## seconds to complete 100000 times.

Below is the way I would write the code.

import timeit

s = """with open("sample.txt","r") as f:
    for line in f:
        a,b,c = [int(s) for s in line.split()]"""
print(timeit.timeit(stmt=s, number=100000))

Output:

7.60287380216 ## seconds to complete same number of times.
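
Both measurements above reopen sample.txt on every run, so the cost of opening the file is part of each figure. A possible follow-up, offered as a sketch and not a definitive benchmark, is to time only the parsing on lines held in memory. The function names and the synthetic lines below are made up for illustration:

```python
import timeit

# Synthetic input standing in for lines already read from the file.
lines = ["12 34 45\n", "78 67 45\n"] * 1000


def parse_indexed(lines):
    # The OP's approach: build the list, then index into it.
    for line in lines:
        parts = line.split()
        parts = [int(i) for i in parts]
        a = parts[0]
        b = parts[1]
        c = parts[2]
    return a, b, c


def parse_unpacked(lines):
    # The approach from this answer: unpack the converted values directly.
    for line in lines:
        a, b, c = [int(s) for s in line.split()]
    return a, b, c


for func in (parse_indexed, parse_unpacked):
    elapsed = timeit.timeit(lambda: func(lines), number=100)
    print(f"{func.__name__}: {elapsed:.3f} s")
```

The absolute numbers will vary by machine and Python version; only the comparison between the two functions on the same machine is meaningful.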

Loading a large text file in Python
