Question

I have a "not so large" file (~2.2GB) that I am trying to read and process...

import os
from collections import defaultdict

graph = defaultdict(dict)
error = open("error.txt","w")
print "Reading file"
with open("final_edge_list.txt","r") as f:
    for line in f:
        try:
            line = line.rstrip(os.linesep)
            tokens = line.split("\t")
            if len(tokens)==3:
                src = long(tokens[0])
                destination = long(tokens[1])
                weight = float(tokens[2])
                #tup1 = (destination,weight)
                #tup2 = (src,weight)
                graph[src][destination] = weight
                graph[destination][src] = weight
            else:
                print "error ", line 
                error.write(line+"\n")
        except Exception, e:
            string = str(Exception) + " " + str(e) +"==> "+ line +"\n"
            error.write(string)
            continue

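Am I doing something wrong?

It has been about an hour since the code started reading the file (and it is still reading...).

Memory usage has already reached 20GB. Why is it taking so much time and memory?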

Answer 1

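To get a rough idea of where the memory is going, you can use the gc.get_objects function. Wrap the code above in a make_graph() function (which is best practice anyway), then wrap the call to that function in a KeyboardInterrupt exception handler that writes the gc data to a file.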

def main():
    try:
        make_graph()
    except KeyboardInterrupt:
        write_gc()

def write_gc():
    from os.path import exists
    fname = 'gc.log.%i'
    i = 0
    while exists(fname % i):
        i += 1
    fname = fname % i
    with open(fname, 'w') as f:
        from pprint import pformat
        from gc import get_objects
        f.write(pformat(get_objects()))


if __name__ == '__main__':
    main()

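Now whenever you press Ctrl+C while the program is running, you will get a new gc.log file. Given a few samples, you should be able to see where the memory problem is.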
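A full pformat dump of every object can get very large, so a per-type count may be easier to read. The following is only a sketch of that idea, written for Python 3; summarize_gc is a hypothetical helper name, not part of the gc module. Keep in mind that gc.get_objects only returns objects tracked by the collector (containers such as dicts, lists and tuples), so plain int and float objects will not be listed directly.

```python
import gc
from collections import Counter


def summarize_gc(top=10):
    """Return the `top` most common type names among gc-tracked objects.

    Hypothetical helper: only container objects tracked by the collector
    are counted, so ints and floats do not appear in the result.
    """
    counts = Counter(type(obj).__name__ for obj in gc.get_objects())
    return counts.most_common(top)


if __name__ == "__main__":
    # Print one line per type: name, then number of live tracked objects.
    for type_name, count in summarize_gc():
        print("%-25s %d" % (type_name, count))
```

Calling summarize_gc() from the KeyboardInterrupt handler instead of write_gc() would give a short table per interrupt that can be compared across samples.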

Answer 2

Compared with other programming languages, Python's numeric types use quite a lot of memory. On my setup, each number appears to take 24 bytes:

>>> import sys
>>> sys.getsizeof(int())
24
>>> sys.getsizeof(float())
24

Given that you have hundreds of millions of lines in your 2.2 GB input file, the reported memory consumption is not unexpected.
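To make that concrete, here is a rough back-of-the-envelope sketch. It assumes each edge keeps two integer objects and one float object alive, and it deliberately ignores the dict overhead, so the real figure will be higher. The function names and the 100M edge count are illustrative assumptions, and the object sizes depend on your interpreter and platform.

```python
import sys


def bytes_per_edge():
    """Lower bound on the object memory held per edge, ignoring dict overhead."""
    int_size = sys.getsizeof(10 ** 10)   # one node id object
    float_size = sys.getsizeof(1.5)      # one weight object
    # Two node id objects plus one weight object per edge (assumption).
    return 2 * int_size + float_size


def lower_bound_bytes(num_edges):
    """Scale the per-edge lower bound to a whole edge list."""
    return num_edges * bytes_per_edge()


if __name__ == "__main__":
    edges = 100 * 1000 * 1000  # assumed edge count, for illustration only
    print("at least %.1f GB" % (lower_bound_bytes(edges) / 1e9))
```

The dict slots and the per-node inner dicts come on top of this, which is why the observed usage can be several times larger than the file itself.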

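One more thing: some versions of the Python interpreter (including CPython 2.6) are known for keeping so-called free lists for allocation performance, especially for objects of type int and float. Once allocated, this memory is not returned to the operating system until your process terminates. Also have a look at this question I posted when I first ran into this issue: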

Python: garbage collection fails?

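Suggestions to work around this include:

Use a subprocess for the memory-hungry computation, e.g. based on the multiprocessing module
Use a library that implements the functionality in C, e.g. numpy, pandas
Use another interpreter, e.g. PyPy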
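As an illustration of the second suggestion, the sketch below keeps the edge list in a numpy structured array instead of nested dicts, so each edge takes a fixed 24 bytes. It assumes numpy is installed and is written for Python 3; EDGE_DTYPE and load_edges are hypothetical names, and malformed numeric fields are not handled here.

```python
import numpy as np

# Assumed layout: two 64-bit node ids and one 64-bit weight per edge.
EDGE_DTYPE = np.dtype([("src", np.int64), ("dst", np.int64), ("weight", np.float64)])


def load_edges(lines, chunk_size=100000):
    """Parse tab-separated 'src dst weight' lines into one structured array.

    Rows are buffered in a Python list and converted chunk by chunk, so only
    `chunk_size` rows exist as Python objects at any time.
    """
    chunks = []
    rows = []
    for line in lines:
        tokens = line.rstrip("\r\n").split("\t")
        if len(tokens) != 3:
            continue  # skip lines that do not have exactly three fields
        rows.append((int(tokens[0]), int(tokens[1]), float(tokens[2])))
        if len(rows) >= chunk_size:
            chunks.append(np.array(rows, dtype=EDGE_DTYPE))
            rows = []
    if rows:
        chunks.append(np.array(rows, dtype=EDGE_DTYPE))
    if not chunks:
        return np.empty(0, dtype=EDGE_DTYPE)
    return np.concatenate(chunks)
```

Passing an open file object as lines works because files iterate line by line. The trade-off is that neighbour lookups are no longer a dict access, so this fits best when the later processing can work on whole columns such as edges["src"].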

Answer 3

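There are a few things you can do:

Run your code on a subset of the data. Measure the time it takes. Extrapolate to the full size of your data. This gives you an estimate of how long it will run.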

counter = 0
with open("final_edge_list.txt", "r") as f:
    for line in f:
        counter += 1
        if counter == 200000:
            break
        try:
            ...

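On 1M lines it runs in ~8 seconds on my machine, so for a 2.2GB file with about 100M lines it should take roughly 15 minutes. However, once you exceed your available memory, that estimate no longer holds.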
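The extrapolation is plain linear scaling. A minimal sketch of the arithmetic, using the figures quoted above (extrapolate_seconds is a hypothetical helper name):

```python
def extrapolate_seconds(sample_seconds, sample_lines, total_lines):
    """Scale a measured run time linearly to the full input size."""
    return sample_seconds * (float(total_lines) / sample_lines)


# Figures from the text: ~8 seconds for 1M lines, about 100M lines in total.
estimate = extrapolate_seconds(8.0, 1000000, 100000000)
print("%.1f minutes" % (estimate / 60.0))  # prints "13.3 minutes"
```

Linear scaling is only an assumption. It holds while everything fits in RAM and breaks down once the machine starts swapping.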
Your graph appears to be symmetric:
```
graph[src][destination] = weight
graph[destination][src] = weight
```
Exploit the symmetry of graph in your graph-processing code to cut memory usage in half.
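One way to exploit the symmetry, sketched here under the assumption that node ids are comparable integers, is to store each edge only under its smaller endpoint and normalise the order on lookup. add_edge and get_weight are hypothetical helper names.

```python
from collections import defaultdict


def add_edge(graph, src, dst, weight):
    """Store the undirected edge once, under the smaller node id."""
    lo, hi = (src, dst) if src <= dst else (dst, src)
    graph[lo][hi] = weight


def get_weight(graph, a, b):
    """Look up an edge regardless of the order the endpoints are given in."""
    lo, hi = (a, b) if a <= b else (b, a)
    return graph[lo][hi]


graph = defaultdict(dict)
add_edge(graph, 5, 2, 0.7)
print(get_weight(graph, 2, 5) == get_weight(graph, 5, 2))
```

The cost is that listing all neighbours of a node is no longer a single dict access, since the larger endpoint of each edge has no entry of its own. Whether that matters depends on what the processing step needs.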
Run a profiler on your code with a subset of the data and see what is going on there. The simplest way is to run
```
python -m cProfile --sort cumulative youprogram.py
```
There is a good article on speed and memory profilers: http://www.huyng.com/posts/python-performance-analysis/
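If you would rather profile a single function than the whole script, the standard cProfile and pstats modules can also be driven from code. This is a sketch for Python 3; profile_call is a hypothetical wrapper, not part of the standard library.

```python
import cProfile
import io
import pstats


def profile_call(func, *args, **kwargs):
    """Run func under cProfile and return (result, text report).

    The report lists the ten entries with the highest cumulative time,
    which matches the `--sort cumulative` ordering of the command line.
    """
    profiler = cProfile.Profile()
    profiler.enable()
    try:
        result = func(*args, **kwargs)
    finally:
        profiler.disable()
    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream).sort_stats("cumulative")
    stats.print_stats(10)
    return result, stream.getvalue()


if __name__ == "__main__":
    total, report = profile_call(sum, range(1000))
    print(report)
```

Wrapping the graph-building function this way, on a subset of the file, shows whether the time goes into parsing, dict insertion or error handling.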

Answer 4

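You do not need graph to be a defaultdict(dict); use a plain dict instead. graph[src, destination] = weight and graph[destination, src] = weight will do, or just one of them.
To reduce memory usage, try storing the resulting dataset in a scipy.sparse matrix; it consumes less memory and may be compressed.
What do you plan to do with the node list afterwards?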
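For the scipy.sparse suggestion, a minimal sketch could look like the following. It assumes numpy and scipy are installed, that the input has no self-loops or duplicate edges, and that node ids need to be mapped to contiguous row indices first. build_sparse_graph is a hypothetical name.

```python
import numpy as np
from scipy.sparse import coo_matrix


def build_sparse_graph(edges):
    """Build a symmetric CSR adjacency matrix from (src, dst, weight) triples.

    Returns the matrix and the dict mapping original node ids to row indices.
    Assumes no self-loops and no duplicate edges (COO input sums duplicates).
    """
    index = {}
    rows, cols, data = [], [], []
    for src, dst, weight in edges:
        i = index.setdefault(src, len(index))  # assign the next free index
        j = index.setdefault(dst, len(index))
        rows.append(i)
        cols.append(j)
        data.append(weight)
    n = len(index)
    upper = coo_matrix((data, (rows, cols)), shape=(n, n), dtype=np.float64)
    # Adding the transpose fills in the mirrored entries of each edge.
    return (upper + upper.T).tocsr(), index


matrix, index = build_sparse_graph([(10, 20, 0.5), (20, 30, 1.5)])
print(matrix[index[10], index[20]], matrix[index[20], index[10]])
```

For the full file, the intermediate Python lists would themselves be large, so filling preallocated numpy arrays in chunks is probably the better route. The sketch only shows the shape of the final data structure.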

Reading a large file in Python
