Question

I have more than 20 million lines of data, each containing 60-200 int elements. My current approach:

with open("file.txt") as f:
    for block in reading_file(f):
        for line in block:                         
            a = line.split(" ")
            op_on_data(a)

Here reading_file() is a function that reads roughly 1000 lines at a time, and op_on_data() is a function in which I perform some basic operations:

def op_on_data(a):
    if a[0] == "keyw":
        print 'keyw: ', a[0], a[1]
    else:
        # some_operations on arr[]
        for v in arr[splicing_here]:              
           if v > 100:
               # more_operations here
               two_d_list(particular_list_location).append(v)
        for e in arr[splicing_here]:
           if e < -100:
               two_d_list_2(particular_list_location).append(e)
    sys.stdout.flush()

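At the end, I save two_d_list to a pandas DataFrame in a single step; I do not save in chunks. For a test dataset of about 40,000 lines, my initial time was ~10.5 s. But when I process the whole dataset, my system crashes after a few million lines, probably because the lists grow too large.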

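I need to know the best way to save the data after performing these operations. Should I keep using lists, or write directly to a CSV file inside the function, line by line? How can I improve the speed and keep the system from crashing?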

Edit: I am also open to options other than lists and CSV.

Answer 1

I would try to make the code more efficient and generator-based. I see too many for loops for no good reason.

If you are iterating over every line anyway, start with this:

# A file object is a lazy line-by-line iterator, so a custom read function is unnecessary.
with open("file.txt") as f:
    for line in f:
        a = line.split(" ")
        op_on_data(a)
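
As a minimal sketch of the same idea in Python 3 (the question's code is Python 2), the parsing step can itself be a generator, so only one line is held in memory at a time. The helper name `parse_lines` and the sample data are hypothetical, and the sketch assumes every field on a non-keyword line is an integer.

```python
import io

def parse_lines(lines):
    """Lazily yield one list of ints per data line (hypothetical helper)."""
    for line in lines:
        fields = line.split()
        if not fields or fields[0] == "keyw":
            continue  # this sketch simply skips blank and keyword lines
        yield [int(x) for x in fields]

# io.StringIO stands in for open("file.txt") so the sketch is self-contained.
sample = io.StringIO("keyw header\n1 250 -300\n7 8 9\n")
rows = list(parse_lines(sample))
print(rows)
```

On the real file you would loop over `parse_lines(f)` directly instead of calling `list()`, which is only used here to show the result.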

For the second snippet, here are some more code-review comments on the code below:

def op_on_data(a):
    if a[0] == "keyw":
        print 'keyw: ', a[0], a[1]    # Avoid printing while iterating over millions of lines!
    else:
        # some_operations on arr[]
        for v in arr[splicing_here]:  
           if v > 100:
               # more_operations here
               two_d_list(particular_list_location).append(v)
        for e in arr[splicing_here]:
           if e < -100:
               two_d_list_2(particular_list_location).append(e)
    sys.stdout.flush()

Code comments:

Don't iterate over large arrays with bare for loops over full copies; prefer generators/iterators, for example:

    my_iter_arr = iter(arr)  # yields each element once; itertools.cycle would repeat forever
    for v in my_iter_arr:
        # do something

Try to merge the 2 for loops into 1. I don't see any reason to use 2 for loops; try:

for v in my_iter_arr:
    if v > 100:
        two_d_list(particular_list_location).append(v)
    elif v < -100:
        two_d_list_2(particular_list_location).append(v)
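
A runnable Python 3 sketch of the single-pass version, assuming both original loops scan the same slice (which the merged loop implies). The function name `split_extremes` is hypothetical; the thresholds 100 and -100 come from the question.

```python
def split_extremes(values, high=100, low=-100):
    """Return (above, below): values > high and values < low, in one pass."""
    above, below = [], []
    for v in values:
        if v > high:
            above.append(v)
        elif v < low:
            below.append(v)
    return above, below

above, below = split_extremes([5, 150, -200, 100, -100, 101])
print(above, below)
```

Values exactly equal to a threshold land in neither list, matching the strict comparisons in the original code.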

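The worst part is appending millions of elements to a list held in RAM. Avoid two_d_list(particular_list_location).append(v); I'm not sure how efficient two_d_list is! Instead, check when the list reaches a total of X elements, dump those elements to a file, and carry on appending to a clean list.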
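
One way to sketch this "dump every X elements" idea in Python 3 uses the standard csv module. The function name `write_in_batches`, the batch size, and the output path are all assumptions for illustration, not part of the original code.

```python
import csv
import os
import tempfile

def write_in_batches(rows, path, batch_size=10000):
    """Write rows to a CSV file, flushing the in-memory buffer every batch_size rows."""
    buffer = []
    written = 0
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        for row in rows:
            buffer.append(row)
            if len(buffer) >= batch_size:
                writer.writerows(buffer)
                written += len(buffer)
                buffer = []  # continue with a clean list
        if buffer:  # flush whatever is left at the end
            writer.writerows(buffer)
            written += len(buffer)
    return written

# Demo with a temporary file and a generator of 25 small rows.
path = os.path.join(tempfile.mkdtemp(), "out.csv")
count = write_in_batches(([i, i * 2] for i in range(25)), path, batch_size=10)
print(count)
```

Because `rows` can be a generator, memory use is bounded by `batch_size` rows rather than by the size of the whole dataset.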

Try reading up on Python generators / lazy iteration.
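
A tiny Python 3 illustration of what lazy iteration means: the generator below describes an endless sequence, yet only the values actually requested are ever computed. The name `squares` is purely illustrative.

```python
from itertools import islice

def squares():
    """Yield 0, 1, 4, 9, ... forever; each value is computed only on demand."""
    n = 0
    while True:
        yield n * n
        n += 1

# islice stops after five items, so the infinite generator is never exhausted.
first_five = list(islice(squares(), 5))
print(first_five)
```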
