Question

I am trying to load a 4.47GB CSV file into a memory-mapped NumPy array. On a GCP machine with 85GB of RAM, it takes roughly 20 minutes. Doing so takes 500 seconds and produces a 1.03GB array.

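The problem is that while the file is being loaded into the array, the process consumes up to 26GB of RAM. Is there a way to modify the following code so that it uses less RAM (and, if possible, less time) during loading?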

import tempfile, numpy as np

def create_memmap_ndarray_from_csv(csv_file): # load int8 csv file to int8 memory-mapped numpy array

    with open(csv_file, "r") as f:
        rows = len(f.readlines())
    with open(csv_file, "r") as f:
        cols = len(f.readline().split(','))

    memmap_file = tempfile.NamedTemporaryFile(prefix='ndarray', suffix='.memmap')
    arr_int8_mm = np.memmap(memmap_file, dtype=np.int8, mode='w+', shape=(rows,cols))

    arr_int8_mm = np.loadtxt(csv_file, dtype=np.int8, delimiter=',')
    return arr_int8_mm

Answer 1

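I have modified the code based on the comments on the original question. The updated code uses less memory: 8GB instead of 26GB. Approaches based on loadtxt, readline, and split reduce memory use further, but they are too slow.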

import tempfile, numpy as np, pandas as pd

def create_ndarray_from_csv(csv_file): # load csv file to int8 normal/memmap ndarray

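    # parse the whole CSV as int8; header=None treats the first line as data
    df_int8 = pd.read_csv(csv_file, dtype=np.int8, header=None)
    arr_int8 = df_int8.values
    del df_int8

    # write the array to a temporary .npy file
    memmap_file = tempfile.NamedTemporaryFile(prefix='ndarray-memmap', suffix='.npy')
    np.save(memmap_file.name, arr_int8)
    del arr_int8

    # reopen the saved file as a read-only memory map
    arr_mm_int8 = np.load(memmap_file.name, mmap_mode='r')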
    return arr_mm_int8
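
For comparison, here is a minimal sketch of a chunked variant that writes straight into an np.memmap instead of building the full DataFrame first. It has not been benchmarked on the file from the question, so treat it as an illustration rather than a measured improvement. The function name csv_to_memmap_chunked, the memmap_path argument, and the default chunksize are hypothetical choices for this example. It assumes the CSV has no header row and that every value fits in int8. Peak memory should then depend mainly on chunksize rather than on the size of the file.

```python
import numpy as np
import pandas as pd

def csv_to_memmap_chunked(csv_file, memmap_path, chunksize=100_000):
    # hypothetical helper: stream an int8 CSV into an int8 np.memmap

    # first pass: count columns and non-blank rows without keeping lines in memory
    with open(csv_file, "r") as f:
        cols = len(f.readline().split(','))
        f.seek(0)
        rows = sum(1 for line in f if line.strip())

    # allocate the on-disk array once, with its final shape
    arr = np.memmap(memmap_path, dtype=np.int8, mode='w+', shape=(rows, cols))

    # second pass: parse a block of rows at a time and copy it into place
    start = 0
    for chunk in pd.read_csv(csv_file, dtype=np.int8, header=None, chunksize=chunksize):
        stop = start + len(chunk)
        arr[start:stop] = chunk.to_numpy()
        start = stop

    arr.flush()  # push pending writes to the backing file
    return arr
```

The trade-off is a second pass over the file to count rows, plus a memmap file whose path the caller has to manage. A smaller chunksize lowers memory use at the cost of more parsing overhead per row.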

Loading a CSV file into a NumPy memmap array uses too much memory
