pd.iterrows() consumes all memory and fails (Process finished with exit code 137 (interrupted by signal 9: SIGKILL))

Date: 2021-04-25 19:02:24

Tags: python pandas numpy

I have a csv file with more than 750,000 rows and 2 columns (SN, state).
SN is a serial number from 0 to ~750000, and state is either 0 or 1.
I read the csv file with Pandas, then load the .npy file named after each SN and append it to one of two lists, x_train and x_val.
x_val should hold 2000 elements, of which 700 should have state = 1 and the rest state = 0; x_train should take everything else.
The problem is that after reading about 190,000 rows, the process is killed, having consumed all the RAM (PC RAM = 32 GB):

x_train len=  195260
Size of list1: 1671784bytes
Process finished with exit code 137 (interrupted by signal 9: SIGKILL)

My code is:

import gc
import os
import sys

import numpy as np
import pandas

nodules_path = "~/cropped_nodules/"
nodules_csv = pandas.read_csv("~/cropped_nodules_2.csv")

positive = 0
negative = 0
x_val = []
x_train = []
y_train = []
y_val = []

# iterrows() yields (index, row) pairs, so the row has to be unpacked
for index, nodule in nodules_csv.iterrows():

    if nodule.state == 1 and positive <= 700 and len(x_val) <= 2000:
        positive += 1
        x_val_img = str(nodule.SN) + ".npy"
        x_val.append(np.load(os.path.join(nodules_path, x_val_img)))
        y_val.append(nodule.state)

    elif nodule.state == 0 and negative <= 1300 and len(x_val) <= 2000:
        negative += 1
        x_val_img = str(nodule.SN) + ".npy"
        x_val.append(np.load(os.path.join(nodules_path, x_val_img)))
        y_val.append(nodule.state)

    else:
        if len(x_train) % 10000 == 0:
            gc.collect()
            print("gc done")
        x_train_img = str(nodule.SN) + ".npy"
        x_train.append(np.load(os.path.join(nodules_path, x_train_img)))
        y_train.append(nodule.state)
        print("x_train len= ", len(x_train))
        print("Size of list1: " + str(sys.getsizeof(x_train)) + "bytes")

I have tried the following:

  1. Calling gc manually
  2. Using df.itertuples() (see the sketch right after this list)
  3. Using the dataframe's apply()
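For reference, the itertuples() variant looked like this (a sketch; iteration itself gets cheaper because itertuples() yields lightweight namedtuples instead of per-row Series, but the appended arrays still accumulate):

for nodule in nodules_csv.itertuples(index=False):
    fname = str(nodule.SN) + ".npy"  # same attribute access as with iterrows
    # ... same branching and appending as in the loop above ...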

But the same problem occurs after ~100,000 rows.
I tried to vectorize the pandas part, but I don't know how; I suspect these conditions cannot be expressed as vectorized operations.
Is there a better way to do this?
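(On second thought, the selection conditions themselves might vectorize after all. A sketch of what I mean, assuming the first-come-first-served order of my loop is what matters; val_df and train_df are made-up names:)

val_df = pandas.concat([
    nodules_csv[nodules_csv.state == 1].head(700),   # first 700 positives
    nodules_csv[nodules_csv.state == 0].head(1300),  # first 1300 negatives
])
train_df = nodules_csv.drop(val_df.index)            # everything else

val_files = val_df.SN.astype(str) + ".npy"           # names only, no arrays yet
train_files = train_df.SN.astype(str) + ".npy"

The loading step would still have to stay lazy, as in the batching sketch above, or it hits the same memory wall.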

I tried to implement chunking as @XtianP suggested:

with pandas.read_csv("~/cropped_nodules_2.csv", chunksize=chunksize) as reader:
    for chunk in reader:
        for index, nodule in chunk.iterrows():
            if nodule.state == 1 and positive <= 700 and len(x_val) <= 2000:
........

But the same problem happened! (Maybe my implementation is incorrect.)

Maybe the problem is not pandas iterrows at all, but the lists becoming too large!
But sys.getsizeof(x_train) reports only Size of list1: 1671784bytes
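As far as I understand, sys.getsizeof on a list counts only the list object itself (its buffer of pointers), not the numpy arrays it references. A quick check of what the arrays actually occupy (reusing x_train from above):

total = sum(arr.nbytes for arr in x_train)  # bytes held by the array buffers themselves
print("x_train arrays:", total, "bytes")    # vs. the ~1.6 MB of list pointers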

I traced the memory allocations as follows:

import tracemalloc

tracemalloc.start()

# ... run your application ...

snapshot = tracemalloc.take_snapshot()
top_stats = snapshot.statistics('lineno')

print("[ Top 10 ]")
for stat in top_stats[:10]:
    print(stat)

The result is:

[ Top 10 ]
/home/mustafa/.local/lib/python3.8/site-packages/numpy/lib/format.py:741: size=11.3 GiB, count=204005, average=58.2 KiB
/home/mustafa/.local/lib/python3.8/site-packages/numpy/lib/format.py:771: size=4781 KiB, count=102002, average=48 B
/usr/local/lib/python3.8/dist-packages/pandas/core/indexes/base.py:4855: size=2391 KiB, count=102000, average=24 B
/home/mustafa/home/mustafa/project/LUNAMASK/nodule_3D_CNN.py:84: size=806 KiB, count=2, average=403 KiB
/home/mustafa/home/mustafa/project/LUNAMASK/nodule_3D_CNN.py:85: size=805 KiB, count=1, average=805 KiB
/usr/local/lib/python3.8/dist-packages/pandas/io/parsers.py:2056: size=78.0 KiB, count=2305, average=35 B
/usr/lib/python3.8/abc.py:102: size=42.5 KiB, count=498, average=87 B
/home/mustafa/.local/lib/python3.8/site-packages/numpy/core/_asarray.py:83: size=41.6 KiB, count=757, average=56 B
/usr/local/lib/python3.8/dist-packages/pandas/core/series.py:512: size=37.5 KiB, count=597, average=64 B
/usr/local/lib/python3.8/dist-packages/pandas/core/internals/managers.py:1880: size=16.5 KiB, count=5, average=3373 B

0 answers
