I am writing a very simple Dask program that first just reads CSV files from disk, then sets the index of the result on phone_number, and finally writes the result to Parquet files. However, if I chain the index call directly onto read_csv, the program fails. If instead I first read the CSV and write it straight to Parquet, and then start a second program that reads that Parquet back and sets the index, it runs successfully. This confuses me.
My Dask cluster runs on two nodes. Each node runs one worker with 4 processes, and each process has 1 thread. Each process gets 4GB of memory; each node has 32GB in total. There are 60 input files, each about 142MB.
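For reference, the per-node worker layout roughly corresponds to the following local sketch (this uses LocalCluster for illustration only; the real cluster spans two machines, so the scheduler address and startup method here are assumptions, not my actual deployment):

from dask.distributed import Client, LocalCluster

# Approximate one node of the cluster: 4 single-threaded worker processes,
# each limited to 4GB of memory.
cluster = LocalCluster(n_workers=4, threads_per_worker=1, memory_limit="4GB")
client = Client(cluster)
print(client)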
import dask.dataframe as dd

# the code below fails
df = dd.read_csv(raw_user_info_path + "/*", sep="|", header=None, names=field_names, blocksize="100MB").set_index("phone_number")
df.to_parquet(user_info_sep_dir)
The workers throw exceptions like the following:
distributed.nanny - WARNING - Worker exceeded 95% memory budget. Restarting
distributed.nanny - INFO - Worker process 49377 was killed by signal 15
distributed.worker - ERROR - Worker stream died during communication: tcp://192.168.0.1:45595
Traceback (most recent call last):
File "/home/lodap/anaconda3/lib/python3.7/site-packages/distributed/comm/tcp.py", line 188, in read
n_frames = await stream.read_bytes(8)
tornado.iostream.StreamClosedError: Stream is closed
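The 95% in the first warning is the default worker termination threshold in distributed. As a side note, these thresholds live in the Dask config and can be inspected like this (a sketch only; I have not changed any of these defaults):

import dask
import distributed  # importing distributed registers its config defaults

# Fractions of the per-worker memory limit at which the worker reacts.
print(dask.config.get("distributed.worker.memory.target"))     # start spilling, default 0.6
print(dask.config.get("distributed.worker.memory.spill"))      # spill to disk, default 0.7
print(dask.config.get("distributed.worker.memory.pause"))      # pause accepting work, default 0.8
print(dask.config.get("distributed.worker.memory.terminate"))  # restart the worker, default 0.95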
Then I split the program into two parts:
1.py is as follows:
df = dd.read_csv(raw_user_info_path + "/*", sep="|", header=None, names=field_names, blocksize="100MB")
df.to_parquet(tmp)
2.py is as follows:
df = dd.read_parquet(tmp).set_index("phone_number")
df.to_parquet(final_dir)
If I run these two programs one after the other, everything succeeds.
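For completeness, here is what the working path looks like when both stages are written out in one script (the paths and column names below are hypothetical placeholders standing in for the real values used above):

import dask.dataframe as dd

# Hypothetical placeholder values for the variables used in the snippets above.
raw_user_info_path = "/data/raw_user_info"
field_names = ["phone_number", "name", "age"]  # the real files have more columns
tmp = "/data/tmp_user_info_parquet"
final_dir = "/data/user_info_by_phone"

# Stage 1 (same as 1.py): plain CSV -> Parquet copy, no index work.
df = dd.read_csv(raw_user_info_path + "/*", sep="|", header=None,
                 names=field_names, blocksize="100MB")
df.to_parquet(tmp)

# Stage 2 (same as 2.py): read the intermediate Parquet back and set the index.
df = dd.read_parquet(tmp).set_index("phone_number")
df.to_parquet(final_dir)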