Python dask in a for loop: memory overload

Date: 2018-12-29 11:57:59

Tags: python-3.x dask

I am running the following code with a Python dask dataframe, but I only have 4 GB of RAM, so it fails with a memory error. I would like to know whether dask can use disk instead of memory, since my machine has so little. Also, when training a machine learning model we have to load all of the data at once to fit the model; that step cannot be parallelized (unless it is an ensemble model). So I wonder whether it is possible to run this from disk when the computer does not have enough memory (I am running the code on a single machine). All of the data comes from the Kaggle Elo merchant competition.

Here is the code:

merchant_headers = historical_transactions.columns.values.tolist()

# For every column, print its value counts and its number of NaN values.
# Each .compute() materializes a full pandas result in memory.
for header in merchant_headers:
    print(header)
    print('--------------------')
    print("{}\n".format(historical_transactions[header].value_counts().compute()))
    print("Number of NaN values {}\n".format(historical_transactions[header].isnull().sum().compute()))

Here is the last part of the output:

distributed.worker - WARNING - Memory use is high but worker has no data to store to disk.  Perhaps some other process is leaking memory?  Process memory: 2.89 GB -- Worker memory limit: 4.00 GB
tornado.application - ERROR - Exception in callback functools.partial(<function wrap.<locals>.null_wrapper at 0x7f570ed9a510>, <Future finished exception=TypeError('Could not serialize object of type Series.', '2017-11-24 00:00:00    26184\n2017-11-25 00:00:00    26182\n2017-11-18 00:00:00    24498\n2017-11-30 00:00:00    24430\n2017-11-17 00:00:00    23260\n2017-11-14 00:00:00    21979\n2017-11-23 00:00:00    21830\n2017-11-29 00:00:00    21137\n2017-11-16 00:00:00    20678\n2017-11-22 00:00:00    20404\n2017-11-28 00:00:00    20122\n2017-11-21 00:00:00    19618\n2017-11-03 00:00:00    19244\n2017-11-27 00:00:00    19052\n2017-10-28 00:00:00    18712\n2017-11-04 00:00:00    18455\n2017-11-01 00:00:00    18278\n2017-11-09 00:00:00    18255\n2017-11-13 00:00:00    18177\n2017-11-08 00:00:00    17740\n2017-11-20 00:00:00    17529\n2017-11-07 00:00:00    17496\n2017-10-27 00:00:00    17295\n2017-11-15 00:00:00    16460\n2017-10-31 00:00:00    16246\n2017-11-26 00:00:00    15677\n2017-10-21 00:00:00    15536\n2017-11-19 00:00:00    15118\n2017-10-26 00:00:00    14967\n2017-11-06 00:00:00    14921\n                       ...  \n2017-09-16 00:06:58        1\n2017-09-16 00:06:57        1\n2017-09-16 00:06:53        1\n2017-09-16 00:06:50        1\n2017-09-16 00:06:48        1\n2017-09-16 00:06:44        1\n2017-09-16 00:06:43        1\n2017-09-16 00:06:38        1\n2017-09-16 00:06:36        1\n2017-09-16 00:07:19        1\n2017-09-16 00:07:25        1\n2017-09-16 00:07:26        1\n2017-09-16 00:07:53        1\n2017-09-16 00:08:17        1\n2017-09-16 00:08:16        1\n2017-09-16 00:08:15        1\n2017-09-16 00:08:12        1\n2017-09-16 00:08:09        1\n2017-09-16 00:07:59        1\n2017-09-16 00:07:56        1\n2017-09-16 00:07:50        1\n2017-09-16 00:07:32        1\n2017-09-16 00:07:49        1\n2017-09-16 00:07:48        1\n2017-09-16 00:07:39        1\n2017-09-16 00:07:38        1\n2017-09-16 00:07:37        1\n2017-09-16 00:07:36        1\n2017-09-16 00:07:33        1\n2017-01-01 00:00:08        1\nName: purchase_date, Length: 16395300, dtype: int64')>)
Traceback (most recent call last):
  File "/home/michael/env/lib/python3.5/site-packages/tornado/ioloop.py", line 759, in _run_callback
    ret = callback()
  File "/home/michael/env/lib/python3.5/site-packages/tornado/stack_context.py", line 276, in null_wrapper
    return fn(*args, **kwargs)
  File "/home/michael/env/lib/python3.5/site-packages/tornado/ioloop.py", line 780, in _discard_future_result
    future.result()
  File "/usr/lib/python3.5/asyncio/futures.py", line 293, in result
    raise self._exception
  File "/home/michael/env/lib/python3.5/site-packages/tornado/gen.py", line 315, in wrapper
    yielded = next(result)
  File "/home/michael/env/lib/python3.5/site-packages/distributed/worker.py", line 2186, in memory_monitor
    k, v, weight = self.data.fast.evict()
  File "/home/michael/env/lib/python3.5/site-packages/zict/lru.py", line 90, in evict
    cb(k, v)
  File "/home/michael/env/lib/python3.5/site-packages/zict/buffer.py", line 52, in fast_to_slow
    self.slow[key] = value
  File "/home/michael/env/lib/python3.5/site-packages/zict/func.py", line 42, in __setitem__
    self.d[key] = self.dump(value)
  File "/home/michael/env/lib/python3.5/site-packages/distributed/protocol/serialize.py", line 346, in serialize_bytelist
    header, frames = serialize(x, **kwargs)
  File "/home/michael/env/lib/python3.5/site-packages/distributed/protocol/serialize.py", line 163, in serialize
    raise TypeError(msg, str(x)[:10000])
TypeError: ('Could not serialize object of type Series.', '2017-11-24 00:00:00    26184\n2017-11-25 00:00:00    26182\n2017-11-18 00:00:00    24498\n2017-11-30 00:00:00    24430\n2017-11-17 00:00:00    23260\n2017-11-14 00:00:00    21979\n2017-11-23 00:00:00    21830\n2017-11-29 00:00:00    21137\n2017-11-16 00:00:00    20678\n2017-11-22 00:00:00    20404\n2017-11-28 00:00:00    20122\n2017-11-21 00:00:00    19618\n2017-11-03 00:00:00    19244\n2017-11-27 00:00:00    19052\n2017-10-28 00:00:00    18712\n2017-11-04 00:00:00    18455\n2017-11-01 00:00:00    18278\n2017-11-09 00:00:00    18255\n2017-11-13 00:00:00    18177\n2017-11-08 00:00:00    17740\n2017-11-20 00:00:00    17529\n2017-11-07 00:00:00    17496\n2017-10-27 00:00:00    17295\n2017-11-15 00:00:00    16460\n2017-10-31 00:00:00    16246\n2017-11-26 00:00:00    15677\n2017-10-21 00:00:00    15536\n2017-11-19 00:00:00    15118\n2017-10-26 00:00:00    14967\n2017-11-06 00:00:00    14921\n                       ...  \n2017-09-16 00:06:58        1\n2017-09-16 00:06:57        1\n2017-09-16 00:06:53        1\n2017-09-16 00:06:50        1\n2017-09-16 00:06:48        1\n2017-09-16 00:06:44        1\n2017-09-16 00:06:43        1\n2017-09-16 00:06:38        1\n2017-09-16 00:06:36        1\n2017-09-16 00:07:19        1\n2017-09-16 00:07:25        1\n2017-09-16 00:07:26        1\n2017-09-16 00:07:53        1\n2017-09-16 00:08:17        1\n2017-09-16 00:08:16        1\n2017-09-16 00:08:15        1\n2017-09-16 00:08:12        1\n2017-09-16 00:08:09        1\n2017-09-16 00:07:59        1\n2017-09-16 00:07:56        1\n2017-09-16 00:07:50        1\n2017-09-16 00:07:32        1\n2017-09-16 00:07:49        1\n2017-09-16 00:07:48        1\n2017-09-16 00:07:39        1\n2017-09-16 00:07:38        1\n2017-09-16 00:07:37        1\n2017-09-16 00:07:36        1\n2017-09-16 00:07:33        1\n2017-01-01 00:00:08        1\nName: purchase_date, Length: 16395300, dtype: int64')

distributed.scheduler - ERROR - Workers don't have promised key: ['inproc://192.168.1.2/1271/2'], ('value-counts-agg-ef7612c5460d97ec66a1f75a33de8c6d', 0)
NoneType: None
distributed.client - WARNING - Couldn't gather 1 keys, rescheduling {"('value-counts-agg-ef7612c5460d97ec66a1f75a33de8c6d', 0)": ['inproc://192.168.1.2/1271/2']}

Can someone help me solve this problem?

Thanks,

Michael

1 Answer:

Answer 0 (score: 1)

Increase the number of partitions so that each partition is around 100 MB.

There are two ways to do this. When loading the data:

import dask.dataframe as dd

ddf = dd.from_pandas(df, npartitions=100)
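Note that dd.from_pandas only helps if the pandas DataFrame df already fits in memory. With 4 GB of RAM it is usually better to let dask read the files itself in fixed-size blocks. A minimal sketch, assuming the data sits in a CSV file (the file name here is an assumption):

import dask.dataframe as dd

# Hypothetical sketch: read the CSV in ~100 MB blocks, one partition per
# block, so no step has to hold the whole dataset in memory at once.
historical_transactions = dd.read_csv(
    'historical_transactions.csv',  # assumed path to the Kaggle data
    blocksize=100 * 10**6,          # ~100 MB per partition
)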

Or repartition an existing dask dataframe afterwards:

ddf = ddf.repartition(npartitions=100)
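If the work still does not fit, you can also cap worker memory and let dask spill partitions to disk, which addresses the "use disk instead of memory" part of the question. A rough sketch, assuming a recent dask.distributed; the limits, the local_directory, and the partition_size argument are assumptions to check against your version:

from dask.distributed import Client, LocalCluster

# Hypothetical sketch: keep the worker memory limit below physical RAM and
# give the worker a directory to spill excess partitions to when it fills up.
cluster = LocalCluster(
    n_workers=1,
    threads_per_worker=2,
    memory_limit='3GB',           # stay under the machine's 4 GB of RAM
    local_directory='/tmp/dask',  # where spilled partitions are written
)
client = Client(cluster)

# Newer dask versions can also target a partition size directly:
ddf = ddf.repartition(partition_size='100MB')

Smaller partitions mean smaller intermediate results, which should be easier for the worker to serialize and spill to disk.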