I am running the following code with Python dask.dataframe, but I only have 4 GB of RAM, so it fails with a memory error. I would like to know whether Dask can use disk instead of memory, since my machine has little RAM. Also, when training a machine learning model we have to load the whole dataset at once to fit the model; that step cannot be parallelized (unless it is an ensemble model). So I would like to know whether it is possible to run this on disk when the machine does not have enough memory (I am running the code on a single machine). All the data comes from the Kaggle Elo merchant competition.
The code is below:
# Loop over every column of the Dask dataframe and print its value counts
# and the number of missing values.
merchant_headers = historical_transactions.columns.values.tolist()
for c in range(len(merchant_headers)):
    print(merchant_headers[c])
    print('--------------------')
    print("{}".format(historical_transactions[merchant_headers[c]].value_counts().compute()) + '\n')
    print("Number of NaN values {}".format(historical_transactions[merchant_headers[c]].isnull().sum().compute()) + '\n')
Below is the last part of the output:
distributed.worker - WARNING - Memory use is high but worker has no data to store to disk. Perhaps some other process is leaking memory? Process memory: 2.89 GB -- Worker memory limit: 4.00 GB
tornado.application - ERROR - Exception in callback functools.partial(<function wrap.<locals>.null_wrapper at 0x7f570ed9a510>, <Future finished exception=TypeError('Could not serialize object of type Series.', '2017-11-24 00:00:00 26184\n2017-11-25 00:00:00 26182\n2017-11-18 00:00:00 24498\n2017-11-30 00:00:00 24430\n2017-11-17 00:00:00 23260\n2017-11-14 00:00:00 21979\n2017-11-23 00:00:00 21830\n2017-11-29 00:00:00 21137\n2017-11-16 00:00:00 20678\n2017-11-22 00:00:00 20404\n2017-11-28 00:00:00 20122\n2017-11-21 00:00:00 19618\n2017-11-03 00:00:00 19244\n2017-11-27 00:00:00 19052\n2017-10-28 00:00:00 18712\n2017-11-04 00:00:00 18455\n2017-11-01 00:00:00 18278\n2017-11-09 00:00:00 18255\n2017-11-13 00:00:00 18177\n2017-11-08 00:00:00 17740\n2017-11-20 00:00:00 17529\n2017-11-07 00:00:00 17496\n2017-10-27 00:00:00 17295\n2017-11-15 00:00:00 16460\n2017-10-31 00:00:00 16246\n2017-11-26 00:00:00 15677\n2017-10-21 00:00:00 15536\n2017-11-19 00:00:00 15118\n2017-10-26 00:00:00 14967\n2017-11-06 00:00:00 14921\n ... \n2017-09-16 00:06:58 1\n2017-09-16 00:06:57 1\n2017-09-16 00:06:53 1\n2017-09-16 00:06:50 1\n2017-09-16 00:06:48 1\n2017-09-16 00:06:44 1\n2017-09-16 00:06:43 1\n2017-09-16 00:06:38 1\n2017-09-16 00:06:36 1\n2017-09-16 00:07:19 1\n2017-09-16 00:07:25 1\n2017-09-16 00:07:26 1\n2017-09-16 00:07:53 1\n2017-09-16 00:08:17 1\n2017-09-16 00:08:16 1\n2017-09-16 00:08:15 1\n2017-09-16 00:08:12 1\n2017-09-16 00:08:09 1\n2017-09-16 00:07:59 1\n2017-09-16 00:07:56 1\n2017-09-16 00:07:50 1\n2017-09-16 00:07:32 1\n2017-09-16 00:07:49 1\n2017-09-16 00:07:48 1\n2017-09-16 00:07:39 1\n2017-09-16 00:07:38 1\n2017-09-16 00:07:37 1\n2017-09-16 00:07:36 1\n2017-09-16 00:07:33 1\n2017-01-01 00:00:08 1\nName: purchase_date, Length: 16395300, dtype: int64')>)
Traceback (most recent call last):
File "/home/michael/env/lib/python3.5/site-packages/tornado/ioloop.py", line 759, in _run_callback
ret = callback()
File "/home/michael/env/lib/python3.5/site-packages/tornado/stack_context.py", line 276, in null_wrapper
return fn(*args, **kwargs)
File "/home/michael/env/lib/python3.5/site-packages/tornado/ioloop.py", line 780, in _discard_future_result
future.result()
File "/usr/lib/python3.5/asyncio/futures.py", line 293, in result
raise self._exception
File "/home/michael/env/lib/python3.5/site-packages/tornado/gen.py", line 315, in wrapper
yielded = next(result)
File "/home/michael/env/lib/python3.5/site-packages/distributed/worker.py", line 2186, in memory_monitor
k, v, weight = self.data.fast.evict()
File "/home/michael/env/lib/python3.5/site-packages/zict/lru.py", line 90, in evict
cb(k, v)
File "/home/michael/env/lib/python3.5/site-packages/zict/buffer.py", line 52, in fast_to_slow
self.slow[key] = value
File "/home/michael/env/lib/python3.5/site-packages/zict/func.py", line 42, in __setitem__
self.d[key] = self.dump(value)
File "/home/michael/env/lib/python3.5/site-packages/distributed/protocol/serialize.py", line 346, in serialize_bytelist
header, frames = serialize(x, **kwargs)
File "/home/michael/env/lib/python3.5/site-packages/distributed/protocol/serialize.py", line 163, in serialize
raise TypeError(msg, str(x)[:10000])
TypeError: ('Could not serialize object of type Series.', '2017-11-24 00:00:00 26184\n2017-11-25 00:00:00 26182\n2017-11-18 00:00:00 24498\n2017-11-30 00:00:00 24430\n2017-11-17 00:00:00 23260\n2017-11-14 00:00:00 21979\n2017-11-23 00:00:00 21830\n2017-11-29 00:00:00 21137\n2017-11-16 00:00:00 20678\n2017-11-22 00:00:00 20404\n2017-11-28 00:00:00 20122\n2017-11-21 00:00:00 19618\n2017-11-03 00:00:00 19244\n2017-11-27 00:00:00 19052\n2017-10-28 00:00:00 18712\n2017-11-04 00:00:00 18455\n2017-11-01 00:00:00 18278\n2017-11-09 00:00:00 18255\n2017-11-13 00:00:00 18177\n2017-11-08 00:00:00 17740\n2017-11-20 00:00:00 17529\n2017-11-07 00:00:00 17496\n2017-10-27 00:00:00 17295\n2017-11-15 00:00:00 16460\n2017-10-31 00:00:00 16246\n2017-11-26 00:00:00 15677\n2017-10-21 00:00:00 15536\n2017-11-19 00:00:00 15118\n2017-10-26 00:00:00 14967\n2017-11-06 00:00:00 14921\n ... \n2017-09-16 00:06:58 1\n2017-09-16 00:06:57 1\n2017-09-16 00:06:53 1\n2017-09-16 00:06:50 1\n2017-09-16 00:06:48 1\n2017-09-16 00:06:44 1\n2017-09-16 00:06:43 1\n2017-09-16 00:06:38 1\n2017-09-16 00:06:36 1\n2017-09-16 00:07:19 1\n2017-09-16 00:07:25 1\n2017-09-16 00:07:26 1\n2017-09-16 00:07:53 1\n2017-09-16 00:08:17 1\n2017-09-16 00:08:16 1\n2017-09-16 00:08:15 1\n2017-09-16 00:08:12 1\n2017-09-16 00:08:09 1\n2017-09-16 00:07:59 1\n2017-09-16 00:07:56 1\n2017-09-16 00:07:50 1\n2017-09-16 00:07:32 1\n2017-09-16 00:07:49 1\n2017-09-16 00:07:48 1\n2017-09-16 00:07:39 1\n2017-09-16 00:07:38 1\n2017-09-16 00:07:37 1\n2017-09-16 00:07:36 1\n2017-09-16 00:07:33 1\n2017-01-01 00:00:08 1\nName: purchase_date, Length: 16395300, dtype: int64')
distributed.scheduler - ERROR - Workers don't have promised key: ['inproc://192.168.1.2/1271/2'], ('value-counts-agg-ef7612c5460d97ec66a1f75a33de8c6d', 0)
NoneType: None
distributed.client - WARNING - Couldn't gather 1 keys, rescheduling {"('value-counts-agg-ef7612c5460d97ec66a1f75a33de8c6d', 0)": ['inproc://192.168.1.2/1271/2']}
Can someone help me fix this problem?
Thanks
Michael
Answer 0 (score: 1)
Increase the number of partitions so that each partition is about 100 MB.
There are two ways to do this. Either when you load the data into a Dask dataframe:
ddf = dd.from_pandas(df, npartitions=100)
or by repartitioning it afterwards:
ddf = ddf.repartition(npartitions=100)
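As a minimal sketch of how this might look with the question's data (the CSV file name and the 100 MB block size are assumptions for illustration, not taken from the original post):

import dask.dataframe as dd

# Option 1: choose the partition size while reading the CSV.
# blocksize is the approximate number of bytes per partition.
historical_transactions = dd.read_csv(
    "historical_transactions.csv",  # assumed file name from the Kaggle Elo data
    blocksize=100000000,            # roughly 100 MB per partition
)

# Option 2: repartition an existing Dask dataframe after it has been created.
historical_transactions = historical_transactions.repartition(npartitions=100)

With smaller partitions, each worker only needs to hold one partition's intermediate results in memory at a time, which should make it easier to stay under the 4 GB limit.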