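I am using Python 2.7 (Anaconda 4.0) in a Jupyter notebook on an EC2 instance with plenty of RAM (60GB total, 48GB free according to free). I have loaded a large Pandas (v0.18) DataFrame (150K rows, about 30KB per row), but even with many copies made it comes nowhere near the instance's memory capacity. Certain Pandas and Scikit-learn (v0.17) calls trigger a MemoryError immediately, for example: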
from sklearn import decomposition

# X is a subset of the original df, with 60 columns instead of the 3000
# Y is a float column
X.add(Y)
# And for sklearn...
pca = decomposition.KernelPCA(n_components=5)
pca.fit(X, Y)
Meanwhile, these work fine:
Z = X.copy(deep=True)
pca = decomposition.PCA(n_components=5)
Most confusing of all, I can do this, and it finishes within a few seconds:
huge = range(1000000000)
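I have restarted the notebook, the kernel, and the instance, but the same calls keep raising MemoryError. I have also verified that I am using 64-bit Python. Any suggestions?
Update: adding the error tracebacks: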
Traceback (most recent call last):
File "<ipython-input-9-ae71777140e2>", line 2, in <module>
Z = X.add(Y)
File "/home/ec2-user/anaconda2/lib/python2.7/site-packages/pandas/core/ops.py", line 1057, in f
return self._combine_series(other, na_op, fill_value, axis, level)
File "/home/ec2-user/anaconda2/lib/python2.7/site-packages/pandas/core/frame.py", line 3500, in _combine_series
fill_value=fill_value)
File "/home/ec2-user/anaconda2/lib/python2.7/site-packages/pandas/core/frame.py", line 3528, in _combine_match_columns
copy=False)
File "/home/ec2-user/anaconda2/lib/python2.7/site-packages/pandas/core/frame.py", line 2730, in align
broadcast_axis=broadcast_axis)
File "/home/ec2-user/anaconda2/lib/python2.7/site-packages/pandas/core/generic.py", line 4152, in align
fill_axis=fill_axis)
File "/home/ec2-user/anaconda2/lib/python2.7/site-packages/pandas/core/generic.py", line 4234, in _align_series
fdata = fdata.reindex_indexer(join_index, lidx, axis=0)
File "/home/ec2-user/anaconda2/lib/python2.7/site-packages/pandas/core/internals.py", line 3528, in reindex_indexer
fill_tuple=(fill_value,))
File "/home/ec2-user/anaconda2/lib/python2.7/site-packages/pandas/core/internals.py", line 3591, in _slice_take_blocks_ax0
fill_value=fill_value))
File "/home/ec2-user/anaconda2/lib/python2.7/site-packages/pandas/core/internals.py", line 3621, in _make_na_block
block_values = np.empty(block_shape, dtype=dtype)
MemoryError
and
Traceback (most recent call last):
File "<ipython-input-13-d510bc16443e>", line 3, in <module>
pca.fit(X,Y)
File "/home/ec2-user/anaconda2/lib/python2.7/site-packages/sklearn/decomposition/kernel_pca.py", line 202, in fit
K = self._get_kernel(X)
File "/home/ec2-user/anaconda2/lib/python2.7/site-packages/sklearn/decomposition/kernel_pca.py", line 135, in _get_kernel
filter_params=True, **params)
File "/home/ec2-user/anaconda2/lib/python2.7/site-packages/sklearn/metrics/pairwise.py", line 1347, in pairwise_kernels
return _parallel_pairwise(X, Y, func, n_jobs, **kwds)
File "/home/ec2-user/anaconda2/lib/python2.7/site-packages/sklearn/metrics/pairwise.py", line 1054, in _parallel_pairwise
return func(X, Y, **kwds)
File "/home/ec2-user/anaconda2/lib/python2.7/site-packages/sklearn/metrics/pairwise.py", line 716, in linear_kernel
return safe_sparse_dot(X, Y.T, dense_output=True)
File "/home/ec2-user/anaconda2/lib/python2.7/site-packages/sklearn/utils/extmath.py", line 184, in safe_sparse_dot
return fast_dot(a, b)
MemoryError
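Answer 0 (score: 0)
I figured out the Pandas side of the problem. I had a DataFrame and a Series with matching indexes, X and Y. I thought I could add Y as another column like this: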
X.add(Y)
But doing so tries to match Y against the columns rather than against the index, which creates a 150K x 150K array. I needed to supply the axis:
X.add(Y, axis='index')
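To make the difference concrete, here is a minimal sketch on a tiny frame. The row and column labels, the sizes, and the helper dense_float64_gb are hypothetical stand-ins rather than the data from the question, and the snippet assumes a reasonably recent pandas. It shows that the default call takes the union of the column labels and the Series index, while axis='index' aligns on row labels and keeps the original shape.

```python
import numpy as np
import pandas as pd

# Hypothetical small stand-ins for X (DataFrame) and Y (Series) from the question.
rows = ["r0", "r1", "r2", "r3", "r4"]
X = pd.DataFrame(np.ones((5, 3)), index=rows, columns=["a", "b", "c"])
Y = pd.Series(np.arange(5, dtype=float), index=rows)

# Default: Y's index is aligned against X's columns. No labels match, so the
# result has the union of both label sets as columns and is filled with NaN.
wide = X.add(Y)

# axis='index': Y is aligned on the row labels and added to every column.
tall = X.add(Y, axis="index")

print(wide.shape)  # 5 rows, 3 + 5 columns
print(tall.shape)  # same shape as X


def dense_float64_gb(n_rows, n_cols):
    """Approximate size in GB of a dense float64 array (8 bytes per value)."""
    return n_rows * n_cols * 8 / 1e9


# Rough size of a 150K x 150K float64 block, as in the failing call.
print(dense_float64_gb(150000, 150000))
```

At 8 bytes per value, a 150K x 150K block comes to roughly 180GB, well beyond the instance's 60GB, which would explain why np.empty fails immediately. The same arithmetic would presumably apply to the KernelPCA traceback, since a kernel matrix over 150K samples is also 150K x 150K, though the answer above only addresses the Pandas call.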