I have the following packages installed:
python: 2.7.3.final.0
python-bits: 64
OS: Linux
machine: x86_64
processor: x86_64
byteorder: little
pandas: 0.13.1
Here is the DataFrame info:
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 421570 entries, 2010-02-05 00:00:00 to 2012-10-26 00:00:00
Data columns (total 5 columns):
Store 421570 non-null int64
Dept 421570 non-null int64
Weekly_Sales 421570 non-null float64
IsHoliday 421570 non-null bool
Date_Str 421570 non-null object
dtypes: bool(1), float64(1), int64(2), object(1)
Here is what the data looks like:
Store,Dept,Date,Weekly_Sales,IsHoliday
1,1,2010-02-05,24924.5,FALSE
1,1,2010-02-12,46039.49,TRUE
1,1,2010-02-19,41595.55,FALSE
1,1,2010-02-26,19403.54,FALSE
1,1,2010-03-05,21827.9,FALSE
1,1,2010-03-12,21043.39,FALSE
1,1,2010-03-19,22136.64,FALSE
1,1,2010-03-26,26229.21,FALSE
1,1,2010-04-02,57258.43,FALSE
I load and index the file as follows:
import pandas as pd

df_train = pd.read_csv('train.csv')
df_train['Date_Str'] = df_train['Date']              # keep the original date string
df_train['Date'] = pd.to_datetime(df_train['Date'])  # parse to datetime64
df_train = df_train.set_index(['Date'])              # index on the (non-unique) dates
When I run either of the following on the ~400K-row file,
df_train['_id'] = df_train['Store'].astype(str) + '_' + df_train['Dept'].astype(str) + '_' + df_train['Date_Str'].astype(str)
or
df_train['try'] = df_train['Store'] * df_train['Dept']
it fails with this error:
Traceback (most recent call last):
  File "rock.py", line 85, in <module>
    rock.pandasTest()
  File "rock.py", line 31, in pandasTest
    df_train['_id'] = df_train['Store'].astype(str) +'_' + df_train['Dept'].astype('str')
  File "/usr/local/lib/python2.7/dist-packages/pandas-0.13.1-py2.7-linux-x86_64.egg/pandas/core/ops.py", line 480, in wrapper
    return_indexers=True)
  File "/usr/local/lib/python2.7/dist-packages/pandas-0.13.1-py2.7-linux-x86_64.egg/pandas/tseries/index.py", line 976, in join
    return_indexers=return_indexers)
  File "/usr/local/lib/python2.7/dist-packages/pandas-0.13.1-py2.7-linux-x86_64.egg/pandas/core/index.py", line 1304, in join
    return_indexers=return_indexers)
  File "/usr/local/lib/python2.7/dist-packages/pandas-0.13.1-py2.7-linux-x86_64.egg/pandas/core/index.py", line 1345, in _join_non_unique
    how=how, sort=True)
  File "/usr/local/lib/python2.7/dist-packages/pandas-0.13.1-py2.7-linux-x86_64.egg/pandas/tools/merge.py", line 465, in _get_join_indexers
    return join_func(left_group_key, right_group_key, max_groups)
  File "join.pyx", line 152, in pandas.algos.full_outer_join (pandas/algos.c:34716)
MemoryError
However, the same code works fine on a small file.
Answer 0 (score: 2)
I can also reproduce this on 0.13.1, but the problem does not occur on 0.12 or 0.14 (released yesterday), so it appears to be a bug in 0.13 (the traceback shows the element-wise + ends up in a full outer join over the non-unique DatetimeIndex, which is where memory blows up).
So try upgrading your pandas version, because on 0.14 the vectorized approach is much faster than apply (5s vs > 1min on my machine) and uses far less peak memory (200MB vs 980MB, measured with %memit).
Using your sample data repeated 50,000 times (giving a ~450k-row DataFrame), and using @jsalonen's apply_id function:
In [23]: pd.__version__
Out[23]: '0.14.0'
In [24]: %timeit df_train['Store'].astype(str) +'_' + df_train['Dept'].astype(str)+'_'+ df_train['Date_Str'].astype(str)
1 loops, best of 3: 5.42 s per loop
In [25]: %timeit df_train.apply(apply_id, 1)
1 loops, best of 3: 1min 11s per loop
In [26]: %load_ext memory_profiler
In [27]: %memit df_train['Store'].astype(str) +'_' + df_train['Dept'].astype(str)+'_'+ df_train['Date_Str'].astype(str)
peak memory: 201.75 MiB, increment: 0.01 MiB
In [28]: %memit df_train.apply(apply_id, 1)
peak memory: 982.56 MiB, increment: 780.79 MiB
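For reference, a minimal sketch of how such a test frame can be put together from the nine sample rows above; the file name sample.csv and the exact construction are assumptions, not details given in the answer:

import pandas as pd

# Assumption: sample.csv contains the nine example rows shown in the question.
df_sample = pd.read_csv('sample.csv')

# Repeat the sample 50,000 times (9 * 50,000 = 450,000 rows) and rebuild the
# same DatetimeIndex layout the question uses.
df_train = pd.concat([df_sample] * 50000, ignore_index=True)
df_train['Date_Str'] = df_train['Date']
df_train['Date'] = pd.to_datetime(df_train['Date'])
df_train = df_train.set_index(['Date'])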
Answer 1 (score: 1)
Try generating the _id field with a DataFrame.apply call:
def apply_id(x):
    # build the composite key for a single row
    x['_id'] = "{}_{}_{}".format(x['Store'], x['Dept'], x['Date_Str'])
    return x

df_train = df_train.apply(apply_id, 1)
With apply, the id generation is executed one row at a time, which keeps the memory allocation overhead down.
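If upgrading is not an option, the alignment join can also be avoided by building the key from the underlying arrays rather than adding Series together; this is only a sketch of an alternative workaround, not something taken from either answer:

# Build the composite key outside pandas' index-alignment machinery, so the
# full_outer_join over the non-unique DatetimeIndex is never triggered.
df_train['_id'] = ['{}_{}_{}'.format(s, d, t)
                   for s, d, t in zip(df_train['Store'].values,
                                      df_train['Dept'].values,
                                      df_train['Date_Str'].values)]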