This is a follow-up question to the answer to this question:
pandas performance issue - need help to optimize
The following suggestion works:
import numpy as np
from pandas import DataFrame
df = DataFrame(np.arange(20).reshape(5,4))
df2 = df.set_index(keys=[0,1,2])
df2.ix[(4,5,6)]
which uses a MultiIndex.
So I created a file, sample_data.csv, that looks like this:
col1,col2,year,amount
111111,3.5,2012,700
111112,3.5,2011,600
222221,4.0,2012,222
...
Then I ran the following:
import numpy as np
import pandas as pd
sd=pd.read_csv('sample_data.csv')
sd2=sd.set_index(keys=['col2','year'])
sd2.ix[(4.0,2012)]
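But this produces the following error: IndexError: index out of bounds
Any ideas why it works in the former case but not in the latter? This is what the error looks like: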
IndexError Traceback (most recent call last)
<ipython-input-19-1d72a961db95> in <module>()
----> 1 sd2.ix[(4.0,2012)]
/Library/Python/2.7/site-packages/pandas-0.8.1-py2.7-macosx-10.7-intel.egg/pandas/core/indexing.pyc in __getitem__(self, key)
31 pass
32
---> 33 return self._getitem_tuple(key)
34 else:
35 return self._getitem_axis(key, axis=0)
Answer 0 (score: 1)
It works for me (pandas 0.10.1):
In [1]: from StringIO import StringIO
In [2]: import numpy as np
In [3]: import pandas as pd
In [4]: s = StringIO("""col1,col2,year,amount
...: 111111,3.5,2012,700
...: 111112,3.5,2011,600
...: 222221,4.0,2012,222""")
In [5]: sd=pd.read_csv(s)
In [6]: sd2=sd.set_index(keys=['col2','year'])
In [7]: sd2.ix[(4.0,2012)]
Out[7]:
col1 222221
amount 222
Name: (4.0, 2012)
However, if I add a row with a duplicate index, I get the same error too:
In [8]: s = StringIO("""col1,col2,year,amount
...: 111111,3.5,2012,700
...: 111112,3.5,2011,600
...: 222221,4.0,2012,222
...: 222221,4.0,2012,223""")
In [9]: sd=pd.read_csv(s)
In [10]: sd2=sd.set_index(keys=['col2','year'])
In [11]: sd2.ix[(4.0,2012)]
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
<ipython-input-7-1b787a1d99df> in <module>()
----> 1 sd2.ix[(4.0,2012)]
C:\Python27\lib\site-packages\pandas\core\indexing.pyc in __getitem__(self, key)
32 pass
33
---> 34 return self._getitem_tuple(key)
35 else:
36 return self._getitem_axis(key, axis=0)
...
IndexError: index out of bounds
Is it possible that you have duplicate values in ('col2', 'year')?
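One way to check this on your own data is sketched below. It assumes a more recent pandas than the 0.10.1 session above: `StringIO` from `io`, the `keep=False` argument, and the variable names `csv_text` and `dupes` are my additions for illustration, not part of the original session.

```python
from io import StringIO

import pandas as pd

# Same sample as above, including the row that repeats (4.0, 2012).
csv_text = """col1,col2,year,amount
111111,3.5,2012,700
111112,3.5,2011,600
222221,4.0,2012,222
222221,4.0,2012,223"""

sd = pd.read_csv(StringIO(csv_text))

# keep=False flags every row whose (col2, year) pair occurs more than once,
# not just the second and later copies.
dupes = sd[sd.duplicated(subset=['col2', 'year'], keep=False)]
print(dupes)

# The same question can be asked of the index once it has been set.
sd2 = sd.set_index(['col2', 'year'])
print(sd2.index.is_unique)
```

If `dupes` comes back empty and `is_unique` is True, duplicate keys are probably not the cause.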
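I don't know whether this is a bug or just a constraint of MultiIndex (but in that case, I think the error message could be clearer). Either way, you can drop the duplicate values before setting the index, like this: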
In [21]: sd=pd.read_csv(s)
In [22]: sd = sd.drop_duplicates(['col2', 'year'])
In [23]: sd2=sd.set_index(keys=['col2','year'])
In [24]: sd2.ix[(4.0,2012)]
Out[24]:
col1 222221
amount 222
Name: (4.0, 2012)
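For readers on current pandas, where `.ix` has been removed (it was deprecated in 0.20 and dropped in 1.0), here is a rough equivalent using `.loc`. This is only a sketch: the names `csv_text`, `unique`, `dup`, `row` and `rows` are hypothetical, and the second option (keeping the duplicates and sorting the index) is an alternative I am adding, not something from the session above.

```python
from io import StringIO

import pandas as pd

csv_text = """col1,col2,year,amount
111111,3.5,2012,700
111112,3.5,2011,600
222221,4.0,2012,222
222221,4.0,2012,223"""

sd = pd.read_csv(StringIO(csv_text))

# Option 1: keep the first row per (col2, year) pair, then set the index.
unique = sd.drop_duplicates(subset=['col2', 'year']).set_index(['col2', 'year'])
# The trailing ":" makes it explicit that the tuple is a row key.
row = unique.loc[(4.0, 2012), :]   # a Series, because the key is unique
print(row)

# Option 2: keep the duplicates; sorting the index avoids slow lookups.
dup = sd.set_index(['col2', 'year']).sort_index()
rows = dup.loc[(4.0, 2012), :]     # a DataFrame holding both matching rows
print(rows)
```

Whether dropping or keeping the duplicates is right depends on what the repeated rows mean in your data.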
For more details, see http://pandas.pydata.org/pandas-docs/stable/indexing.html#duplicate-data and http://pandas.pydata.org/pandas-docs/dev/generated/pandas.DataFrame.drop_duplicates.html