The following raises IndexError: index out of bounds:
import pandas as pd
from numpy import nan
df1 = pd.DataFrame({'Date': {0: '2016-10-11', 1: '2016-10-11', 2: '2016-10-11', 3: '2016-10-11', 4: '2016-10-11',5: '2016-10-11'}, 'Stock': {0: 'ABC', 1: 'ABC', 2: 'ABC', 3: 'ABC', 4: 'ABC', 5: 'XYZ'}, 'StartTime': {0: '08:00:00.241', 1: '08:00:00.243', 2: '12:34:23.563', 3: '08:14.05.908', 4: '18:54:50.100', 5: '10:08:36.657'}, 'EndTime': {0: nan,1: nan, 2: nan, 3: nan, 4: nan, 5: nan}})
df1.groupby(['Stock','EndTime']).head(1)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/users/.../egg_cache/p/pandas-0.16.2-py2.7-linux-x86_64.egg/pandas/core/groupby.py", line 994, in head
in_head = self._cumcount_array() < n
File "/users/.../egg_cache/p/pandas-0.16.2-py2.7-linux-x86_64.egg/pandas/core/groupby.py", line 1034, in _cumcount_array
arr = np.arange(self.grouper._max_groupsize, dtype='int64')
File "pandas/src/properties.pyx", line 34, in pandas.lib.cache_readonly.__get__ (pandas/lib.c:41917)
File "/users/.../egg_cache/p/pandas-0.16.2-py2.7-linux-x86_64.egg/pandas/core/groupby.py", line 1343, in _max_groupsize
if self.indices:
File "pandas/src/properties.pyx", line 34, in pandas.lib.cache_readonly.__get__ (pandas/lib.c:41917)
File "/users/.../egg_cache/p/pandas-0.16.2-py2.7-linux-x86_64.egg/pandas/core/groupby.py", line 1309, in indices
return _get_indices_dict(label_list, keys)
File "/users/.../egg_cache/p/pandas-0.16.2-py2.7-linux-x86_64.egg/pandas/core/groupby.py", line 3767, in _get_indices_dict
return lib.indices_fast(sorter, group_index, keys, sorted_labels)
File "pandas/lib.pyx", line 1385, in pandas.lib.indices_fast (pandas/lib.c:23875)
File "pandas/src/util.pxd", line 41, in util.get_value_at (pandas/lib.c:62901)
IndexError: index out of bounds
However, if I leave the all-NaN column out of the grouping keys, it works fine, as shown here:
df1.groupby(['Stock','Date']).head(1)
         Date  EndTime     StartTime  Stock
0  2016-10-11      NaN  08:00:00.241    ABC
5  2016-10-11      NaN  10:08:36.657    XYZ
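Any idea whether this is a pandas bug or whether I am missing something here? I have been reading this issue: https://github.com/pandas-dev/pandas/issues/11016
If it is a bug, is there a suggested workaround, assuming that getting rid of the all-NaN column is not an option?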
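One possible workaround, sketched here under a stated assumption: a sentinel value (`-1` below, chosen arbitrarily and assumed never to occur as a real `EndTime`) stands in for NaN in the grouping key only. The names `SENTINEL`, `end_key`, and `first_rows` are illustrative and not part of the original code. The `EndTime` column in `df1` itself is left untouched.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({
    'Date': ['2016-10-11'] * 6,
    'Stock': ['ABC', 'ABC', 'ABC', 'ABC', 'ABC', 'XYZ'],
    'StartTime': ['08:00:00.241', '08:00:00.243', '12:34:23.563',
                  '08:14.05.908', '18:54:50.100', '10:08:36.657'],
    'EndTime': [np.nan] * 6,
})

# Hypothetical sentinel: assumed never to appear as a real EndTime value.
SENTINEL = -1

# Fill NaN only in a temporary key Series; df1['EndTime'] is not modified.
end_key = df1['EndTime'].fillna(SENTINEL)

# Group by Stock and the filled key, then keep the first row of each group.
first_rows = df1.groupby([df1['Stock'], end_key]).head(1)
print(first_rows)
```

Because the grouping is done on a separate key Series, the rows returned still show NaN in `EndTime`. If `-1` could be a legitimate value in the real data, a different sentinel would be needed.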
Some more interesting observations:
df1 = pd.DataFrame({'Date': {0: '2016-10-11', 1: '2016-10-11', 2: '2016-10-11', 3: '2016-10-11', 4: '2016-10-11',5: '2016-10-11'}, 'Stock': {0: 'ABC', 1: 'ABC', 2: 'ABC', 3: 'ABC', 4: 'ABC', 5: 'XYZ'}, 'StartTime': {0: '08:00:00.241', 1: '08:00:00.243', 2: '12:34:23.563', 3: '08:14.05.908', 4: '18:54:50.100', 5: '10:08:36.657'}, 'EndTime': {0: nan,1: nan, 2: 1, 3: nan, 4: nan, 5: nan}})
print(df1)
         Date  EndTime     StartTime  Stock
0  2016-10-11      NaN  08:00:00.241    ABC
1  2016-10-11      NaN  08:00:00.243    ABC
2  2016-10-11        1  12:34:23.563    ABC
3  2016-10-11      NaN  08:14.05.908    ABC
4  2016-10-11      NaN  18:54:50.100    ABC
5  2016-10-11      NaN  10:08:36.657    XYZ
df1.groupby(['Stock','EndTime']).head(1)
         Date  EndTime     StartTime  Stock
0  2016-10-11      NaN  08:00:00.241    ABC
2  2016-10-11        1  12:34:23.563    ABC
The output above does not look right to me. Shouldn't it be:
         Date  EndTime     StartTime  Stock
0  2016-10-11      NaN  08:00:00.241    ABC
2  2016-10-11        1  12:34:23.563    ABC
5  2016-10-11      NaN  10:08:36.657    XYZ
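For context, pandas documents that rows whose group key is NaN are excluded from the groups by default. A small sketch on the same data, counting rows per group, is one way to check which keys actually form groups. The variable name `sizes` is illustrative. This does not explain why `head(1)` still returns row 0 on the older versions shown in this thread.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({
    'Date': ['2016-10-11'] * 6,
    'Stock': ['ABC', 'ABC', 'ABC', 'ABC', 'ABC', 'XYZ'],
    'StartTime': ['08:00:00.241', '08:00:00.243', '12:34:23.563',
                  '08:14.05.908', '18:54:50.100', '10:08:36.657'],
    'EndTime': [np.nan, np.nan, 1, np.nan, np.nan, np.nan],
})

# Count rows per (Stock, EndTime) group with the default NaN handling.
# Rows with NaN in EndTime do not belong to any group, so only the
# ('ABC', 1.0) key is counted.
sizes = df1.groupby(['Stock', 'EndTime']).size()
print(sizes)
```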
Now, for the following case:
df1 = pd.DataFrame({'Date': {0: '2016-10-11', 1: '2016-10-11', 2: '2016-10-11', 3: '2016-10-11', 4: '2016-10-11',5: '2016-10-11'}, 'Stock': {0: 'ABC', 1: 'ABC', 2: 'ABC', 3: 'ABC', 4: 'ABC', 5: 'XYZ'}, 'StartTime': {0: '08:00:00.241', 1: '08:00:00.243', 2: '12:34:23.563', 3: '08:14.05.908', 4: '18:54:50.100', 5: '10:08:36.657'}, 'EndTime': {0: nan,1: nan, 2: nan, 3: nan, 4: nan, 5: 1}})
print(df1)
         Date  EndTime     StartTime  Stock
0  2016-10-11      NaN  08:00:00.241    ABC
1  2016-10-11      NaN  08:00:00.243    ABC
2  2016-10-11      NaN  12:34:23.563    ABC
3  2016-10-11      NaN  08:14.05.908    ABC
4  2016-10-11      NaN  18:54:50.100    ABC
5  2016-10-11        1  10:08:36.657    XYZ
df1.groupby(['Stock','EndTime']).head(1)
         Date  EndTime     StartTime  Stock
0  2016-10-11      NaN  08:00:00.241    ABC
5  2016-10-11        1  10:08:36.657    XYZ
This one looks fine.
Answer 0 (score: 0)
@Rahul, this is the output of your code with Pandas 0.19.0:
In [5]: df1
Out[5]:
         Date  EndTime     StartTime  Stock
0  2016-10-11      NaN  08:00:00.241    ABC
1  2016-10-11      NaN  08:00:00.243    ABC
2  2016-10-11      NaN  12:34:23.563    ABC
3  2016-10-11      NaN  08:14.05.908    ABC
4  2016-10-11      NaN  18:54:50.100    ABC
5  2016-10-11      NaN  10:08:36.657    XYZ
In [6]: df1.groupby(['Stock','EndTime']).head(1)
Out[6]:
         Date  EndTime     StartTime  Stock
0  2016-10-11      NaN  08:00:00.241    ABC
In [7]: df1.groupby(['Stock','Date']).head(1)
Out[7]:
         Date  EndTime     StartTime  Stock
0  2016-10-11      NaN  08:00:00.241    ABC
5  2016-10-11      NaN  10:08:36.657    XYZ
In [8]: df1 = pd.DataFrame({'Date': {0: '2016-10-11', 1: '2016-10-11', 2: '2016-10-11', 3: '2016-10-11', 4: '2016-10-11',5: '2016-10-11'}, 'Stock': {0: 'ABC', 1: 'ABC', 2: 'ABC', 3: 'ABC', 4: 'ABC', 5: 'XYZ'}, 'StartTime': {0: '08:00:00.241', 1: '08:00:00.243', 2: '12:34:23.563', 3: '08:14.05.908', 4: '18:54:50.100', 5: '10:08:36.657'}, 'EndTime': {0: nan,1: nan, 2: 1, 3: nan, 4: nan, 5: nan}})
In [9]: df1.groupby(['Stock','EndTime']).head(1)
Out[9]:
         Date  EndTime     StartTime  Stock
0  2016-10-11      NaN  08:00:00.241    ABC
2  2016-10-11      1.0  12:34:23.563    ABC
In [10]: df1 = pd.DataFrame({'Date': {0: '2016-10-11', 1: '2016-10-11', 2: '2016-10-11', 3: '2016-10-11', 4: '2016-10-11',5: '2016-10-11'}, 'Stock': {0: 'ABC', 1: 'ABC', 2: 'ABC', 3: 'ABC', 4: 'ABC', 5: 'XYZ'}, 'StartTime': {0: '08:00:00.241', 1: '08:00:00.243', 2: '12:34:23.563', 3: '08:14.05.908', 4: '18:54:50.100', 5: '10:08:36.657'}, 'EndTime': {0: nan,1: nan, 2: nan, 3: nan, 4: nan, 5: 1}})
In [11]: df1.groupby(['Stock','EndTime']).head(1)
Out[11]:
         Date  EndTime     StartTime  Stock
0  2016-10-11      NaN  08:00:00.241    ABC
5  2016-10-11      1.0  10:08:36.657    XYZ
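On newer pandas releases (1.1 and later, which postdate both versions shown in this thread), `groupby` accepts a `dropna` parameter. A minimal sketch, assuming such a version is installed, that keeps NaN as its own group key so each (Stock, EndTime) combination contributes a first row. The variable name `first_rows` is illustrative.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({
    'Date': ['2016-10-11'] * 6,
    'Stock': ['ABC', 'ABC', 'ABC', 'ABC', 'ABC', 'XYZ'],
    'StartTime': ['08:00:00.241', '08:00:00.243', '12:34:23.563',
                  '08:14.05.908', '18:54:50.100', '10:08:36.657'],
    'EndTime': [np.nan, np.nan, 1, np.nan, np.nan, np.nan],
})

# dropna=False (assumes pandas >= 1.1) treats NaN as a regular group key,
# so (ABC, NaN), (ABC, 1.0) and (XYZ, NaN) are three separate groups.
first_rows = df1.groupby(['Stock', 'EndTime'], dropna=False).head(1)
print(first_rows)
```

On pandas versions older than 1.1 the `dropna` argument is not available, so this call would fail there.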