我有一个数据框,我已将日期转换为月份,将数字更改为字符串,即4月,5月。然后我按月分组以获得较小的数据集。然而,这几个月的顺序是错误的,我想让他们从4月到3月。
我想按列排序,但似乎在我对其进行分组之后月份不再是列,所以我认为它是一个索引。
这是我的代码:
aDF = pd.DataFrame(e)
rgDF = aDF[['GrantRefNumber','Call','FirstReceivedDate','TotalGrantValue']]
rgaDF = rgDF.drop('GrantRefNumber',1)
rgbDF = rgaDF.drop('Call',1)
rgbDF['month'] = pd.DatetimeIndex(rgbDF['FirstReceivedDate']).month
rgbDF['month'] = rgbDF['month'].apply(lambda x: calendar.month_abbr[x])
gA = rgbDF.groupby('month').agg({'TotalGrantValue':sum, 'FirstReceivedDate':'count'}).rename(columns={'FirstReceivedDate':'Count'})
gA['TotalGrantValue'] = gA['TotalGrantValue'].map('{:,.2f}'.format)
产生这个:
Count TotalGrantValue
month
Apr 29 14,039,166.51
Aug 32 15,340,273.93
Dec 28 14,801,964.91
Feb 28 15,946,952.06
Jan 33 17,820,324.72
Jul 31 16,870,364.57
Jun 37 18,472,945.88
Mar 39 22,387,224.85
May 24 11,517,789.48
Nov 31 16,761,506.64
Oct 28 12,965,535.58
Sep 34 16,752,631.34
我正在尝试这个,但它不起作用(抛出错误):
gA.sort_index(by=['Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec', 'Jan', 'Feb', 'Mar'])
gA
完全错误:
KeyErrorTraceback (most recent call last)
<ipython-input-43-66c4870a7499> in <module>()
9
10 vals = ['Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec', 'Jan', 'Feb', 'Mar']
---> 11 gA = gA.set_index('month').reindex(a).reset_index()
C:\Anaconda\lib\site-packages\pandas\core\frame.pyc in set_index(self, keys, drop, append, inplace, verify_integrity)
2926 names.append(None)
2927 else:
-> 2928 level = frame[col]._values
2929 names.append(col)
2930 if drop:
C:\Anaconda\lib\site-packages\pandas\core\frame.pyc in __getitem__(self, key)
2060 return self._getitem_multilevel(key)
2061 else:
-> 2062 return self._getitem_column(key)
2063
2064 def _getitem_column(self, key):
C:\Anaconda\lib\site-packages\pandas\core\frame.pyc in _getitem_column(self, key)
2067 # get column
2068 if self.columns.is_unique:
-> 2069 return self._get_item_cache(key)
2070
2071 # duplicate columns & possible reduce dimensionality
C:\Anaconda\lib\site-packages\pandas\core\generic.pyc in _get_item_cache(self, item)
1532 res = cache.get(item)
1533 if res is None:
-> 1534 values = self._data.get(item)
1535 res = self._box_item_values(item, values)
1536 cache[item] = res
C:\Anaconda\lib\site-packages\pandas\core\internals.pyc in get(self, item, fastpath)
3588
3589 if not isnull(item):
-> 3590 loc = self.items.get_loc(item)
3591 else:
3592 indexer = np.arange(len(self.items))[isnull(self.items)]
C:\Anaconda\lib\site-packages\pandas\core\indexes\base.pyc in get_loc(self, key, method, tolerance)
2393 return self._engine.get_loc(key)
2394 except KeyError:
-> 2395 return self._engine.get_loc(self._maybe_cast_indexer(key))
2396
2397 indexer = self.get_indexer([key], method=method, tolerance=tolerance)
pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc (pandas\_libs\index.c:5239)()
pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc (pandas\_libs\index.c:5085)()
pandas\_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item (pandas\_libs\hashtable.c:20405)()
pandas\_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item (pandas\_libs\hashtable.c:20359)()
KeyError: 'month'
答案 0 :(得分:1)
我认为您需要strftime
才能转换为months
:
What is the difference between size and count in pandas?
#sample data
rng = pd.date_range('2017-04-03', periods=20, freq='20d')
aDF = pd.DataFrame({'FirstReceivedDate': rng, 'TotalGrantValue': range(20)})
print (aDF)
FirstReceivedDate TotalGrantValue
0 2017-04-03 0
1 2017-04-23 1
2 2017-05-13 2
3 2017-06-02 3
4 2017-06-22 4
5 2017-07-12 5
6 2017-08-01 6
7 2017-08-21 7
8 2017-09-10 8
9 2017-09-30 9
10 2017-10-20 10
11 2017-11-09 11
12 2017-11-29 12
13 2017-12-19 13
14 2018-01-08 14
15 2018-01-28 15
16 2018-02-17 16
17 2018-03-09 17
18 2018-03-29 18
19 2018-04-18 19
#filter only data what need, so drop is not necessary
rgbDF = aDF[['FirstReceivedDate','TotalGrantValue']]
rgbDF['month'] = pd.to_datetime(rgbDF['FirstReceivedDate']).dt.strftime('%b')
gA = rgbDF.groupby('month') \
.agg({'TotalGrantValue':'sum', 'FirstReceivedDate':'count'}) \
.rename(columns={'FirstReceivedDate':'Count'})
gA['TotalGrantValue'] = gA['TotalGrantValue'].map('{:,.2f}'.format)
print (gA)
TotalGrantValue Count
month
Apr 20.00 3
Aug 38.00 3
Dec 13.00 1
Feb 16.00 1
Jan 29.00 2
Jul 52.00 3
Jun 29.00 3
Mar 35.00 2
May 43.00 3
Nov 52.00 3
Oct 38.00 2
Sep 70.00 4
months
vals = ['Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec', 'Jan', 'Feb', 'Mar']
gA = gA.reindex(vals)
print (gA)
TotalGrantValue Count
month
Apr 20.00 3
May 2.00 1
Jun 7.00 2
Jul 5.00 1
Aug 13.00 2
Sep 17.00 2
Oct 10.00 1
Nov 23.00 2
Dec 13.00 1
Jan 29.00 2
Feb 16.00 1
Mar 35.00 2
之后可以使用groupby
自定义列表:
groupby
另一个解决方案是在列表中使用reindex
和catagories,因此在rgbDF = aDF[['FirstReceivedDate','TotalGrantValue']]
vals = ['Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec', 'Jan', 'Feb', 'Mar']
rgbDF['month'] = pd.Categorical(rgbDF['FirstReceivedDate'].dt.strftime('%b'),
categories=vals, ordered=True)
gA = rgbDF.groupby('month')
.agg({'TotalGrantValue':'sum', 'FirstReceivedDate':'count'})
.rename(columns={'FirstReceivedDate':'Count'})
gA['TotalGrantValue'] = gA['TotalGrantValue'].map('{:,.2f}'.format)
print (gA)
TotalGrantValue Count
month
Apr 20.00 3
May 2.00 1
Jun 7.00 2
Jul 5.00 1
Aug 13.00 2
Sep 17.00 2
Oct 10.00 1
Nov 23.00 2
Dec 13.00 1
Jan 29.00 2
Feb 16.00 1
Mar 35.00 2
输出后根据需要进行排序:
with
答案 1 :(得分:1)
Jezrael的答案中的一个小改动将适用于您的情况:
gA = rgbDF.groupby('month').agg({'TotalGrantValue':sum, 'FirstReceivedDate':'count'}).rename(columns={'FirstReceivedDate':'Count'})
gA['TotalGrantValue'] = gA['TotalGrantValue'].map('{:,.2f}'.format)
vals = ['Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec', 'Jan', 'Feb', 'Mar']
gA = gA.set_index(gA.index).reindex(vals).reset_index()