熊猫 - 索引上的数据帧排序(带字)

时间:2017-07-18 11:08:01

标签: python pandas

我有一个数据框,我已将日期转换为月份,将数字更改为字符串,即4月,5月。然后我按月分组以获得较小的数据集。然而,这几个月的顺序是错误的,我想让他们从4月到3月。

我想按列排序,但似乎在我对其进行分组之后月份不再是列,所以我认为它是一个索引。

这是我的代码:

aDF = pd.DataFrame(e)
rgDF = aDF[['GrantRefNumber','Call','FirstReceivedDate','TotalGrantValue']]
rgaDF = rgDF.drop('GrantRefNumber',1)
rgbDF = rgaDF.drop('Call',1)

rgbDF['month'] = pd.DatetimeIndex(rgbDF['FirstReceivedDate']).month
rgbDF['month'] = rgbDF['month'].apply(lambda x: calendar.month_abbr[x])

gA = rgbDF.groupby('month').agg({'TotalGrantValue':sum, 'FirstReceivedDate':'count'}).rename(columns={'FirstReceivedDate':'Count'})
gA['TotalGrantValue'] = gA['TotalGrantValue'].map('{:,.2f}'.format)

产生这个:

Count   TotalGrantValue
        month       
        Apr 29  14,039,166.51
        Aug 32  15,340,273.93
        Dec 28  14,801,964.91
        Feb 28  15,946,952.06
        Jan 33  17,820,324.72
        Jul 31  16,870,364.57
        Jun 37  18,472,945.88
        Mar 39  22,387,224.85
        May 24  11,517,789.48
        Nov 31  16,761,506.64
        Oct 28  12,965,535.58
        Sep 34  16,752,631.34

我正在尝试这个,但它不起作用(抛出错误):

gA.sort_index(by=['Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec', 'Jan', 'Feb', 'Mar'])
gA

完全错误:

KeyErrorTraceback (most recent call last)
<ipython-input-43-66c4870a7499> in <module>()
      9 
     10 vals = ['Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec', 'Jan', 'Feb', 'Mar']
---> 11 gA = gA.set_index('month').reindex(a).reset_index()

C:\Anaconda\lib\site-packages\pandas\core\frame.pyc in set_index(self, keys, drop, append, inplace, verify_integrity)
   2926                 names.append(None)
   2927             else:
-> 2928                 level = frame[col]._values
   2929                 names.append(col)
   2930                 if drop:

C:\Anaconda\lib\site-packages\pandas\core\frame.pyc in __getitem__(self, key)
   2060             return self._getitem_multilevel(key)
   2061         else:
-> 2062             return self._getitem_column(key)
   2063 
   2064     def _getitem_column(self, key):

C:\Anaconda\lib\site-packages\pandas\core\frame.pyc in _getitem_column(self, key)
   2067         # get column
   2068         if self.columns.is_unique:
-> 2069             return self._get_item_cache(key)
   2070 
   2071         # duplicate columns & possible reduce dimensionality

C:\Anaconda\lib\site-packages\pandas\core\generic.pyc in _get_item_cache(self, item)
   1532         res = cache.get(item)
   1533         if res is None:
-> 1534             values = self._data.get(item)
   1535             res = self._box_item_values(item, values)
   1536             cache[item] = res

C:\Anaconda\lib\site-packages\pandas\core\internals.pyc in get(self, item, fastpath)
   3588 
   3589             if not isnull(item):
-> 3590                 loc = self.items.get_loc(item)
   3591             else:
   3592                 indexer = np.arange(len(self.items))[isnull(self.items)]

C:\Anaconda\lib\site-packages\pandas\core\indexes\base.pyc in get_loc(self, key, method, tolerance)
   2393                 return self._engine.get_loc(key)
   2394             except KeyError:
-> 2395                 return self._engine.get_loc(self._maybe_cast_indexer(key))
   2396 
   2397         indexer = self.get_indexer([key], method=method, tolerance=tolerance)

pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc (pandas\_libs\index.c:5239)()

pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc (pandas\_libs\index.c:5085)()

pandas\_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item (pandas\_libs\hashtable.c:20405)()

pandas\_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item (pandas\_libs\hashtable.c:20359)()

KeyError: 'month'

2 个答案:

答案 0 :(得分:1)

我认为您需要strftime才能转换为months

What is the difference between size and count in pandas?

#sample data
rng = pd.date_range('2017-04-03', periods=20, freq='20d')
aDF = pd.DataFrame({'FirstReceivedDate': rng, 'TotalGrantValue': range(20)})  
print (aDF)
   FirstReceivedDate  TotalGrantValue
0         2017-04-03                0
1         2017-04-23                1
2         2017-05-13                2
3         2017-06-02                3
4         2017-06-22                4
5         2017-07-12                5
6         2017-08-01                6
7         2017-08-21                7
8         2017-09-10                8
9         2017-09-30                9
10        2017-10-20               10
11        2017-11-09               11
12        2017-11-29               12
13        2017-12-19               13
14        2018-01-08               14
15        2018-01-28               15
16        2018-02-17               16
17        2018-03-09               17
18        2018-03-29               18
19        2018-04-18               19
#filter only data what need, so drop is not necessary     
rgbDF = aDF[['FirstReceivedDate','TotalGrantValue']]

rgbDF['month'] = pd.to_datetime(rgbDF['FirstReceivedDate']).dt.strftime('%b')

gA = rgbDF.groupby('month') \
          .agg({'TotalGrantValue':'sum', 'FirstReceivedDate':'count'}) \
          .rename(columns={'FirstReceivedDate':'Count'})

gA['TotalGrantValue'] = gA['TotalGrantValue'].map('{:,.2f}'.format)
print (gA)
      TotalGrantValue  Count
month                       
Apr             20.00      3
Aug             38.00      3
Dec             13.00      1
Feb             16.00      1
Jan             29.00      2
Jul             52.00      3
Jun             29.00      3
Mar             35.00      2
May             43.00      3
Nov             52.00      3
Oct             38.00      2
Sep             70.00      4

months vals = ['Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec', 'Jan', 'Feb', 'Mar'] gA = gA.reindex(vals) print (gA) TotalGrantValue Count month Apr 20.00 3 May 2.00 1 Jun 7.00 2 Jul 5.00 1 Aug 13.00 2 Sep 17.00 2 Oct 10.00 1 Nov 23.00 2 Dec 13.00 1 Jan 29.00 2 Feb 16.00 1 Mar 35.00 2 之后可以使用groupby自定义列表:

groupby

另一个解决方案是在列表中使用reindex和catagories,因此在rgbDF = aDF[['FirstReceivedDate','TotalGrantValue']] vals = ['Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec', 'Jan', 'Feb', 'Mar'] rgbDF['month'] = pd.Categorical(rgbDF['FirstReceivedDate'].dt.strftime('%b'), categories=vals, ordered=True) gA = rgbDF.groupby('month') .agg({'TotalGrantValue':'sum', 'FirstReceivedDate':'count'}) .rename(columns={'FirstReceivedDate':'Count'}) gA['TotalGrantValue'] = gA['TotalGrantValue'].map('{:,.2f}'.format) print (gA) TotalGrantValue Count month Apr 20.00 3 May 2.00 1 Jun 7.00 2 Jul 5.00 1 Aug 13.00 2 Sep 17.00 2 Oct 10.00 1 Nov 23.00 2 Dec 13.00 1 Jan 29.00 2 Feb 16.00 1 Mar 35.00 2 输出后根据需要进行排序:

with

答案 1 :(得分:1)

Jezrael的答案中的一个小改动将适用于您的情况:

  gA = rgbDF.groupby('month').agg({'TotalGrantValue':sum, 'FirstReceivedDate':'count'}).rename(columns={'FirstReceivedDate':'Count'})
  gA['TotalGrantValue'] = gA['TotalGrantValue'].map('{:,.2f}'.format)
  vals = ['Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec', 'Jan', 'Feb', 'Mar']
  gA = gA.set_index(gA.index).reindex(vals).reset_index()