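Sorting a pandas Series by month and year?

Time: 2019-12-10 11:59:50

Tags: python pandas pandas-groupby

I'm very new to Python and pandas, and I'm struggling to get this to work.

I have a DataFrame containing the dates on which different events occurred. I want to group and sort the events by when they happened (month and year). Each event has a corresponding count. I want to sort the data chronologically and get the total number of events per month. The code I tried is shown below.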

df = df.sort_values(by='Date')
all_events_dates=df.groupby(df['Date'].dt.strftime('%B %Y'))['Count'].sum()
month_year = ['January 2013','February 2013','March 2013', 'April 2013','May 2013','June 2013','July 2013','August 2013','September 2013','October 2013','November 2013','December 2013','January 2014','February 2014','March 2014', 'April 2014','May 2014','June 2014','July 2014','August 2014','September 2014','October 2014','November 2014','December 2014','January 2015','February 2015','March 2015', 'April 2015','May 2015','June 2015','July 2015','August 2015','September 2015','October 2015','November 2015','December 2015']
all_events_dates.index = pd.CategoricalIndex(set(df['Date'].dt.strftime('%B %Y')), categories=month_year, ordered=True)
all_events_dates = all_events_dates.sort_index()
print(all_events_dates)

The output is as follows (some months omitted):

January 2013       2
February 2013      6
March 2013        12
April 2013        11
May 2013           2
June 2013          4
July 2013          2
September 2013     1
...
August 2015        3
September 2015     8
October 2015       7
November 2015      4
December 2015      2

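These totals are wrong (the count for each month is incorrect). When I opened a new project and tried the same thing, this time sorting locations by time instead of events: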

df = df.sort_values(by='Date')
all_loc_dates=df.groupby(df['Date'].dt.strftime('%B %Y'))['Count'].sum()
month_year = ['January 2013','February 2013','March 2013', 'April 2013','May 2013','June 2013','July 2013','August 2013','September 2013','October 2013','November 2013','December 2013','January 2014','February 2014','March 2014', 'April 2014','May 2014','June 2014','July 2014','August 2014','September 2014','October 2014','November 2014','December 2014','January 2015','February 2015','March 2015', 'April 2015','May 2015','June 2015','July 2015','August 2015','September 2015','October 2015','November 2015','December 2015']
all_loc_dates.index = pd.CategoricalIndex(set(df['Date'].dt.strftime('%B %Y')), categories=month_year, ordered=True)
all_loc_dates = all_loc_dates.sort_index()
print(all_loc_dates)

This appears to be the same code, but I get the following output:

January 2013       9
February 2013      1
March 2013         7
April 2013         7
May 2013           4
June 2013          9
July 2013          4
September 2013    10
October 2013       1
November 2013      8
December 2013      5
...
June 2015          4
July 2015          2
August 2015       12
September 2015     4
October 2015       3
November 2015      2
December 2015      2

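The df is the same in both cases (just in different projects). I'm not sure why the counts for each month differ between the two runs, and why both are incorrect.

I found that the problem seems to be in the following lines: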

all_loc_dates.index = pd.CategoricalIndex(set(df['Date'].dt.strftime('%B %Y')), categories=month_year, ordered=True)
all_loc_dates = all_loc_dates.sort_index()

Leaving these lines out gives the correct values, but the ordering is wrong. The output is as follows:

April 2013         4
April 2014         4
April 2015         7
August 2014        5
August 2015        6
December 2013      2
December 2014      3
December 2015     10
February 2013      4
February 2014      4
February 2015      3
January 2013       7
January 2014       8
January 2015       5
...
September 2013     2
September 2014     1
September 2015     3

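These appear to be the correct values, but of course the ordering is now wrong (alphabetical rather than chronological). I don't know why pd.CategoricalIndex changes the values. How can I sort the Series by date without using it?

1 Answer:

Answer 0: (score: 0)

You need to convert the month-name strings into actual dates, for example 'January 2013' becomes 2013-01-01. Then sort on the dates, and afterwards convert them back to the 'month name year' format.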

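# Parse strings such as 'January 2013' into datetimes; '%B %Y' matches the strftime format used in the question
df['date value'] = pd.to_datetime(df['date value'], format='%B %Y')

# Sort chronologically
df = df.sort_values('date value', ascending=True)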
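
As for why the counts change: a likely cause is the set(...) call in the question. A set has no defined order, and assigning to .index is purely positional, so the month labels get attached to the group sums in an arbitrary order. That order can also differ from one Python session to the next, which would explain the two different outputs. Below is a minimal sketch of two ways around this. The sample data is made up for illustration, and it assumes the 'Date' column is already a datetime column, as in the question.

```python
import pandas as pd

# Hypothetical sample data standing in for the asker's df
df = pd.DataFrame({
    "Date": pd.to_datetime([
        "2013-04-03", "2013-01-15", "2013-01-20",
        "2014-01-05", "2013-12-25", "2013-04-28",
    ]),
    "Count": [4, 2, 5, 8, 2, 1],
})

# Option 1: group on a monthly Period instead of a string.
# Periods sort chronologically, so no manual reordering is needed.
monthly = df.groupby(df["Date"].dt.to_period("M"))["Count"].sum()

# Turn the labels into "Month Year" strings only at the end, for display
monthly.index = monthly.index.strftime("%B %Y")
print(monthly)

# Option 2: keep the string labels from the question, but build the
# categorical index from the Series' own index (not from a set), so each
# label stays attached to its own sum.
by_label = df.groupby(df["Date"].dt.strftime("%B %Y"))["Count"].sum()
month_year = pd.period_range("2013-01", "2015-12", freq="M").strftime("%B %Y")
by_label.index = pd.CategoricalIndex(
    by_label.index, categories=month_year, ordered=True
)
by_label = by_label.sort_index()
print(by_label)
```

Option 1 is probably the simpler of the two, since it avoids maintaining a hand-written list of month names. Option 2 stays closer to the original code; note that any label missing from the categories list would become NaN.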