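The code below filters a list of dates to get the first day of each month. For some reason it leaves out the first month of every year: for example, it skips the date '2020-01-01 00:00:00' and goes straight to '2020-02-01 00:00:00'. How can I fix this?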
import numpy as np
import pandas as pd
from pandas import DataFrame
date_list = ['2019-09-01 00:00:00', '2019-10-01 00:00:00', '2019-11-01 00:00:00', '2019-11-05 00:00:00',
'2019-12-01 00:00:00', '2020-01-04 00:00:00', '2020-01-12 00:00:00','2020-01-01 00:00:00', '2020-02-01 00:00:00',
'2020-03-01 00:00:00', '2020-04-01 00:00:00','2020-04-02 00:00:00', '2020-05-01 00:00:00', '2020-05-20 00:00:00',
'2020-06-01 00:00:00', '2020-07-01 00:00:00','2020-07-03 00:00:00','2020-07-07 00:00:00', '2020-08-01 00:00:00',
'2020-09-01 00:00:00','2020-10-01 00:00:00', '2020-11-01 00:00:00', '2020-11-04 00:00:00','2020-11-06 00:00:00',
'2020-08-05 00:00:00','2020-12-01 00:00:00','2021-01-01 00:00:00','2021-02-01 00:00:00', '2021-03-01 00:00:00',
'2021-04-01 00:00:00']
data = DataFrame (date_list,columns=['Data'])
datetime = pd.to_datetime(data['Data'])
monthly_changes = data.loc[np.where(datetime.dt.month.diff().gt(0))].index.tolist()
Output:
['2019-10-01 00:00:00' '2019-11-01 00:00:00' '2019-12-01 00:00:00'
'2020-02-01 00:00:00' '2020-03-01 00:00:00' '2020-04-01 00:00:00'
'2020-05-01 00:00:00' '2020-06-01 00:00:00' '2020-07-01 00:00:00'
'2020-08-01 00:00:00' '2020-09-01 00:00:00' '2020-10-01 00:00:00'
'2020-11-01 00:00:00' '2020-12-01 00:00:00' '2021-02-01 00:00:00'
'2021-03-01 00:00:00' '2021-04-01 00:00:00']
Expected output:
'2019-09-01 00:00:00', '2019-10-01 00:00:00', '2019-11-01 00:00:00',
'2019-12-01 00:00:00', '2020-01-01 00:00:00', '2020-02-01 00:00:00',
'2020-03-01 00:00:00', '2020-04-01 00:00:00', '2020-05-01 00:00:00',
'2020-06-01 00:00:00', '2020-07-01 00:00:00', '2020-08-01 00:00:00',
'2020-09-01 00:00:00','2020-10-01 00:00:00', '2020-11-01 00:00:00',
'2020-12-01 00:00:00','2021-01-01 00:00:00','2021-02-01 00:00:00', '2021-03-01 00:00:00',
'2021-04-01 00:00:00'
Answer 0 (score: 1)
It seems easier to simply check whether day is 1 (the first of the month):
monthly_changes = data.loc[datetime.dt.day == 1, 'Data'].tolist()
monthly_changes:
['2019-09-01 00:00:00', '2019-10-01 00:00:00', '2019-11-01 00:00:00',
'2019-12-01 00:00:00', '2020-01-01 00:00:00', '2020-02-01 00:00:00',
'2020-03-01 00:00:00', '2020-04-01 00:00:00', '2020-05-01 00:00:00',
'2020-06-01 00:00:00', '2020-07-01 00:00:00', '2020-08-01 00:00:00',
'2020-09-01 00:00:00', '2020-10-01 00:00:00', '2020-11-01 00:00:00',
'2020-12-01 00:00:00', '2021-01-01 00:00:00', '2021-02-01 00:00:00',
'2021-03-01 00:00:00', '2021-04-01 00:00:00']
Edit: following the comments, to also check that the time is 00:00:00:
from datetime import time
# Parentheses are required: & binds more tightly than ==
monthly_changes = data.loc[
    (datetime.dt.day == 1) &
    datetime.dt.time.eq(time(hour=0, minute=0, second=0)),
    'Data'
].tolist()
monthly_changes:
['2019-09-01 00:00:00', '2019-10-01 00:00:00', '2019-11-01 00:00:00',
'2019-12-01 00:00:00', '2020-01-01 00:00:00', '2020-02-01 00:00:00',
'2020-03-01 00:00:00', '2020-04-01 00:00:00', '2020-05-01 00:00:00',
'2020-06-01 00:00:00', '2020-07-01 00:00:00', '2020-08-01 00:00:00',
'2020-09-01 00:00:00', '2020-10-01 00:00:00', '2020-11-01 00:00:00',
'2020-12-01 00:00:00', '2021-01-01 00:00:00', '2021-02-01 00:00:00',
'2021-03-01 00:00:00', '2021-04-01 00:00:00']
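As a possible alternative to the two checks above, here is a minimal sketch that uses the `dt.is_month_start` and `dt.normalize()` accessors from pandas. The four-row sample and the names `parsed`, `is_first_day`, `is_midnight` and `first_of_month` are made up for illustration and are not part of the original question.

```python
import pandas as pd

# Hypothetical four-row sample in the same format as the question's data.
data = pd.DataFrame(
    {"Data": ["2019-12-01 00:00:00", "2020-01-04 00:00:00",
              "2020-01-01 00:00:00", "2020-02-01 00:00:00"]}
)
parsed = pd.to_datetime(data["Data"])

# True where the date falls on the first day of its month.
is_first_day = parsed.dt.is_month_start
# True where the timestamp is exactly midnight (normalize() zeroes the time part).
is_midnight = parsed == parsed.dt.normalize()

first_of_month = data.loc[is_first_day & is_midnight, "Data"].tolist()
print(first_of_month)
```

Because this tests each row on its own, it does not depend on the order of the rows, which matters here since the list in the question is not sorted.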
Why doesn't the original approach work?
Look at the intermediate steps:
datetime = pd.to_datetime(data['Data'])
data['month'] = datetime.dt.month
data['diff'] = datetime.dt.month.diff()
                  Data  month  diff
0  2019-09-01 00:00:00      9   NaN
1  2019-10-01 00:00:00     10   1.0
2  2019-11-01 00:00:00     11   1.0
3  2019-11-05 00:00:00     11   0.0
4  2019-12-01 00:00:00     12   1.0
5  2020-01-04 00:00:00      1 -11.0  # 1 - 12 = -11, which is not > 0
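The year rollover can be reproduced in isolation. The sketch below uses a made-up, already sorted four-date series (the names `dates`, `month_diff`, `flag_gt` and `flag_ne` are illustrative) to compare `gt(0)` with `ne(0)` on the month difference.

```python
import pandas as pd

# Hypothetical sorted sample that crosses a year boundary.
dates = pd.to_datetime(
    pd.Series(["2019-11-01", "2019-12-01", "2020-01-01", "2020-02-01"])
)

month_diff = dates.dt.month.diff()  # NaN, 1.0, -11.0, 1.0

# gt(0) misses the first row (NaN) and the December -> January step (-11).
flag_gt = month_diff.gt(0)
# ne(0) keeps both, because NaN != 0 and -11 != 0 evaluate to True.
flag_ne = month_diff.ne(0)

print(flag_gt.tolist())
print(flag_ne.tolist())
```

Note that `ne(0)` only detects a change in the month number, so it assumes the rows are sorted and that no two consecutive rows share the same month in different years.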
Answer 1 (score: 0)
I would advise against using datetime as the name of your Series, because it clashes with a very common import:
from datetime import datetime
Anyway, regarding your question:
monthly_changes = data.loc[(datetime.dt.month!=datetime.shift(1).dt.month)].index.tolist()
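The explanation is that shift(1) moves the values down by one row, so each row is compared with the row before it, and you get True at every index where the month differs from the previous row's month.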
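One caveat with comparing against the previous row is that it depends on row order, and the list in the question is not sorted (for example, '2020-01-04' appears before '2020-01-01'). A minimal sketch of a more defensive variant follows, assuming it is acceptable to sort first. The sample data and the names `parsed`, `key`, `changed` and `first_rows` are hypothetical.

```python
import pandas as pd

# Hypothetical unsorted sample in the same format as the question's data.
data = pd.DataFrame(
    {"Data": ["2019-12-01 00:00:00", "2020-01-04 00:00:00",
              "2020-01-01 00:00:00", "2020-02-01 00:00:00"]}
)

# Sort the parsed dates; the original index labels are preserved.
parsed = pd.to_datetime(data["Data"]).sort_values()

# Combine year and month into one number so repeated months in
# different years are still seen as a change.
key = parsed.dt.year * 12 + parsed.dt.month

# True where the year-month differs from the previous (sorted) row.
changed = key.ne(key.shift(1))

# Look up the original strings by the index labels of the changed rows.
first_rows = data.loc[changed[changed].index, "Data"].tolist()
print(first_rows)
```

This returns the earliest row of each month that is present, which is the first day only if a row for that day exists in the data.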