Filter a pandas DataFrame by month in Python

Date: 2021-07-04 21:50:28

Tags: python-3.x pandas dataframe numpy date

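The code below filters the dates to get the first day of each month. For some reason, though, it leaves out the first month of each year: for example, it skips the date '2020-01-01 00:00:00' and goes straight to '2020-02-01 00:00:00'. How can I fix this?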

import numpy as np
import pandas as pd
from pandas import DataFrame

date_list = ['2019-09-01 00:00:00', '2019-10-01 00:00:00', '2019-11-01 00:00:00', '2019-11-05 00:00:00',
 '2019-12-01 00:00:00', '2020-01-04 00:00:00', '2020-01-12 00:00:00','2020-01-01 00:00:00', '2020-02-01 00:00:00', 
 '2020-03-01 00:00:00', '2020-04-01 00:00:00','2020-04-02 00:00:00', '2020-05-01 00:00:00', '2020-05-20 00:00:00',
 '2020-06-01 00:00:00', '2020-07-01 00:00:00','2020-07-03 00:00:00','2020-07-07 00:00:00', '2020-08-01 00:00:00',
 '2020-09-01 00:00:00','2020-10-01 00:00:00', '2020-11-01 00:00:00', '2020-11-04 00:00:00','2020-11-06 00:00:00',
 '2020-08-05 00:00:00','2020-12-01 00:00:00','2021-01-01 00:00:00','2021-02-01 00:00:00', '2021-03-01 00:00:00', 
 '2021-04-01 00:00:00']

data = DataFrame (date_list,columns=['Data'])
datetime = pd.to_datetime(data['Data'])

monthly_changes = data.loc[np.where(datetime.dt.month.diff().gt(0))].index.tolist()

Output:

['2019-10-01 00:00:00' '2019-11-01 00:00:00' '2019-12-01 00:00:00'
 '2020-02-01 00:00:00' '2020-03-01 00:00:00' '2020-04-01 00:00:00'
 '2020-05-01 00:00:00' '2020-06-01 00:00:00' '2020-07-01 00:00:00'
 '2020-08-01 00:00:00' '2020-09-01 00:00:00' '2020-10-01 00:00:00'
 '2020-11-01 00:00:00' '2020-12-01 00:00:00' '2021-02-01 00:00:00'
 '2021-03-01 00:00:00' '2021-04-01 00:00:00']

Expected output:

'2019-09-01 00:00:00', '2019-10-01 00:00:00', '2019-11-01 00:00:00',
 '2019-12-01 00:00:00', '2020-01-01 00:00:00', '2020-02-01 00:00:00', 
 '2020-03-01 00:00:00', '2020-04-01 00:00:00', '2020-05-01 00:00:00', 
 '2020-06-01 00:00:00', '2020-07-01 00:00:00', '2020-08-01 00:00:00',
 '2020-09-01 00:00:00','2020-10-01 00:00:00', '2020-11-01 00:00:00', 
 '2020-12-01 00:00:00','2021-01-01 00:00:00','2021-02-01 00:00:00', '2021-03-01 00:00:00', 
 '2021-04-01 00:00:00'

2 Answers:

Answer 0 (score: 1)

It seems easier to just check whether the day is 1 (the first of the month):

monthly_changes = data.loc[datetime.dt.day == 1, 'Data'].tolist()

monthly_changes

['2019-09-01 00:00:00', '2019-10-01 00:00:00', '2019-11-01 00:00:00',
 '2019-12-01 00:00:00', '2020-01-01 00:00:00', '2020-02-01 00:00:00',
 '2020-03-01 00:00:00', '2020-04-01 00:00:00', '2020-05-01 00:00:00',
 '2020-06-01 00:00:00', '2020-07-01 00:00:00', '2020-08-01 00:00:00',
 '2020-09-01 00:00:00', '2020-10-01 00:00:00', '2020-11-01 00:00:00',
 '2020-12-01 00:00:00', '2021-01-01 00:00:00', '2021-02-01 00:00:00',
 '2021-03-01 00:00:00', '2021-04-01 00:00:00']
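
As a side note, pandas also exposes a `dt.is_month_start` accessor that expresses the same check. The following is a minimal sketch on a small made-up sample (the names `sample`, `parsed` and `first_days` are illustrative and not from the question):

```python
import pandas as pd

# Hypothetical, shortened sample; not the full list from the question.
sample = pd.DataFrame({'Data': ['2019-12-01 00:00:00', '2020-01-04 00:00:00',
                                '2020-01-01 00:00:00', '2020-02-01 00:00:00']})
parsed = pd.to_datetime(sample['Data'])

# is_month_start is True for every timestamp that falls on day 1 of its month.
first_days = sample.loc[parsed.dt.is_month_start, 'Data'].tolist()
print(first_days)
# ['2019-12-01 00:00:00', '2020-01-01 00:00:00', '2020-02-01 00:00:00']
```

Like `dt.day == 1`, this does not depend on row order, so the unsorted '2020-01-01 00:00:00' row is still picked up.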

Edit: per the comments, to also test whether the time is 00:00:00:

from datetime import time

# Parenthesize the comparison: & binds more tightly than ==
monthly_changes = data.loc[
    (datetime.dt.day == 1) &
    datetime.dt.time.eq(time(hour=0, minute=0, second=0)),
    'Data'
].tolist()

monthly_changes

['2019-09-01 00:00:00', '2019-10-01 00:00:00', '2019-11-01 00:00:00',
 '2019-12-01 00:00:00', '2020-01-01 00:00:00', '2020-02-01 00:00:00',
 '2020-03-01 00:00:00', '2020-04-01 00:00:00', '2020-05-01 00:00:00',
 '2020-06-01 00:00:00', '2020-07-01 00:00:00', '2020-08-01 00:00:00',
 '2020-09-01 00:00:00', '2020-10-01 00:00:00', '2020-11-01 00:00:00',
 '2020-12-01 00:00:00', '2021-01-01 00:00:00', '2021-02-01 00:00:00',
 '2021-03-01 00:00:00', '2021-04-01 00:00:00']
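
One possible alternative for the midnight test, sketched here on made-up data, is to compare each timestamp with its own `dt.normalize()` value (the same date with the time reset to 00:00:00). The sample values and variable names below are assumptions for illustration only:

```python
import pandas as pd

# Hypothetical sample that includes a first-of-month row with a non-midnight time.
sample = pd.Series(pd.to_datetime(['2020-01-01 00:00:00', '2020-01-01 08:30:00',
                                   '2020-01-02 00:00:00', '2020-02-01 00:00:00']))

is_first_day = sample.dt.day == 1
# normalize() resets the time to midnight, so equality means the row is already at 00:00:00.
is_midnight = sample == sample.dt.normalize()

# Each condition is built separately, so no extra parentheses are needed around ==.
mask = is_first_day & is_midnight
result = sample[mask].dt.strftime('%Y-%m-%d %H:%M:%S').tolist()
print(result)
# ['2020-01-01 00:00:00', '2020-02-01 00:00:00']
```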

Why doesn't the original approach work?

Look at the intermediate steps:

datetime = pd.to_datetime(data['Data'])
data['month'] = datetime.dt.month
data['diff'] = datetime.dt.month.diff()
                   Data  month  diff
0   2019-09-01 00:00:00      9   NaN
1   2019-10-01 00:00:00     10   1.0
2   2019-11-01 00:00:00     11   1.0
3   2019-11-05 00:00:00     11   0.0
4   2019-12-01 00:00:00     12   1.0
5   2020-01-04 00:00:00      1 -11.0  # 1 - 12 = -11, which is not > 0
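
If a diff-based approach is still preferred, one way to avoid the December-to-January wrap-around is to diff a counter that combines year and month. This is only a sketch under the assumption that the dates are sorted in ascending order (the question's list is not); `sample`, `month_index` and `changed` are illustrative names:

```python
import pandas as pd

# Hypothetical sorted sample spanning a year boundary.
sample = pd.Series(pd.to_datetime(['2019-11-01', '2019-11-05', '2019-12-01',
                                   '2020-01-01', '2020-01-04', '2020-02-01']))

# The plain month diff goes negative at the year boundary (1 - 12 = -11).
month_diff = sample.dt.month.diff()

# year * 12 + month keeps increasing across years, so the wrap-around disappears.
month_index = sample.dt.year * 12 + sample.dt.month

# ne(0) instead of gt(0): the first row's diff is NaN, and NaN != 0 is True,
# so the very first row is kept as well.
changed = month_index.diff().ne(0)
print(sample[changed].dt.strftime('%Y-%m-%d').tolist())
# ['2019-11-01', '2019-12-01', '2020-01-01', '2020-02-01']
```

On unsorted input this marks the first row that appears for each run of a month, which is not necessarily the first day of that month.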

Answer 1 (score: 0)

I'd suggest not using datetime as the name of your Series, because it clashes with a very common import:

from datetime import datetime

Anyway, regarding your question:

monthly_changes = data.loc[(datetime.dt.month!=datetime.shift(1).dt.month)].index.tolist()

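The explanation is that shift moves the rows down by one position, so each row is compared with the previous row, and you get True at every index where the month differs from the previous row's month.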
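
Note that a shift-based comparison depends on row order, and the question's list is not fully sorted ('2020-01-01 00:00:00' comes after '2020-01-12 00:00:00'). A hedged sketch of one way to handle that is to sort first and compare year-month periods rather than bare month numbers. The sample and the names `parsed`, `period` and `is_new_month` are assumptions for illustration:

```python
import pandas as pd

# Hypothetical unsorted sample mirroring the ordering problem in the question.
sample = pd.DataFrame({'Data': ['2019-12-01 00:00:00', '2020-01-04 00:00:00',
                                '2020-01-12 00:00:00', '2020-01-01 00:00:00',
                                '2020-02-01 00:00:00']})

# Sort the parsed dates; the original index labels are kept.
parsed = pd.to_datetime(sample['Data']).sort_values()

# A monthly period carries both year and month, so 2019-12 and 2020-12 differ.
period = parsed.dt.to_period('M')

# The first row compares against NaT, which counts as "different".
is_new_month = period != period.shift(1)

# Map the True rows back to the original frame through their index labels.
first_rows = sample.loc[is_new_month[is_new_month].index, 'Data'].tolist()
print(first_rows)
# ['2019-12-01 00:00:00', '2020-01-01 00:00:00', '2020-02-01 00:00:00']
```

This returns the earliest row present for each month, which matches the first day of the month only when that day actually occurs in the data.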