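How to display a subset of a pandas DataFrame?

Time: 2019-08-16 14:09:27

Tags: python pandas datetime

I have a DataFrame df that contains a datetime for every hour of every day between 2003-02-12 and 2017-06-30, and I want to remove all datetimes that fall between December 24 and January 1 of each year. An extract of my DataFrame is: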

...
7505,2003-12-23 17:00:00
7506,2003-12-23 18:00:00
7507,2003-12-23 19:00:00
7508,2003-12-23 20:00:00
7509,2003-12-23 21:00:00
7510,2003-12-23 22:00:00
7511,2003-12-23 23:00:00
7512,2003-12-24 00:00:00
7513,2003-12-24 01:00:00
7514,2003-12-24 02:00:00
7515,2003-12-24 03:00:00
7516,2003-12-24 04:00:00
7517,2003-12-24 05:00:00
7518,2003-12-24 06:00:00
...
7723,2004-01-01 19:00:00
7724,2004-01-01 20:00:00
7725,2004-01-01 21:00:00
7726,2004-01-01 22:00:00
7727,2004-01-01 23:00:00
7728,2004-01-02 00:00:00
7729,2004-01-02 01:00:00
7730,2004-01-02 02:00:00
7731,2004-01-02 03:00:00
7732,2004-01-02 04:00:00
7733,2004-01-02 05:00:00
7734,2004-01-02 06:00:00
7735,2004-01-02 07:00:00
...

My expected output is:

...
7505,2003-12-23 17:00:00
7506,2003-12-23 18:00:00
7507,2003-12-23 19:00:00
7508,2003-12-23 20:00:00
7509,2003-12-23 21:00:00
7510,2003-12-23 22:00:00
7511,2003-12-23 23:00:00
...
7728,2004-01-02 00:00:00
7729,2004-01-02 01:00:00
7730,2004-01-02 02:00:00
7731,2004-01-02 03:00:00
7732,2004-01-02 04:00:00
7733,2004-01-02 05:00:00
7734,2004-01-02 06:00:00
7735,2004-01-02 07:00:00
...

5 answers:

Answer 0: (score: 1)

Sample DataFrame:

                dates
0 2003-12-23 23:00:00
1 2003-12-24 05:00:00
2 2004-12-27 05:00:00
3 2003-12-13 23:00:00
4 2002-12-23 23:00:00
5 2004-01-01 05:00:00
6 2014-12-24 05:00:00


Solution:

To exclude this date range for every year, first extract the month and the day:

df['month'] = df['dates'].dt.month
df['day'] = df['dates'].dt.day

Now apply the condition:

dec_days = [24, 25, 26, 27, 28, 29, 30, 31]  
## if the month is dec, then check for these dates 
## if the month is jan, then just check for the day to be 1 like below
df = df[~(((df.month == 12) & (df.day.isin(dec_days))) | ((df.month == 1) & (df.day == 1)))]

Sample output:

                dates  month  day
0 2003-12-23 23:00:00     12   23
3 2003-12-13 23:00:00     12   13
4 2002-12-23 23:00:00     12   23
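The same mask can also be built without adding helper columns to the frame. A minimal sketch, assuming the datetime column is named `dates` as in the sample frame above; the function name `drop_holiday_window` is made up for illustration:

```python
import pandas as pd

def drop_holiday_window(df, col="dates"):
    """Return the rows whose date falls outside Dec 24 - Jan 1, for any year."""
    month = df[col].dt.month
    day = df[col].dt.day
    # True for Dec 24..31 and for Jan 1
    in_window = ((month == 12) & (day >= 24)) | ((month == 1) & (day == 1))
    return df[~in_window]

# Same seven rows as the sample frame in this answer
sample = pd.DataFrame({"dates": pd.to_datetime([
    "2003-12-23 23:00:00",
    "2003-12-24 05:00:00",
    "2004-12-27 05:00:00",
    "2003-12-13 23:00:00",
    "2002-12-23 23:00:00",
    "2004-01-01 05:00:00",
    "2014-12-24 05:00:00",
])})

kept = drop_holiday_window(sample)
print(kept)
```

This keeps the original columns untouched, which may be preferable when the `month` and `day` columns are not needed afterwards.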

Answer 1: (score: 1)

This takes advantage of the fact that datetime strings in the format mm-dd are sortable. Read everything in from the CSV file, then filter for the dates you want.
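A minimal sketch of this idea, not the answerer's own code. It assumes the CSV has no header row and two columns, here given the made-up names `id` and `dt`; an in-memory string stands in for the real file:

```python
import io
import pandas as pd

# Stand-in for the real CSV file; the column names "id" and "dt" are assumptions.
csv_text = """7511,2003-12-23 23:00:00
7512,2003-12-24 00:00:00
7727,2004-01-01 23:00:00
7728,2004-01-02 00:00:00
"""

df = pd.read_csv(
    io.StringIO(csv_text),
    header=None,
    names=["id", "dt"],
    parse_dates=["dt"],
)

# Zero-padded "mm-dd" strings sort in calendar order,
# so plain string comparison is enough to select a day-of-year range.
mmdd = df["dt"].dt.strftime("%m-%d")
filtered = df[(mmdd >= "01-02") & (mmdd < "12-24")]
print(filtered)
```

The comparison only works because `%m` and `%d` are zero-padded; a format such as "1-2" would not sort correctly.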

Answer 2: (score: 0)

You can use pandas boolean filtering together with strftime:

# version 0.23.4
import pandas as pd

# make df
df = pd.DataFrame(pd.date_range('20181223', '20190103', freq='H'), columns=['date'])

# string format the date to only include the month and day
# then set it strictly less than '12-24' AND greater than or equal to `01-02`
df = df.loc[
    (df.date.dt.strftime('%m-%d') < '12-24') &
    (df.date.dt.strftime('%m-%d') >= '01-02')
].copy()

print(df)

                   date
0   2018-12-23 00:00:00
1   2018-12-23 01:00:00
2   2018-12-23 02:00:00
3   2018-12-23 03:00:00
4   2018-12-23 04:00:00
5   2018-12-23 05:00:00
6   2018-12-23 06:00:00
7   2018-12-23 07:00:00
8   2018-12-23 08:00:00
9   2018-12-23 09:00:00
10  2018-12-23 10:00:00
11  2018-12-23 11:00:00
12  2018-12-23 12:00:00
13  2018-12-23 13:00:00
14  2018-12-23 14:00:00
15  2018-12-23 15:00:00
16  2018-12-23 16:00:00
17  2018-12-23 17:00:00
18  2018-12-23 18:00:00
19  2018-12-23 19:00:00
20  2018-12-23 20:00:00
21  2018-12-23 21:00:00
22  2018-12-23 22:00:00
23  2018-12-23 23:00:00
240 2019-01-02 00:00:00
241 2019-01-02 01:00:00
242 2019-01-02 02:00:00
243 2019-01-02 03:00:00
244 2019-01-02 04:00:00
245 2019-01-02 05:00:00
246 2019-01-02 06:00:00
247 2019-01-02 07:00:00
248 2019-01-02 08:00:00
249 2019-01-02 09:00:00
250 2019-01-02 10:00:00
251 2019-01-02 11:00:00
252 2019-01-02 12:00:00
253 2019-01-02 13:00:00
254 2019-01-02 14:00:00
255 2019-01-02 15:00:00
256 2019-01-02 16:00:00
257 2019-01-02 17:00:00
258 2019-01-02 18:00:00
259 2019-01-02 19:00:00
260 2019-01-02 20:00:00
261 2019-01-02 21:00:00
262 2019-01-02 22:00:00
263 2019-01-02 23:00:00
264 2019-01-03 00:00:00

This also works across multiple years, because we filter only on the month and the day.

# change range to include 2017
df = pd.DataFrame(pd.date_range('20171223', '20190103', freq='H'), columns=['date'])

df = df.loc[
    (df.date.dt.strftime('%m-%d') < '12-24') &
    (df.date.dt.strftime('%m-%d') >= '01-02')
].copy()

print(df)

                    date
0    2017-12-23 00:00:00
1    2017-12-23 01:00:00
2    2017-12-23 02:00:00
3    2017-12-23 03:00:00
4    2017-12-23 04:00:00
5    2017-12-23 05:00:00
6    2017-12-23 06:00:00
7    2017-12-23 07:00:00
8    2017-12-23 08:00:00
9    2017-12-23 09:00:00
10   2017-12-23 10:00:00
11   2017-12-23 11:00:00
12   2017-12-23 12:00:00
13   2017-12-23 13:00:00
14   2017-12-23 14:00:00
15   2017-12-23 15:00:00
16   2017-12-23 16:00:00
17   2017-12-23 17:00:00
18   2017-12-23 18:00:00
19   2017-12-23 19:00:00
20   2017-12-23 20:00:00
21   2017-12-23 21:00:00
22   2017-12-23 22:00:00
23   2017-12-23 23:00:00
240  2018-01-02 00:00:00
241  2018-01-02 01:00:00
242  2018-01-02 02:00:00
243  2018-01-02 03:00:00
244  2018-01-02 04:00:00
245  2018-01-02 05:00:00
...                  ...
8779 2018-12-23 19:00:00
8780 2018-12-23 20:00:00
8781 2018-12-23 21:00:00
8782 2018-12-23 22:00:00
8783 2018-12-23 23:00:00
9000 2019-01-02 00:00:00
9001 2019-01-02 01:00:00
9002 2019-01-02 02:00:00
9003 2019-01-02 03:00:00
9004 2019-01-02 04:00:00
9005 2019-01-02 05:00:00
9006 2019-01-02 06:00:00
9007 2019-01-02 07:00:00
9008 2019-01-02 08:00:00
9009 2019-01-02 09:00:00
9010 2019-01-02 10:00:00
9011 2019-01-02 11:00:00
9012 2019-01-02 12:00:00
9013 2019-01-02 13:00:00
9014 2019-01-02 14:00:00
9015 2019-01-02 15:00:00
9016 2019-01-02 16:00:00
9017 2019-01-02 17:00:00
9018 2019-01-02 18:00:00
9019 2019-01-02 19:00:00
9020 2019-01-02 20:00:00
9021 2019-01-02 21:00:00
9022 2019-01-02 22:00:00
9023 2019-01-02 23:00:00
9024 2019-01-03 00:00:00
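The two strftime calls can be collapsed into one by storing the formatted key and using `Series.between`, which is inclusive on both ends by default. A small sketch under the same setup; the variable names `mmdd` and `df_kept` are chosen here for illustration, and the hourly frequency is passed as a `Timedelta` only to avoid depending on a particular frequency alias:

```python
import pandas as pd

# Hourly timestamps from 2018-12-23 00:00 to 2019-01-03 00:00
df = pd.DataFrame(
    pd.date_range("2018-12-23", "2019-01-03", freq=pd.Timedelta(hours=1)),
    columns=["date"],
)

# Format once, then keep everything from 01-02 through 12-23 (both inclusive)
mmdd = df["date"].dt.strftime("%m-%d")
df_kept = df.loc[mmdd.between("01-02", "12-23")].copy()

print(len(df), len(df_kept))
```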

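Answer 3: (score: 0)

You could try a conditional. Perhaps use a pattern that matches the date string, or parse the date into a number (as you would in Java), and then drop the matching rows.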

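# <stringPattern> is a placeholder for the value or pattern to match
datesIdontLike = df[df['colname'] == <stringPattern>].index
# drop() returns a new DataFrame; with inplace=True it would return None
newDF = df.drop(datesIdontLike)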

See: https://thispointer.com/python-pandas-how-to-drop-rows-in-dataframe-by-conditions-on-column-values/

(Let me know if anything goes wrong.)
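One possible way to make the "parse the date into a number" idea concrete is to encode month and day as a single integer and drop the matching index labels. A hedged sketch; the column name `colname` comes from the snippet above, while the `mmdd` encoding and the small daily test frame are assumptions made for illustration:

```python
import pandas as pd

# Daily dates from 2003-12-22 to 2004-01-03 as a small test frame
df = pd.DataFrame({"colname": pd.date_range("2003-12-22", "2004-01-03", freq="D")})

# Encode month and day as one integer, e.g. Dec 24 -> 1224, Jan 1 -> 101
mmdd = df["colname"].dt.month * 100 + df["colname"].dt.day

# Index labels of the rows to remove: Dec 24..31 and Jan 1
dates_to_drop = df[(mmdd >= 1224) | (mmdd == 101)].index

# drop() without inplace returns a new DataFrame and leaves df unchanged
new_df = df.drop(dates_to_drop)
print(new_df)
```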

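Answer 4: (score: -1)

Since you want this to apply to every year, we can first build a series in which the year is replaced by a fixed value (for example 2000). Assuming date is the column that stores the dates, we can generate such a series as follows: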

dt = pd.to_datetime({'year': 2000, 'month': df['date'].dt.month, 'day': df['date'].dt.day})

For the given sample data, we obtain:

>>> dt
0    2000-12-23
1    2000-12-23
2    2000-12-23
3    2000-12-23
4    2000-12-23
5    2000-12-23
6    2000-12-23
7    2000-12-24
8    2000-12-24
9    2000-12-24
10   2000-12-24
11   2000-12-24
12   2000-12-24
13   2000-12-24
14   2000-01-01
15   2000-01-01
16   2000-01-01
17   2000-01-01
18   2000-01-01
19   2000-01-02
20   2000-01-02
21   2000-01-02
22   2000-01-02
23   2000-01-02
24   2000-01-02
25   2000-01-02
26   2000-01-02
dtype: datetime64[ns]

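Next, we can filter the rows, for example:

# compare against pd.Timestamp: newer pandas versions raise a TypeError
# when a datetime64 Series is order-compared with datetime.date objects
df[(dt >= pd.Timestamp(2000, 1, 2)) & (dt < pd.Timestamp(2000, 12, 24))]

For your sample data, this gives:

>>> df[(dt >= pd.Timestamp(2000, 1, 2)) & (dt < pd.Timestamp(2000, 12, 24))]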
      id                  dt
0   7505 2003-12-23 17:00:00
1   7506 2003-12-23 18:00:00
2   7507 2003-12-23 19:00:00
3   7508 2003-12-23 20:00:00
4   7509 2003-12-23 21:00:00
5   7510 2003-12-23 22:00:00
6   7511 2003-12-23 23:00:00
19  7728 2004-01-02 00:00:00
20  7729 2004-01-02 01:00:00
21  7730 2004-01-02 02:00:00
22  7731 2004-01-02 03:00:00
23  7732 2004-01-02 04:00:00
24  7733 2004-01-02 05:00:00
25  7734 2004-01-02 06:00:00
26  7735 2004-01-02 07:00:00

So, regardless of the year, we keep only the dates between January 2 and December 23 (both inclusive).
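The whole approach can be sketched end to end as follows. This is an illustrative version under assumptions: the column is named `date`, the small frame is invented test data, and the year/month/day parts are assembled through an explicit DataFrame before calling `pd.to_datetime`. Using 2000 as the fixed year is convenient because it is a leap year, so a February 29 in the data still maps to a valid date.

```python
import pandas as pd

df = pd.DataFrame({"date": pd.to_datetime([
    "2003-12-23 23:00:00",
    "2003-12-24 00:00:00",
    "2004-01-01 23:00:00",
    "2004-01-02 00:00:00",
    "2004-02-29 12:00:00",  # leap day, still valid in the year 2000
])})

# Rebuild each date with the year pinned to 2000 (the time of day is dropped)
parts = pd.DataFrame({
    "year": 2000,
    "month": df["date"].dt.month,
    "day": df["date"].dt.day,
})
dt = pd.to_datetime(parts)

# Keep Jan 2 through Dec 23 of the pinned year
keep = (dt >= pd.Timestamp(2000, 1, 2)) & (dt < pd.Timestamp(2000, 12, 24))
result = df[keep]
print(result)
```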