I have a dataframe with the following columns:
Date_2 Date_1 is_B
02/08/2019 01/09/2019 1
02/08/2019 01/09/2019 1
02/08/2019 01/09/2019 0
02/08/2019 01/09/2019 0
. . .
. . .
. . .
31/08/2019 01/09/2019 0
31/08/2019 01/09/2019 0
31/08/2019 01/09/2019 0
31/08/2019 01/09/2019 0
31/08/2019 01/09/2019 0
31/08/2019 01/09/2019 1
31/08/2019 01/09/2019 1
I want to generate another dataframe, df2, so that the output looks like this:
Date_1 Total_count Total(is_b = 1) num_2 num_3 num_5 num_20
01/09/2019 493 147 26 30 32 59
Total_Count = total entries for Date_1 in the dataframe
Total(is_b = 1) = total entries for Date_1 where is_b = 1
num_2 = total entries for Date_1 for 2 days where Date_2 = (Date_1 - 1 to Date_1 - 2){Both included as well}
num_3 = total entries for Date_1 for 3 days where Date_2 = (Date_1 - 3 to Date_1 - 5){Both included as well}
num_5 = total entries for Date_1 for 5 days where Date_2 = (Date_1 - 6 to Date_1 - 10){Both included as well}
num_20 = total entries for Date_1 for 20 days where Date_2 = (Date_1 - 11 to Date_1 - 30){Both included as well}
I was able to generate the first two columns easily using:
df.groupby('Date_1')['Date_1'].count()
df.loc[df.is_B == 1].groupby('Date_1')['Date_1'].count()
I am not sure how to compute the other columns.
I did try:
df.loc[(df.isBooked == 1) & (df.Booking_Date = Flight_Date - 1) & (df.Booking_Date = Flight_Date - 2)].groupby('Flight_Date')['Flight_Date'].count().reset_index(name='num_2')
But this is completely invalid syntax.
Can anyone help me generate the columns num_2, num_3, num_5 and num_20?
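Answer 0 (score: 4)
The answer has two parts.
As the sample shows, the dates are not parsed; they are strings. They must be parsed before any date arithmetic can be performed.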
import pandas as pd

# Parse the day/month/year strings into datetime64 values.
for c in ['Date_1', 'Date_2']:
    df[c] = pd.to_datetime(df[c], format='%d/%m/%Y')
If you print df, it should look like this (note the date format):
Date_2 Date_1 is_B
0 2019-08-02 2019-09-01 1
1 2019-08-02 2019-09-01 1
2 2019-08-02 2019-09-01 0
3 2019-08-02 2019-09-01 0
These columns now have dtype: datetime64[ns].
We will compute several series indexed by Date_1 and then merge them.
total_count = df.groupby('Date_1')['Date_1'].count().rename('Total_Count')
total_count_is_b = df[df.is_B == 1].groupby('Date_1')['Date_1'] \
    .count().rename('Total(is_b = 1)')
To get num_2, do the following:
from datetime import timedelta

num_2_df = df[
    (df.is_B == 1) &
    df.Date_2.between(
        df.Date_1 - timedelta(days=2),  # left bound: the earlier date
        df.Date_1 - timedelta(days=1)   # right bound: the later date
    )
].groupby('Date_1')['Date_2'].count().rename('num_2')
# Note the argument order of `pandas.Series.between`: the earlier date comes first.
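The argument order matters because `between(left, right)` keeps values with left <= value <= right, both ends included by default. The following is a minimal sketch on made-up data (the `toy` frame and its dates are assumptions for illustration, not the asker's data) showing what happens when the bounds are swapped:

```python
import pandas as pd

# Toy data (hypothetical): one Date_1 and four candidate Date_2 values.
toy = pd.DataFrame({
    'Date_1': pd.to_datetime(['2019-09-01'] * 4),
    'Date_2': pd.to_datetime(['2019-08-29', '2019-08-30',
                              '2019-08-31', '2019-09-01']),
})

# Correct order: left = earlier date (Date_1 - 2), right = later date (Date_1 - 1).
in_window = toy['Date_2'].between(
    toy['Date_1'] - pd.Timedelta(days=2),
    toy['Date_1'] - pd.Timedelta(days=1),
)

# Swapped order: left is later than right, so no value can satisfy both bounds.
swapped = toy['Date_2'].between(
    toy['Date_1'] - pd.Timedelta(days=1),
    toy['Date_1'] - pd.Timedelta(days=2),
)

print(in_window.tolist())  # [False, True, True, False]
print(swapped.tolist())    # [False, False, False, False]
```

Only 30/08 and 31/08 fall in the two-day window before 01/09; with the bounds reversed the filter silently matches nothing rather than raising an error.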
The others, num_3, num_5 and num_20, are computed in the same way:
num_3_df = df[
    (df.is_B == 1) &
    df.Date_2.between(df.Date_1 - timedelta(days=5), df.Date_1 - timedelta(days=3))
].groupby('Date_1')['Date_2'].count().rename('num_3')

num_5_df = df[
    (df.is_B == 1) &
    df.Date_2.between(df.Date_1 - timedelta(days=10), df.Date_1 - timedelta(days=6))
].groupby('Date_1')['Date_2'].count().rename('num_5')

num_20_df = df[
    (df.is_B == 1) &
    df.Date_2.between(df.Date_1 - timedelta(days=30), df.Date_1 - timedelta(days=11))
].groupby('Date_1')['Date_2'].count().rename('num_20')
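Since the four filters differ only in their day offsets, they could also be driven from a small table of windows. This is a sketch of that idea, not part of the original answer; the `sample` frame, the `windows` dict and the `window_counts` name are hypothetical and chosen for illustration:

```python
import pandas as pd

# Hypothetical sample; in practice use the parsed dataframe from the question.
sample = pd.DataFrame({
    'Date_1': pd.to_datetime(['2019-09-01'] * 5),
    'Date_2': pd.to_datetime(['2019-08-31', '2019-08-28', '2019-08-25',
                              '2019-08-10', '2019-08-30']),
    'is_B': [1, 1, 1, 1, 0],
})

# column name -> (nearest, farthest) offset in days before Date_1, both inclusive
windows = {'num_2': (1, 2), 'num_3': (3, 5), 'num_5': (6, 10), 'num_20': (11, 30)}

flagged = sample[sample['is_B'] == 1]
window_counts = {}
for name, (near, far) in windows.items():
    # The farthest offset gives the earlier date, so it is the left bound.
    mask = flagged['Date_2'].between(
        flagged['Date_1'] - pd.Timedelta(days=far),
        flagged['Date_1'] - pd.Timedelta(days=near),
    )
    window_counts[name] = (
        flagged[mask].groupby('Date_1')['Date_2'].count().rename(name)
    )
```

The resulting series can be passed to `pd.concat` in the same way as the hand-written ones, for example `list(window_counts.values())`.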
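Finally, all the columns are merged into a single table; a Date_1 value that is missing from one of the series gets a count of 0: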
result_df = pd.concat(
    [total_count, total_count_is_b, num_2_df, num_3_df, num_5_df, num_20_df],
    axis=1
).fillna(0).astype(int)
result_df = result_df.reset_index()
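As an alternative to separate `between` filters, one could compute the number of days between the two dates once and bucket it with `pd.cut`. This is a different technique from the one used above, sketched under the assumption that the windows are contiguous (1-2, 3-5, 6-10, 11-30 days); the `sample` frame and the variable names are hypothetical:

```python
import pandas as pd

# Hypothetical sample with already-parsed dates.
sample = pd.DataFrame({
    'Date_1': pd.to_datetime(['2019-09-01'] * 5),
    'Date_2': pd.to_datetime(['2019-08-31', '2019-08-28', '2019-08-25',
                              '2019-08-10', '2019-08-30']),
    'is_B': [1, 1, 1, 1, 0],
})

flagged = sample[sample['is_B'] == 1]

# Whole days from Date_2 to Date_1 (positive when Date_2 is earlier).
days_before = (flagged['Date_1'] - flagged['Date_2']).dt.days

# Right-inclusive bins: (0, 2], (2, 5], (5, 10], (10, 30].
# Values outside every bin (0 days, or more than 30) become NaN.
bucket = pd.cut(days_before, bins=[0, 2, 5, 10, 30],
                labels=['num_2', 'num_3', 'num_5', 'num_20'])

# One indicator column per bucket, then sum the indicators per Date_1.
bucket_counts = pd.get_dummies(bucket, dtype=int).groupby(flagged['Date_1']).sum()
print(bucket_counts)
```

This avoids repeating the filter four times, at the cost of relying on the windows never overlapping or leaving gaps.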