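I'm trying to figure out how to get a filtered DataFrame with the rows that fall 30-60 days after the first date for each unique User ID. I can get the first 30 days with the following code: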
get_first_month = get_first_90.loc[df.groupby('User ID')['Date'].apply(lambda g: g <= g.min() + timedelta(days=30))]
But I don't know how to specify days 30-60. I tried:
get_first_month = get_first_90.loc[df.groupby('User ID')['Date'].apply(lambda g: g.min() + timedelta(days=30) > g <= g.min() + timedelta(days=60))]
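But it returns an error saying the truth value of a Series is ambiguous. I've tried a few other approaches as well, but can't figure it out. Thanks for your time!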
Answer 0 (score: 2)
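You should use groupby + transform to broadcast the min (earliest) date back to every row for that user. Then you can build a simple mask over the whole DataFrame that checks whether each date is between the earliest date plus one offset and the earliest date plus another. (Here I'll use 2 and 3 days, but you can easily change this to 30 and 60 days for your real data.)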
import pandas as pd
df = pd.DataFrame({'User ID': ['A']*5+['B']*7,
'Date': pd.date_range('2010-01-01', freq='1D', periods=12)})
# Earliest Date for each `User ID`
s = df.groupby('User ID')['Date'].transform('min')
# Boolean mask of dates between 2 and 3 days (inclusive) after the earliest date
m = df['Date'].between(s+pd.offsets.DateOffset(days=2),
s+pd.offsets.DateOffset(days=3))
df.loc[m]
# User ID Date
#2 A 2010-01-03
#3 A 2010-01-04
#7 B 2010-01-08
#8 B 2010-01-09
For completeness, this is what it looks like when the mask is assigned back to the DataFrame.
df['select'] = m
# User ID Date select
#0 A 2010-01-01 False
#1 A 2010-01-02 False
#2 A 2010-01-03 True
#3 A 2010-01-04 True
#4 A 2010-01-05 False
#5 B 2010-01-06 False
#6 B 2010-01-07 False
#7 B 2010-01-08 True
#8 B 2010-01-09 True
#9 B 2010-01-10 False
#10 B 2010-01-11 False
#11 B 2010-01-12 False
The values also don't need to be pure dates. As long as a value falls within [min_datetime + 2 days, min_datetime + 3 days], it will be selected.
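As a rough sketch of how the same approach might look with the 30 and 60 day window from the question: the sample data and the names `first` and `in_window` below are made up for illustration and are not part of the original answer.

```python
import pandas as pd

# Hypothetical sample data: two users, one row every 10 days each
df = pd.DataFrame({
    'User ID': ['A'] * 8 + ['B'] * 8,
    'Date': list(pd.date_range('2020-01-01', freq='10D', periods=8))
            + list(pd.date_range('2020-03-01', freq='10D', periods=8)),
})

# Earliest date per user, broadcast back to every row of that user
first = df.groupby('User ID')['Date'].transform('min')

# Rows dated 30 to 60 days (inclusive) after each user's earliest date
in_window = df['Date'].between(first + pd.Timedelta(days=30),
                               first + pd.Timedelta(days=60))
between_30_60 = df.loc[in_window]
```

Series.between is inclusive on both ends by default, so rows exactly 30 or exactly 60 days after the first date are kept.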
Answer 1 (score: 2)
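Use g.min() + pd.Timedelta(days=30) as the lower bound and g.min() + pd.Timedelta(days=60) as the upper bound.
A chained comparison of the form date + 30 <= date <= date + 60 does not work element-wise on a pandas Series. Each condition has to be wrapped in parentheses and combined as (...) & (...), which is why the implementation in the question doesn't work.
pandas provides pandas.Timedelta, so there is no need to import timedelta from datetime.
Note that .apply is not the fastest option here.
import pandas as pd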
import random # just for test data
# setup test data for example
random.seed(365)
data = {'User ID': [random.choice(['A', 'B', 'C', 'D', 'E']) for _ in range(90)],
'Date': pd.bdate_range('2020-09-20', freq='d', periods=90).tolist()}
df = pd.DataFrame(data)
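# select rows 30 to 60 days (inclusive) after each user's first date
# group_keys=False keeps the original row index so the mask aligns with df (needed on recent pandas versions)
between_30_60 = df.loc[df.groupby('User ID', group_keys=False)['Date'].apply(lambda g: (g >= g.min() + pd.Timedelta(days=30)) & (g <= g.min() + pd.Timedelta(days=60)))]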
# display(between_30_60)
User ID Date
32 B 2020-10-22
33 C 2020-10-23
34 E 2020-10-24
35 C 2020-10-25
36 B 2020-10-26
37 E 2020-10-27
38 B 2020-10-28
39 B 2020-10-29
41 A 2020-10-31
42 C 2020-11-01
43 C 2020-11-02
44 E 2020-11-03
45 D 2020-11-04
46 B 2020-11-05
47 D 2020-11-06
48 A 2020-11-07
49 C 2020-11-08
50 D 2020-11-09
51 C 2020-11-10
52 B 2020-11-11
53 E 2020-11-12
54 D 2020-11-13
55 B 2020-11-14
56 A 2020-11-15
57 C 2020-11-16
58 D 2020-11-17
59 C 2020-11-18
60 D 2020-11-19
61 A 2020-11-20
65 D 2020-11-24
68 A 2020-11-27
71 A 2020-11-30
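To see why the chained comparison fails, the error from the question can be reproduced on a small Series. This is only an illustrative sketch: the Series `g` and the names `lower`, `upper` and `chained_ok` are invented here.

```python
import pandas as pd

# Hypothetical dates for a single user
g = pd.Series(pd.to_datetime(['2020-01-01', '2020-02-05', '2020-04-01']))
lower = g.min() + pd.Timedelta(days=30)
upper = g.min() + pd.Timedelta(days=60)

# Python evaluates `lower <= g <= upper` as `(lower <= g) and (g <= upper)`.
# `and` needs a single True/False, which a boolean Series cannot provide.
try:
    lower <= g <= upper
    chained_ok = True
except ValueError as err:
    chained_ok = False
    print(err)

# Element-wise version: parenthesize each condition and combine with &
mask = (g >= lower) & (g <= upper)
```

Only the second date (35 days after the first) falls inside the window, so mask is True for that row alone.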