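So I have 2 pandas DataFrames. One has every date for a person within a given date range (df_all_days), and the other has only the days on which that person was active (df_active_days). I want to remove the inactive rows from df_all_days, but only where the person was inactive for 3 or more consecutive days. Only the dates that meet this condition should be removed, not any other active or inactive dates.
For example, in df_all_days: for 'DG-3465', do not remove the inactive dates 2/2 - 2/3, but do remove the inactive dates 2/8 - 2/12. Likewise, remove all of the dates 2/9 - 2/13 for 'TY-9456'.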
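I tried merging the two DataFrames and then backfilling the dates that came out as NaN. I then added a column of 1s to every row. The plan was to take a rolling sum over rows where the date is the same and then drop every row whose sum is greater than 3. However, this approach has two problems.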
Answer 0 (score: 0)
import pandas as pd

# Merge the two DataFrames and get an indicator for inactive days
# ('left_only' marks rows that appear only in df_all_days)
merged = pd.merge(df_all_days, df_active_days, how='left', on=['PersonID', 'Date'], indicator=True)
indicators = merged._merge.tolist()
# Collect the positions of runs of more than 2 consecutive inactive days
candidate = []
final = []
for k, v in enumerate(indicators):
    if v != 'left_only':
        # An active day ends the current run; keep it only if it is long enough
        if len(candidate) < 3:
            candidate = []
        else:
            final.extend(candidate)
            candidate = []
    else:
        candidate.append(k)
# Handle a run that reaches the end of the data
if len(candidate) > 2:
    final.extend(candidate)
# Remove rows that belong to a run of more than 2 consecutive inactive days
df_final = merged[~merged.index.isin(final)][['PersonID', 'Date']]
df_final
Out[863]:
    PersonID        Date
0     AB-123  2016-02-01
1     AB-123  2016-02-02
2     AB-123  2016-02-03
6     AB-123  2016-02-07
7     AB-123  2016-02-08
8     AB-123  2016-02-09
9     AB-123  2016-02-10
10    AB-123  2016-02-11
11    AB-123  2016-02-12
12    AB-123  2016-02-13
13   DG-3465  2016-02-01
14   DG-3465  2016-02-02
15   DG-3465  2016-02-03
16   DG-3465  2016-02-04
17   DG-3465  2016-02-05
18   DG-3465  2016-02-06
19   DG-3465  2016-02-07
25   DG-3465  2016-02-13
26   TY-9456  2016-02-01
27   TY-9456  2016-02-02
28   TY-9456  2016-02-03
29   TY-9456  2016-02-04
30   TY-9456  2016-02-05
31   TY-9456  2016-02-06
32   TY-9456  2016-02-07
33   TY-9456  2016-02-08
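
One caveat with the loop above: it walks the merged frame from top to bottom, so an inactive run at the end of one person's rows can join an inactive run at the start of the next person's rows. Below is a possible vectorized variant that counts runs per person. It is only a sketch: the function name `drop_long_inactive_runs`, the `min_run` parameter, and the small `P1`/`P2` sample frames are made up for illustration and are not the data from the question. It also assumes `Date` has the same dtype in both frames.

```python
import pandas as pd


def drop_long_inactive_runs(all_days, active_days, min_run=3):
    """Drop inactive rows that sit in a run of at least `min_run`
    consecutive inactive rows for the same person (hypothetical helper)."""
    merged = (
        all_days.merge(
            active_days[["PersonID", "Date"]].drop_duplicates(),
            how="left", on=["PersonID", "Date"], indicator=True,
        )
        .sort_values(["PersonID", "Date"])
        .reset_index(drop=True)
    )
    # 'left_only' means the day is in all_days but not in active_days
    inactive = merged["_merge"] == "left_only"
    # A new run starts when the inactive flag flips or the person changes
    new_run = inactive.ne(inactive.shift()) | merged["PersonID"].ne(merged["PersonID"].shift())
    run_id = new_run.cumsum()
    # Length of the run each row belongs to
    run_len = inactive.groupby(run_id).transform("size")
    keep = ~(inactive & (run_len >= min_run))
    return merged.loc[keep, ["PersonID", "Date"]].reset_index(drop=True)


# Made-up sample data (not the data from the question)
df_all_days = pd.DataFrame({
    "PersonID": ["P1"] * 8 + ["P2"] * 4,
    "Date": list(pd.date_range("2016-02-01", periods=8))
            + list(pd.date_range("2016-02-01", periods=4)),
})
df_active_days = pd.DataFrame({
    "PersonID": ["P1", "P1", "P1", "P2", "P2", "P2"],
    "Date": pd.to_datetime([
        "2016-02-01", "2016-02-02", "2016-02-06",
        "2016-02-02", "2016-02-03", "2016-02-04",
    ]),
})

result = drop_long_inactive_runs(df_all_days, df_active_days)
print(result)
```

In this sample, P1 is inactive on 2/3 - 2/5 (a run of 3, dropped) and on 2/7 - 2/8 (a run of 2, kept), and P2 is inactive only on 2/1. Because runs restart at each `PersonID`, P1's trailing two inactive days are not combined with P2's first day.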