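I have some subscription lifetime data showing the share of subscriptions billed in a given week, relative to the number of subscriptions created in the cohort's first week. For example, if 1000 subscriptions were created and "Billed Week5" is 0.3166, then 316.6 of them were billed in that week.
I want to clean out the "bad data". Sometimes, because of technical problems, we don't bill for a few weeks and then catch up on the billing. This messes up my retention curve when I use it for other forecasts.
How can I remove the retention rates affected by these technical errors so that I don't use them in my model?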
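I was thinking of some logic like this: if the current retention rate (column X relative to column X-1) differs from the average of the previous 3 columns by an absolute delta of more than 0.20, remove that value. In the last table below, that means I want to remove the values of %BilledWeek4 and %BilledWeek5 in the first row (index 0). I think my rule would also kill %BilledWeek6, even though the actual rate of 0.1990 in the corresponding table looks fine.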
I tried a lambda like the one below, but got the error "TypeError: 'float' object is not subscriptable". If I could get this working, I could loop over the columns.
df['%BilledWeek5']= df['%BilledWeek5'].apply(lambda x: x if (((x['%BilledWeek4'] + x['%BilledWeek3'] + x['%BilledWeek2'])/3)/x-1).abs() <0.2 else '')
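A note on the error, offered as a minimal sketch rather than a definitive fix: `Series.apply` passes the lambda one scalar at a time, so `x` is a plain float and `x['%BilledWeek4']` raises the TypeError (a float also has no `.abs()` method). Applying a function row-wise with `df.apply(..., axis=1)` gives access to the other columns. The helper name `keep_or_nan` and the two-row sample frame are illustrative only; the condition is the relative one from the lambda above, and NaN is used instead of `''` on the assumption that the column should stay numeric.

```python
import numpy as np
import pandas as pd

# Two rows of ratio values copied from the question's last table
# (index 0 = DE/cWeek1, index 1 = AU/cWeek1); illustrative subset only.
df = pd.DataFrame({
    '%BilledWeek2': [0.921811, 0.955283],
    '%BilledWeek3': [0.985268, 1.000000],
    '%BilledWeek4': [0.423199, 0.929958],
    '%BilledWeek5': [3.389722, 0.990679],
})

def keep_or_nan(row, col='%BilledWeek5',
                prior=('%BilledWeek2', '%BilledWeek3', '%BilledWeek4')):
    """Hypothetical helper: keep row[col] only if it is close to the prior mean."""
    prior_mean = row[list(prior)].mean()
    # Built-in abs() works on a scalar; same relative test as the original lambda.
    return row[col] if abs(prior_mean / row[col] - 1) < 0.2 else np.nan

# axis=1 passes each row as a Series, so column labels can be indexed.
df['%BilledWeek5'] = df.apply(keep_or_nan, axis=1)
print(df)
```

With this sample, the catch-up value 3.389722 in the first row is replaced by NaN, while 0.990679 in the second row is kept.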
Maybe I am going about this in completely the wrong way. There may also be some statistical function I could use.
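On the statistical side, one possible alternative (an assumption on my part, not an established method for this data) is to compare each ratio with the row's median instead of a mean of neighbours, since a median is not dragged around by a single catch-up spike. The sketch below reuses the 0.20 threshold from the rule above; the names `ratios`, `row_median` and `cleaned` are illustrative, and the numbers are the ratio values printed in the last table.

```python
import pandas as pd

# Week-over-week ratios copied from the question's last table.
ratios = pd.DataFrame(
    [[0.921811, 0.985268, 0.423199, 3.389722, 0.628553, 0.949246],
     [0.998341, 0.928126, 0.994628, 0.412691, 3.496183, 0.625702],
     [0.955283, 1.000000, 0.929958, 0.990679, 0.404968, 3.583643],
     [0.959497, 0.954512, 0.935189, 0.986547, 0.397934, 3.674974]],
    columns=[f'%BilledWeek{i}' for i in range(2, 8)],
)

# Typical ratio per cohort; robust to a few extreme values.
row_median = ratios.median(axis=1)

# Absolute distance of every ratio from its own row's median.
deviation = ratios.sub(row_median, axis=0).abs()

# Keep values within 0.20 of the median, set the rest to NaN.
cleaned = ratios.where(deviation <= 0.20)
print(cleaned)
```

For the first row this still blanks %BilledWeek6 along with %BilledWeek4 and %BilledWeek5, because the ratio 0.1990 / 0.3166 is itself distorted by the catch-up week even though 0.1990 looks fine as a level.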
import pandas as pd
subscriptionlifetime = [
    {'Country': 'DE', 'Product': 'Cable', 'Created Week': 'cWeek1',
     'Billed Week1': 0.2430, 'Billed Week2': 0.2240, 'Billed Week3': 0.2207, 'Billed Week4': 0.0934,
     'Billed Week5': 0.3166, 'Billed Week6': 0.1990, 'Billed Week7': 0.1889, 'Billed Week8': 0.1816},
    {'Country': 'DE', 'Product': 'Cable', 'Created Week': 'cWeek2',
     'Billed Week1': 0.2411, 'Billed Week2': 0.2407, 'Billed Week3': 0.2234, 'Billed Week4': 0.2222,
     'Billed Week5': 0.0917, 'Billed Week6': 0.3206, 'Billed Week7': 0.2006, 'Billed Week8': 0.1909},
    {'Country': 'AU', 'Product': 'Satelite', 'Created Week': 'cWeek1',
     'Billed Week1': 0.3019, 'Billed Week2': 0.2884, 'Billed Week3': 0.2884, 'Billed Week4': 0.2682,
     'Billed Week5': 0.2657, 'Billed Week6': 0.1076, 'Billed Week7': 0.3856, 'Billed Week8': 0.2403},
    {'Country': 'AU', 'Product': 'Satelite', 'Created Week': 'cWeek2',
     'Billed Week1': 0.2864, 'Billed Week2': 0.2748, 'Billed Week3': 0.2623, 'Billed Week4': 0.2453,
     'Billed Week5': 0.2420, 'Billed Week6': 0.0963, 'Billed Week7': 0.3539, 'Billed Week8': 0.2216},
]
df = pd.DataFrame(subscriptionlifetime)
# Fix the column order for display
df = df[['Country', 'Product', 'Created Week',
         'Billed Week1', 'Billed Week2', 'Billed Week3', 'Billed Week4',
         'Billed Week5', 'Billed Week6', 'Billed Week7', 'Billed Week8']]
print(df)
  Country   Product Created Week  Billed Week1  Billed Week2  Billed Week3  \
0      DE     Cable       cWeek1        0.2430        0.2240        0.2207
1      DE     Cable       cWeek2        0.2411        0.2407        0.2234
2      AU  Satelite       cWeek1        0.3019        0.2884        0.2884
3      AU  Satelite       cWeek2        0.2864        0.2748        0.2623

   Billed Week4  Billed Week5  Billed Week6  Billed Week7  Billed Week8
0        0.0934        0.3166        0.1990        0.1889        0.1816
1        0.2222        0.0917        0.3206        0.2006        0.1909
2        0.2682        0.2657        0.1076        0.3856        0.2403
3        0.2453        0.2420        0.0963        0.3539        0.2216
# Week-over-week ratio for weeks 2 through 7: billed in week x / billed in week x-1
for x in range(2,8):
    df['%BilledWeek'+str(x)] = df['Billed Week'+str(x)]/df['Billed Week'+str(x-1)]
    print (x)
print(df)
  Country   Product Created Week  Billed Week1  Billed Week2  Billed Week3  \
0      DE     Cable       cWeek1        0.2430        0.2240        0.2207
1      DE     Cable       cWeek2        0.2411        0.2407        0.2234
2      AU  Satelite       cWeek1        0.3019        0.2884        0.2884
3      AU  Satelite       cWeek2        0.2864        0.2748        0.2623

   Billed Week4  Billed Week5  Billed Week6  Billed Week7  Billed Week8  \
0        0.0934        0.3166        0.1990        0.1889        0.1816
1        0.2222        0.0917        0.3206        0.2006        0.1909
2        0.2682        0.2657        0.1076        0.3856        0.2403
3        0.2453        0.2420        0.0963        0.3539        0.2216

   %BilledWeek2  %BilledWeek3  %BilledWeek4  %BilledWeek5  %BilledWeek6  \
0      0.921811      0.985268      0.423199      3.389722      0.628553
1      0.998341      0.928126      0.994628      0.412691      3.496183
2      0.955283      1.000000      0.929958      0.990679      0.404968
3      0.959497      0.954512      0.935189      0.986547      0.397934

   %BilledWeek7
0      0.949246
1      0.625702
2      3.583643
3      3.674974
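For reference, the rule described earlier (drop a ratio when it is more than 0.20 away from the mean of the preceding three ratios) can probably be applied to all columns at once without a per-column lambda. This is only a sketch under stated assumptions: the names `billed`, `ratios`, `prior_mean` and `cleaned` are made up here, and `min_periods=1` is my own choice so that %BilledWeek3 and %BilledWeek4 are still checked against the one or two earlier ratios that exist.

```python
import pandas as pd

# Billed fractions from the question (weeks 1-8), one row per cohort.
billed = pd.DataFrame(
    [[0.2430, 0.2240, 0.2207, 0.0934, 0.3166, 0.1990, 0.1889, 0.1816],
     [0.2411, 0.2407, 0.2234, 0.2222, 0.0917, 0.3206, 0.2006, 0.1909],
     [0.3019, 0.2884, 0.2884, 0.2682, 0.2657, 0.1076, 0.3856, 0.2403],
     [0.2864, 0.2748, 0.2623, 0.2453, 0.2420, 0.0963, 0.3539, 0.2216]],
    columns=[f'Billed Week{i}' for i in range(1, 9)],
)

# Week-over-week ratios for weeks 2-7, same definition as the loop above.
ratios = pd.DataFrame({
    f'%BilledWeek{i}': billed[f'Billed Week{i}'] / billed[f'Billed Week{i - 1}']
    for i in range(2, 8)
})

# Transpose so weeks run down the index, shift by one week so the current
# value is excluded, then average up to three preceding ratios.
prior_mean = ratios.T.shift(1).rolling(3, min_periods=1).mean().T

# Flag ratios more than 0.20 away from that mean; comparisons against NaN
# are False, so %BilledWeek2 (no history) is always kept.
too_far = (ratios - prior_mean).abs() > 0.20
cleaned = ratios.mask(too_far)
print(cleaned.round(4))
```

On this data the first row loses %BilledWeek4 through %BilledWeek7. The later weeks are flagged mainly because the spike is still inside the three-week window and inflates the mean they are compared against, which may be a reason to prefer a baseline that ignores already-flagged values.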