Pandas Python lifetime curves: how to remove outliers from the curves

Time: 2017-06-14 23:00:37

Tags: python pandas lifetime outliers

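I have lifetime data for some subscriptions, showing the share billed in a given week relative to the first week in which the subscriptions were created. So if 1000 subs were created and "Billed Week5" shows 0.3166, then 316.6 of them were billed in that week.

I want to clean out "bad data". Sometimes, because of technical problems, we don't bill for a few weeks and then catch up on the billing. That messes up my retention curves when I use them for further forecasting.

How can I clean out the retention rates caused by these technical errors so that I don't use them in my model?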

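I was thinking of some logic along these lines: if the current retention rate (the % change between column X and column X-1) differs from the average of the previous 3 columns by an absolute delta of 0.20, drop the value. In the last table below, for example, I would want to remove the values in %BilledWeek4 and %BilledWeek5 for the first row (index 0). I think my rule would also kill %BilledWeek6, even though the actual rate of 0.1990 in the corresponding table looks fine.

I tried doing this with a lambda like the one below, but got the error "TypeError: 'float' object is not subscriptable". If I could get this working, I could loop through the columns.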

df['%BilledWeek5']= df['%BilledWeek5'].apply(lambda x: x if (((x['%BilledWeek4'] + x['%BilledWeek3'] + x['%BilledWeek2'])/3)/x-1).abs() <0.2 else '')
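The error comes from `Series.apply`, which passes the lambda one scalar float at a time, so `x['%BilledWeek4']` tries to index a float. A float also has no `.abs()` method. A minimal sketch of one possible fix is to apply over rows with `DataFrame.apply(..., axis=1)`. The helper `clean_week5`, its `threshold` parameter, and the output column `%BilledWeek5_clean` are hypothetical names, and the small frame only copies rows 0 to 2 of the ratio table shown further down.

```python
import numpy as np
import pandas as pd

# Ratio columns for rows 0-2, copied from the question's last table.
df = pd.DataFrame({
    '%BilledWeek2': [0.921811, 0.998341, 0.955283],
    '%BilledWeek3': [0.985268, 0.928126, 1.000000],
    '%BilledWeek4': [0.423199, 0.994628, 0.929958],
    '%BilledWeek5': [3.389722, 0.412691, 0.990679],
})

def clean_week5(row, threshold=0.2):
    """Return the week 5 ratio, or NaN if it is too far from the prior mean."""
    prior_mean = (row['%BilledWeek4'] + row['%BilledWeek3'] + row['%BilledWeek2']) / 3
    # Same comparison as the original lambda: mean divided by the value, minus 1.
    if abs(prior_mean / row['%BilledWeek5'] - 1) < threshold:
        return row['%BilledWeek5']
    return np.nan  # NaN keeps the column numeric, unlike ''

# axis=1 passes each row as a Series, so column labels work as keys.
df['%BilledWeek5_clean'] = df.apply(clean_week5, axis=1)
print(df)
```

With these three rows, the week 5 ratios of rows 0 and 1 become NaN and the value in row 2 is kept.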

Maybe I'm going about this in completely the wrong way. There may also be some statistical function that could be used.
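One statistical option, sketched here as an assumption rather than an established answer, is to compare each billed rate with a rolling median of its neighbouring weeks and blank out values that sit far from it. A median is less affected by a single dip or catch-up spike than a mean. The window of 3 and the `tolerance` of 0.3 are illustrative choices, not values from the question, and the frame copies only rows 0 and 2 of the data.

```python
import pandas as pd

weeks = ['Billed Week' + str(i) for i in range(1, 9)]
# Rows 0 and 2 of the question's data.
rates = pd.DataFrame(
    [[0.2430, 0.2240, 0.2207, 0.0934, 0.3166, 0.1990, 0.1889, 0.1816],
     [0.3019, 0.2884, 0.2884, 0.2682, 0.2657, 0.1076, 0.3856, 0.2403]],
    columns=weeks)

# rolling() works down rows, so transpose to roll along the weeks.
local_median = rates.T.rolling(window=3, center=True, min_periods=1).median().T

# Relative distance of each rate from its local median.
rel_dev = (rates - local_median).abs() / local_median

tolerance = 0.3  # assumed cutoff, not from the question
cleaned = rates.mask(rel_dev > tolerance)  # flagged cells become NaN
print(cleaned)
```

On these two rows the sketch blanks weeks 4 and 5 in the first row and weeks 6 and 7 in the second, while leaving the 0.1990 in week 6 of the first row untouched. The gaps could then be filled, for example with `DataFrame.interpolate(axis=1)`, if the model needs complete curves.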

import pandas as pd

subscriptionlifetime = [{'Country':'DE','Product':'Cable','Created Week':'cWeek1','Billed Week1':0.2430,'Billed Week2':0.2240,'Billed Week3':0.2207,'Billed Week4':0.0934,
'Billed Week5':0.3166,'Billed Week6':0.1990,'Billed Week7':0.1889,'Billed Week8':0.1816},
         {'Country':'DE','Product':'Cable','Created Week':'cWeek2','Billed Week1':0.2411,'Billed Week2':0.2407,
         'Billed Week3':0.2234,'Billed Week4':0.2222,'Billed Week5':0.0917,'Billed Week6':0.3206,'Billed Week7':0.2006,'Billed Week8':0.1909},
         {'Country':'AU','Product':'Satelite','Created Week':'cWeek1','Billed Week1':0.3019,'Billed Week2':0.2884,
         'Billed Week3':0.2884,'Billed Week4':0.2682,'Billed Week5':0.2657,'Billed Week6':0.1076,'Billed Week7':0.3856,'Billed Week8':0.2403},
         {'Country':'AU','Product':'Satelite','Created Week':'cWeek2','Billed Week1':0.2864,'Billed Week2':0.2748,
         'Billed Week3':0.2623,'Billed Week4':0.2453,'Billed Week5':0.2420,'Billed Week6':0.0963,'Billed Week7':0.3539,'Billed Week8':0.2216}]

df = pd.DataFrame(subscriptionlifetime)

df = df[['Country','Product','Created Week', 'Billed Week1', 'Billed Week2', 'Billed Week3', 'Billed Week4', 'Billed Week5','Billed Week6' , 'Billed Week7', 'Billed Week8']]         

print(df)

  Country   Product Created Week  Billed Week1  Billed Week2  Billed Week3  \
0      DE     Cable       cWeek1        0.2430        0.2240        0.2207   
1      DE     Cable       cWeek2        0.2411        0.2407        0.2234   
2      AU  Satelite       cWeek1        0.3019        0.2884        0.2884   
3      AU  Satelite       cWeek2        0.2864        0.2748        0.2623   

   Billed Week4  Billed Week5  Billed Week6  Billed Week7  Billed Week8  
0        0.0934        0.3166        0.1990        0.1889        0.1816  
1        0.2222        0.0917        0.3206        0.2006        0.1909  
2        0.2682        0.2657        0.1076        0.3856        0.2403  
3        0.2453        0.2420        0.0963        0.3539        0.2216  


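# Week-over-week ratio: each week's billed share divided by the previous week's.
# range(2, 8) covers weeks 2 through 7, so no ratio is computed for week 8.
for x in range(2, 8):
    df['%BilledWeek' + str(x)] = df['Billed Week' + str(x)] / df['Billed Week' + str(x - 1)]
    print(x)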

print(df)


  Country   Product Created Week  Billed Week1  Billed Week2  Billed Week3  \
0      DE     Cable       cWeek1        0.2430        0.2240        0.2207   
1      DE     Cable       cWeek2        0.2411        0.2407        0.2234   
2      AU  Satelite       cWeek1        0.3019        0.2884        0.2884   
3      AU  Satelite       cWeek2        0.2864        0.2748        0.2623   

   Billed Week4  Billed Week5  Billed Week6  Billed Week7  Billed Week8  \
0        0.0934        0.3166        0.1990        0.1889        0.1816   
1        0.2222        0.0917        0.3206        0.2006        0.1909   
2        0.2682        0.2657        0.1076        0.3856        0.2403   
3        0.2453        0.2420        0.0963        0.3539        0.2216   

   %BilledWeek2  %BilledWeek3  %BilledWeek4  %BilledWeek5  %BilledWeek6  \
0      0.921811      0.985268      0.423199      3.389722      0.628553   
1      0.998341      0.928126      0.994628      0.412691      3.496183   
2      0.955283      1.000000      0.929958      0.990679      0.404968   
3      0.959497      0.954512      0.935189      0.986547      0.397934   

   %BilledWeek7  
0      0.949246  
1      0.625702  
2      3.583643  
3      3.674974  
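The rule described earlier (drop a ratio that differs from the mean of the previous ratios by more than 0.20) can be sketched as a loop over the ratio columns. This is one reading of the rule, with two assumptions that are not in the question: values already removed are skipped when averaging, and the check starts at %BilledWeek4 using only the two earlier ratios that exist at that point. The frame copies rows 0 and 2 of the table above.

```python
import numpy as np
import pandas as pd

# Ratio columns for rows 0 and 2, copied from the table above.
cols = ['%BilledWeek' + str(i) for i in range(2, 8)]
ratios = pd.DataFrame(
    [[0.921811, 0.985268, 0.423199, 3.389722, 0.628553, 0.949246],
     [0.955283, 1.000000, 0.929958, 0.990679, 0.404968, 3.583643]],
    columns=cols)

threshold = 0.20  # absolute delta from the question
cleaned = ratios.copy()

# Start at the third ratio column so at least two earlier values exist.
for i in range(2, len(cols)):
    # Up to three preceding columns, taken from the already cleaned frame.
    prior = cleaned.iloc[:, max(0, i - 3):i]
    # mean(axis=1) skips NaN, so values removed earlier do not distort it.
    prior_mean = prior.mean(axis=1)
    too_far = (cleaned[cols[i]] - prior_mean).abs() > threshold
    cleaned.loc[too_far, cols[i]] = np.nan

print(cleaned)
```

For the first row this removes %BilledWeek4, %BilledWeek5 and %BilledWeek6, which matches the behaviour anticipated in the question, including the loss of the week 6 ratio. When all preceding values have been removed, the mean is NaN, the comparison is False, and the current value is kept.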

0 answers