Outliers in pandas groupby

Time: 2020-10-15 10:43:03

Tags: pandas dataframe pandas-groupby

I have data like the following (toy data):

import pandas as pd
import numpy as np

N = 5
dfi = pd.DataFrame()
# Build five product types (P0..P4), each with N month-end observations
for i in range(5):
    # 'M' is the month-end frequency (spelled 'ME' in pandas >= 2.2)
    df = pd.DataFrame(index=pd.date_range("20100101", periods=N, freq='M'))
    df['price'] = np.random.randint(0, N, size=len(df))
    df['quantity'] = np.random.randint(0, N, size=len(df))
    df['type'] = 'P' + str(i)
    dfi = pd.concat([df, dfi], axis=0)
dfi

From this, I want to compute a new price for each type, defined as:

new_price(t) = (1 + perf(t)) * new_price(t-1)
with:
new_price(0) = price(0)
and
perf(t) = price(t)/price(t-1) - 1 if abs(price(t)/price(t-1) - 1) < s else 0

I tried:

dfi['prix_corr'] = (dfi
                   .sort_index()
                   .groupby('type').price
                   .apply(lambda x: x.pct_change() if x.pct_change().abs() <= 0.5 else 0)
                   )

But I get this error message:

ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

I would like to correct for outliers in the time series within each group.
Any suggestions?

1 Answer:

Answer 0 (score: 0)

Based on your input, you could apply a custom function to each group, for example:

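def compute_price_change(x):
    # Percentage change within one group; the first entry is NaN
    pct = x.pct_change()
    # Zero out changes whose magnitude exceeds the 0.5 threshold
    pct[pct.abs() > 0.5] = 0
    return pct

dfi['prix_corr'] = (dfi
                   .groupby('type').price
                   # transform returns a result aligned row by row with dfi
                   .transform(compute_price_change)
                   )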

Output:

            price  quantity  type  prix_corr
2010-01-31      3         0    P4        NaN
2010-02-28      3         2    P4        0.0
2010-03-31      0         2    P4       -0.5
2010-04-30      2         4    P4        0.5
2010-05-31      2         2    P4        0.0
2010-01-31      1         2    P3        NaN
2010-02-28      4         3    P3        0.0
2010-03-31      0         0    P3        0.0
2010-04-30      4         0    P3        0.0
2010-05-31      2         2    P3        0.0
...           ...       ...   ...        ...
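
For context, the ValueError in the question arises because `x.pct_change().abs() <= 0.5` is a whole boolean Series, which Python's `if ... else` cannot reduce to a single True/False. A minimal sketch of an element-wise alternative using `Series.where` follows. The helper name `filter_returns`, its threshold argument `s`, and the small hand-made frame are illustrative assumptions, not part of the original post.

```python
import pandas as pd

def filter_returns(prices: pd.Series, s: float = 0.5) -> pd.Series:
    """Period-over-period returns, with moves larger than s replaced by 0."""
    returns = prices.pct_change()
    # Series.where keeps a value where the condition is True and uses 0 elsewhere.
    # NaN <= s evaluates to False, so the leading NaN is kept explicitly via isna().
    keep = returns.abs().le(s) | returns.isna()
    return returns.where(keep, 0.0)

# Hypothetical toy data: two groups, one obvious jump in group "A"
toy = pd.DataFrame(
    {
        "price": [10.0, 11.0, 30.0, 33.0, 5.0, 5.5],
        "type": ["A", "A", "A", "A", "B", "B"],
    }
)
toy["perf"] = toy.groupby("type")["price"].transform(filter_returns)
print(toy)
```

Here the jump from 11 to 30 in group "A" is zeroed out, while the ordinary 10% moves are kept and the first row of each group stays NaN.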

Since .pct_change() returns NaN for the first entry of each group, you may also want to handle that case in some way.
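
One possible way to handle that leading NaN, and to get from the filtered returns to the corrected price described in the question, is to treat the first return as 0 and compound with a cumulative product, since new_price(t) = price(0) * (1 + perf(1)) * ... * (1 + perf(t)). The sketch below assumes that perf is the percentage change and uses the strict `< s` comparison from the question's formula. The function name `corrected_price`, the default `s`, and the sample prices are hypothetical, chosen only for illustration.

```python
import pandas as pd

def corrected_price(prices: pd.Series, s: float = 0.5) -> pd.Series:
    """Rebuild a price path in which returns with |return| >= s count as 0."""
    returns = prices.pct_change()
    # Keep a return only when it is strictly inside the threshold. The leading
    # NaN fails the comparison too, so it becomes 0 along with any outlier.
    perf = returns.where(returns.abs() < s, 0.0)
    # new_price(0) = price(0); each later step multiplies by (1 + perf).
    return prices.iloc[0] * (1.0 + perf).cumprod()

# Hypothetical data shaped like the question: repeated dates, one block per type
dates = pd.to_datetime(["2010-01-31", "2010-02-28", "2010-03-31", "2010-04-30"])
toy = pd.concat(
    [
        pd.DataFrame({"price": [100.0, 110.0, 300.0, 330.0], "type": "P0"}, index=dates),
        pd.DataFrame({"price": [50.0, 20.0, 22.0, 11.0], "type": "P1"}, index=dates),
    ]
)
toy["prix_corr"] = toy.groupby("type")["price"].transform(corrected_price)
print(toy)
```

Using `transform` rather than `apply` keeps the result aligned with the original rows even though the date index repeats across types. Note that zero prices, which the random toy data can produce, give infinite percentage changes; those fail the threshold test and are treated as outliers here.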