How to replace outliers using groupby?

Time: 2020-10-19 14:00:05

Tags: python pandas dataframe group-by

Hi, here is my (toy) data:

import pandas as pd

data = {'p1': [100., 101, 102, 100, 100],
        'p2': [100., 99., 98., 100., 100],
        'p3': [1000., 1000., 100., 1000., 1000]
        }
df = (pd.DataFrame(data, index=pd.bdate_range(start='20100101', periods=5))
      .stack()
      .reset_index()
      .rename(columns={'level_0': 'date', 'level_1': 'type', 0: 'price'})
      .sort_values('date')
      )
# group_keys=False keeps the original row index so the result aligns with df
df['perf'] = df.groupby('type', group_keys=False)['price'].apply(lambda x: x.pct_change(1))
df.sort_values('type')

It looks like this:

0   2010-01-01  p1  100.0   NaN
3   2010-01-04  p1  101.0   0.010000
6   2010-01-05  p1  102.0   0.009901
9   2010-01-06  p1  100.0   -0.019608
12  2010-01-07  p1  100.0   0.000000
1   2010-01-01  p2  100.0   NaN
4   2010-01-04  p2  99.0    -0.010000
7   2010-01-05  p2  98.0    -0.010101
10  2010-01-06  p2  100.0   0.020408
13  2010-01-07  p2  100.0   0.000000
2   2010-01-01  p3  1000.0  NaN
5   2010-01-04  p3  1000.0  0.000000
8   2010-01-05  p3  100.0   -0.900000  -> outlier
11  2010-01-06  p3  1000.0  9.000000   -> outlier
14  2010-01-07  p3  1000.0  0.000000

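I would like to replace these (2) values with the mean or median of the perf column, computed without these data points. That is, with some earlier help I came up with the following: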

# perf for each type
df['perf'] = df.groupby('type', group_keys=False)['price'].apply(lambda x: x.pct_change(1))

# Outliers & replace value with median by date
outliers = df.groupby('type', group_keys=False)['price'].apply(lambda x: (x.pct_change(1).abs() >= 0.5))
df.loc[outliers, "perf"] = (df[~outliers]
                            .groupby('date')
                            .median(numeric_only=True)
                            .loc[df.loc[outliers, "date"], "perf"]
                            .values
                            )

# New price with the same initial value of the prices but with perf corrected
df['price2'] = (df.groupby('type')['price'].transform(lambda x: x.iloc[0])
                .mul(df.groupby('type', group_keys=False)['perf'].apply(lambda x: (1 + x).cumprod()),
                     fill_value=1))

df.sort_values('type')

However, the end result is not very "nice". Is there a way to improve my code, for example with a function?

2 answers:

Answer 0 (score: 0)

This should work.

# Filter for outliers
outliers = df['perf'].abs() >= 0.5

# Create DataFrame for the mean of each date
dt_mean = df.groupby('date')['perf'].mean().to_frame().copy()

# Reset index
dt_mean.reset_index(inplace=True) 

# Set outliers equal to merger of outliers and mean DataFrame
df.loc[outliers,'perf'] = list(pd.merge(df.loc[outliers, ['date', 'type', 'price']],dt_mean, on='date')['perf'])

    date       type price   perf
0   2010-01-01  p1  100.0   NaN
1   2010-01-01  p2  100.0   NaN
2   2010-01-01  p3  1000.0  NaN
3   2010-01-04  p1  101.0   0.010000
4   2010-01-04  p2  99.0    -0.010000
5   2010-01-04  p3  1000.0  0.000000
6   2010-01-05  p1  102.0   0.009901
7   2010-01-05  p2  98.0    -0.010101
8   2010-01-05  p3  100.0   -0.300067
9   2010-01-06  p1  100.0   -0.019608
10  2010-01-06  p2  100.0   0.020408
11  2010-01-06  p3  1000.0  3.000267
12  2010-01-07  p1  100.0   0.000000
13  2010-01-07  p2  100.0   0.000000
14  2010-01-07  p3  1000.0  0.000000
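
Note that the replaced values in the output above (-0.300067 and 3.000267) appear to be per-date means that still include the outlier itself, for example (0.009901 - 0.010101 - 0.9) / 3. A minimal sketch of one way to exclude the outliers before averaging, assuming the same toy data as the question (rebuilt here with melt so the block is self-contained); the names wide, clean and date_mean are illustrative and not part of the answer:

```python
import pandas as pd

# Same toy data as the question, reshaped to long form with melt.
data = {'p1': [100., 101, 102, 100, 100],
        'p2': [100., 99., 98., 100., 100],
        'p3': [1000., 1000., 100., 1000., 1000]}
wide = pd.DataFrame(data, index=pd.bdate_range(start='20100101', periods=5))
df = (wide.rename_axis('date')
      .reset_index()
      .melt(id_vars='date', var_name='type', value_name='price')
      .sort_values(['date', 'type'], ignore_index=True))

# Per-type returns, computed within each group.
df['perf'] = df.groupby('type')['price'].pct_change()

# Flag outliers, blank them out, then average what is left on each date.
outliers = df['perf'].abs() >= 0.5
clean = df['perf'].mask(outliers)
date_mean = clean.groupby(df['date']).transform('mean')

# Keep the original perf except where it was flagged as an outlier.
df['perf'] = df['perf'].where(~outliers, date_mean)

print(df.sort_values('type'))
```

With this variant the two p3 outliers are replaced by the mean of the p1 and p2 returns on the same date, which seems closer to what the question describes.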

Answer 1 (score: 0)

How about a direct .loc[] query on the mean dataframe?

Note that your per-date mean dataframe (df_mean) is already indexed by date, and creating it seems unavoidable. So just use its date index directly.

outliers = df.groupby('type', group_keys=False)['price'].apply(lambda x: (x.pct_change(1).abs() >= 0.5))
df_mean = df[~outliers].groupby('date').mean(numeric_only=True)
fill_values = df_mean.loc[df.loc[outliers, "date"], "perf"].values
df.loc[outliers, "perf"] = fill_values  # broadcast

df.sort_values('type')

Out[114]:
    date       type price   perf
0   2010-01-01  p1  100.0   NaN
3   2010-01-04  p1  101.0   0.010000
6   2010-01-05  p1  102.0   0.009901
9   2010-01-06  p1  100.0   -0.019608
12  2010-01-07  p1  100.0   0.000000
1   2010-01-01  p2  100.0   NaN
4   2010-01-04  p2  99.0    -0.010000
7   2010-01-05  p2  98.0    -0.010101
10  2010-01-06  p2  100.0   0.020408
13  2010-01-07  p2  100.0   0.000000
2   2010-01-01  p3  1000.0  NaN
5   2010-01-04  p3  1000.0  0.000000
8   2010-01-05  p3  100.0   -0.000100  <- replaced by mean
11  2010-01-06  p3  1000.0  0.000400   <- replaced by mean
14  2010-01-07  p3  1000.0  0.000000
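
The question also asks whether the whole procedure can be wrapped in a function. The following is only a sketch under stated assumptions: the helper name replace_outliers and its parameters threshold and how are hypothetical, the input is assumed to have date, type and price columns as in the toy data, and groupby(...).pct_change(), transform and cumprod are used in place of the apply calls.

```python
import pandas as pd


def replace_outliers(df, threshold=0.5, how='median'):
    """Hypothetical helper: replace outlier returns by a per-date statistic.

    Returns a copy of df sorted by type and date, with a corrected 'perf'
    column and a rebuilt 'price2' column.
    """
    out = df.sort_values(['type', 'date']).copy()

    # Return of each type relative to its previous observation.
    perf = out.groupby('type')['price'].pct_change()

    # A return is treated as an outlier when its magnitude reaches threshold.
    outliers = perf.abs() >= threshold

    # Per-date statistic ('mean' or 'median') over the non-outlier returns.
    fill = perf.mask(outliers).groupby(out['date']).transform(how)
    out['perf'] = perf.where(~outliers, fill)

    # Rebuild prices from each type's first price and the corrected returns.
    first_price = out.groupby('type')['price'].transform('first')
    growth = (1 + out['perf'].fillna(0)).groupby(out['type']).cumprod()
    out['price2'] = first_price * growth
    return out


# Usage on the toy data from the question (rebuilt here with melt).
data = {'p1': [100., 101, 102, 100, 100],
        'p2': [100., 99., 98., 100., 100],
        'p3': [1000., 1000., 100., 1000., 1000]}
wide = pd.DataFrame(data, index=pd.bdate_range(start='20100101', periods=5))
df = (wide.rename_axis('date')
      .reset_index()
      .melt(id_vars='date', var_name='type', value_name='price'))

result = replace_outliers(df, threshold=0.5, how='median')
print(result)
```

If every series is an outlier on a given date, there is nothing left to average and the replacement stays NaN; the sketch treats such a return as 0 when rebuilding price2, which may or may not be the desired behavior.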