Winsorize a DataFrame by group

Time: 2019-11-22 12:43:19

Tags: python pandas group-by pandas-groupby statsmodels

I put together the following reproducible example:

import pandas as pd
from scipy.stats import mstats

col1 = pd.Series(['2016-04-30'] * 12 + ['2016-05-31'] * 12)
col2 = pd.Series(['Discr','Discr','Discr','Discr','Discr','Discr', 'Adv', 'Adv', 'Adv', 'Adv', 'Adv', 'Adv','Discr','Discr','Discr','Discr','Discr','Discr','Adv', 'Adv', 'Adv', 'Adv', 'Adv', 'Adv'])
col3 = pd.Series(['Eq', 'Eq', 'Eq' , 'Bond','Bond','Bond', 'Eq', 'Eq', 'Eq' , 'Bond','Bond','Bond', 'Eq', 'Eq', 'Eq' , 'Bond','Bond','Bond', 'Eq', 'Eq', 'Eq' , 'Bond','Bond','Bond'])
col4 = pd.Series([5,3,200, 5,7,23,5,4,21,68,45,324,32,4,78,2,45,2,56,3,5,7,22,45])
Example = pd.DataFrame(data = pd.concat([col1,col2,col3,col4], axis=1))
Example.columns =  ['Date', 'InType', 'AType', 'Value']

It looks like this: [image: the resulting DataFrame]

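I want to winsorize the 'Value' column at the 1% level, after first grouping by 'Date', 'InType' and 'AType'. For example, the first group I want to winsorize has Date 2016-04-30, InType = Discr and AType = Eq. In that case, I would like 200 to be set to 5. I want to do this for every group separately.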
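
To make the target concrete, here is a minimal sketch of how the `limits` argument of `scipy.stats.mstats.winsorize` behaves on the first group. Each fraction is multiplied by the group size and truncated to a whole number of values, so with only three values a 1% limit clips nothing. The 0.5 upper limit is just an illustrative choice that clips exactly one value here.

```python
import numpy as np
from scipy.stats import mstats

# Values of the first group: Date 2016-04-30, InType = Discr, AType = Eq
first_group = np.array([5, 3, 200])

# 1% per tail: int(0.01 * 3) == 0, so no value is replaced
one_percent = np.asarray(mstats.winsorize(first_group, limits=(0.01, 0.01)))

# 50% on the upper tail only: int(0.5 * 3) == 1, so the largest value
# is replaced by the next largest one
upper_half = np.asarray(mstats.winsorize(first_group, limits=(0.0, 0.5)))

print(one_percent.tolist())  # [5, 3, 200]
print(upper_half.tolist())   # [5, 3, 5]
```

On groups this small, the 1% level only has an effect once a group holds at least 100 values.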

Here is what I have tried so far:

def using_mstats_df(df):
    return df.apply(using_mstats, axis=0)

def using_mstats(s):
    return mstats.winsorize(s, limits=[0.0, 0.5])
grouped = Example.groupby(['Date', 'InType', 'AType'])
grouped.apply(using_mstats_df)

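This seems to work, but when I try it on my actual (large) dataset, I get a long error that ends with

ValueError: cannot reindex from a duplicate axis

Does anyone know what I might be doing wrong, or whether I should approach this differently?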

1 answer:

Answer 0: (score: 1)

Here is a working example (though I am not 100% sure about the winsorizing itself)

import pandas as pd
import scipy.stats

col1 = pd.Series(['2016-04-30','2016-04-30','2016-04-30','2016-04-30','2016-04-30','2016-04-30','2016-04-30','2016-04-30','2016-04-30','2016-04-30','2016-04-30','2016-04-30','2016-05-31','2016-05-31','2016-05-31','2016-05-31','2016-05-31','2016-05-31','2016-05-31','2016-05-31','2016-05-31','2016-05-31','2016-05-31','2016-05-31'])
col2 = pd.Series(['Discr','Discr','Discr','Discr','Discr','Discr', 'Adv', 'Adv', 'Adv', 'Adv', 'Adv', 'Adv','Discr','Discr','Discr','Discr','Discr','Discr','Adv', 'Adv', 'Adv', 'Adv', 'Adv', 'Adv'])
col3 = pd.Series(['Eq', 'Eq', 'Eq' , 'Bond','Bond','Bond', 'Eq', 'Eq', 'Eq' , 'Bond','Bond','Bond', 'Eq', 'Eq', 'Eq' , 'Bond','Bond','Bond', 'Eq', 'Eq', 'Eq' , 'Bond','Bond','Bond'])
col4 = pd.Series([5,3,200, 5,7,23,5,4,21,68,45,324,32,4,78,2,45,2,56,3,5,7,22,45])
df = pd.DataFrame(data = pd.concat([col1,col2,col3,col4], axis=1))
df.columns =  ['Date', 'InType', 'AType', 'Value']

# sort the df so its row order matches the order in which groupby yields the groups
df = df.sort_values(['Date', 'InType', 'AType'])

# empty list to store the values column after winsorization
winsorized_values = []

# winsorize every group
for name, group in df.groupby(['Date', 'InType', 'AType']):
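    # limits are the fractions to clip from each tail, so 1% per side is [0.01, 0.01];
    # with only 3 values per group this rounds down to zero clipped values
    winsorized_values.append(list(scipy.stats.mstats.winsorize(group.Value.values, limits=[0.01, 0.01])))

# append the winsorized values to the dataframe, after flattening the list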
df['winsorized_values'] = [item for sublist in winsorized_values for item in sublist]
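
An alternative sketch that avoids the manual sort and list flattening is `groupby(...).transform`, which returns a result aligned with the original rows. The `reset_index(drop=True)` step rests on an assumption about the cause of the `ValueError` in the question: that error typically appears when the index of the real dataset contains duplicate labels, which a fresh integer index removes. The helper name `winsorize_group` and the small stand-in DataFrame are illustrative, not part of the original code, and the 0.5 upper limit is chosen only so the effect is visible on three-value groups.

```python
import numpy as np
import pandas as pd
from scipy.stats import mstats

def winsorize_group(values, limits=(0.01, 0.01)):
    """Winsorize one group's values; limits are the fractions clipped per tail."""
    return np.asarray(mstats.winsorize(values.to_numpy(), limits=limits))

# Small stand-in for the real data, with a deliberately duplicated index
df = pd.DataFrame(
    {
        "Date": ["2016-04-30"] * 6,
        "InType": ["Discr"] * 6,
        "AType": ["Eq", "Eq", "Eq", "Bond", "Bond", "Bond"],
        "Value": [5, 3, 200, 5, 7, 23],
    },
    index=[0, 0, 1, 1, 2, 2],
)

# Assumption: duplicate index labels are what trigger the reindexing error
df = df.reset_index(drop=True)

# transform keeps the original row order, so no sorting is needed
df["winsorized_values"] = df.groupby(["Date", "InType", "AType"])["Value"].transform(
    lambda s: winsorize_group(s, limits=(0.0, 0.5))
)
print(df)
```

For 1% per tail on larger groups, the default `limits=(0.01, 0.01)` of the helper can be used instead.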