Question

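I want to winsorize my sample at the 1% and 99% levels, so I used scipy to do it. After winsorizing, the maximum of my sample is still far larger than the 99th-percentile value, and I don't understand why. My sample looks like this: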

Total Sales         Assets     Market value 
1000                 123        4892  
1232                 12         NaN
125                  1569       156

I used:

import scipy.stats as sp

for col in df.columns: 
     sp.mstats.winsorize(df[col], limits=0.01, inplace=True)

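After running this code, the maximum value in the sample is still greater than the 99th-percentile value. I think I made a mistake somewhere, but I can't tell where.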

Answer 1

The problem is the in-place operation. Instead, assign the winsorized column back:

from scipy import stats

for col in df.columns:
    # winsorize returns the clipped values; store them back in the column
    df[col] = stats.mstats.winsorize(df[col], limits=0.01)

Sample data:

import numpy as np
import pandas as pd
from scipy import stats

df = pd.DataFrame(np.random.randint(1, 10000, (500000, 2)))
print(df.describe())
#                   0              1
#count  500000.000000  500000.000000
#mean     4993.512288    5004.678502
#std      2888.254381    2884.128073
#min         1.000000       1.000000
#25%      2486.000000    2513.000000
#50%      4985.000000    5005.000000
#75%      7492.000000    7502.000000
#max      9999.000000    9999.000000

# inplace doesn't change anything when looping over columns:
for col in df.columns: 
     stats.mstats.winsorize(df[col], limits=0.01, inplace=True)
print(df.describe())
#                   0              1
#count  500000.000000  500000.000000
#mean     4993.512288    5004.678502
#std      2888.254381    2884.128073
#min         1.000000       1.000000
#25%      2486.000000    2513.000000
#50%      4985.000000    5005.000000
#75%      7492.000000    7502.000000
#max      9999.000000    9999.000000

for col in df.columns: 
     df[col] = stats.mstats.winsorize(df[col], limits=0.01)
print(df.describe())
#                   0              1
#count  500000.000000  500000.000000
#mean     4993.505330    5004.690118
#std      2886.521538    2882.414353
#min       101.000000     101.000000
#25%      2486.000000    2513.000000
#50%      4985.000000    5005.000000
#75%      7492.000000    7502.000000
#max      9899.000000    9901.000000
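
To illustrate why assigning the result back works, here is a minimal sketch on a plain NumPy array (the array `a` and the seed are made up for illustration). With the default `inplace=False`, `winsorize` leaves its input untouched and returns a new masked array holding the clipped values, so the result is only kept if you store it somewhere.

```python
import numpy as np
from scipy import stats

# Hypothetical data: the integers 1..1000 in shuffled order.
rng = np.random.default_rng(0)
a = rng.permutation(np.arange(1, 1001)).astype(float)
original = a.copy()

# A single float for `limits` is applied to both tails (1% at each end).
w = stats.mstats.winsorize(a, limits=0.01)

print(type(w).__name__)   # the return value is a NumPy masked array
print(a.min(), a.max())   # the input array is unchanged
print(w.min(), w.max())   # the extremes have been pulled inward
```

The same reasoning applies to the loop above: `df[col] = ...` is what writes the clipped values into the DataFrame.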

Answer 2

If you have a similar problem, see "Winsorizing data by column in pandas with NaN". That link solved this problem perfectly. Many thanks!
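
Since the sample in the question contains NaN, one possible pandas-only approach is sketched below. This is not necessarily what the linked answer does; it is an assumption-based alternative that clips each column to its own 1% and 99% quantiles. The helper name `winsorize_with_nan` and the test data are hypothetical. Note that `quantile` interpolates between values, so the bounds can differ slightly from the rank-based cut used by `scipy.stats.mstats.winsorize`.

```python
import numpy as np
import pandas as pd

def winsorize_with_nan(df, lower=0.01, upper=0.99):
    """Clip every column to its [lower, upper] quantiles, ignoring NaN."""
    out = df.copy()
    for col in out.columns:
        lo = out[col].quantile(lower)   # quantile() skips NaN by default
        hi = out[col].quantile(upper)
        out[col] = out[col].clip(lower=lo, upper=hi)  # NaN entries stay NaN
    return out

# Hypothetical data with some missing values in the second column.
rng = np.random.default_rng(0)
df = pd.DataFrame(
    rng.integers(1, 10000, (1000, 2)).astype(float),
    columns=["Total Sales", "Assets"],
)
df.iloc[::50, 1] = np.nan

clipped = winsorize_with_nan(df)
print(clipped.describe())
```

Because the result is returned as a new DataFrame, the original `df` is left as it was and the missing values are neither dropped nor counted as extremes.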
