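I want to winsorize my sample at the 1% and 99% levels, so I used scipy to do it. After winsorizing, the maximum of my sample is still far larger than the 99th percentile value. Why does this happen? My sample looks like this: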
Total Sales    Assets    Market value
1000           123       4892
1232           12        NaN
125            1569      156
I used:
import scipy.stats as sp
for col in df.columns:
    sp.mstats.winsorize(df[col], limits=0.01, inplace=True)
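After running this code, the maximum in the sample is still larger than the 99th percentile value. I think I have made a mistake somewhere, but I cannot see where.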
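Answer 0: (score: 1)
The problem is the in-place operation: when looping over the columns, it leaves the DataFrame unchanged. Instead, assign the result back to the column:
for col in df.columns:
    df[col] = stats.mstats.winsorize(df[col], limits=0.01)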
import numpy as np
import pandas as pd
from scipy import stats
df = pd.DataFrame(np.random.randint(1, 10000, (500000, 2)))
print(df.describe())
# 0 1
#count 500000.000000 500000.000000
#mean 4993.512288 5004.678502
#std 2888.254381 2884.128073
#min 1.000000 1.000000
#25% 2486.000000 2513.000000
#50% 4985.000000 5005.000000
#75% 7492.000000 7502.000000
#max 9999.000000 9999.000000
# inplace doesn't change anything when looping over columns:
for col in df.columns:
    stats.mstats.winsorize(df[col], limits=0.01, inplace=True)
print(df.describe())
# 0 1
#count 500000.000000 500000.000000
#mean 4993.512288 5004.678502
#std 2888.254381 2884.128073
#min 1.000000 1.000000
#25% 2486.000000 2513.000000
#50% 4985.000000 5005.000000
#75% 7492.000000 7502.000000
#max 9999.000000 9999.000000
# assigning the result back to the column does change the data:
for col in df.columns:
    df[col] = stats.mstats.winsorize(df[col], limits=0.01)
print(df.describe())
# 0 1
#count 500000.000000 500000.000000
#mean 4993.505330 5004.690118
#std 2886.521538 2882.414353
#min 101.000000 101.000000
#25% 2486.000000 2513.000000
#50% 4985.000000 5005.000000
#75% 7492.000000 7502.000000
#max 9899.000000 9901.000000
Answer 1: (score: 0)
If you have a similar problem, see the question "Winsorizing data by column in pandas with NaN". That link solved this problem perfectly. Many thanks!
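For columns that contain NaN, such as Market value in the question, one possible approach is to clip each column at its own quantiles using pandas alone. This is only a sketch and not necessarily what the linked answer does: it swaps scipy's winsorize for Series.quantile plus Series.clip. The helper name winsorize_columns and the random test data are made up for illustration. Series.quantile skips NaN by default, and clip leaves NaN entries untouched.

```python
import numpy as np
import pandas as pd

def winsorize_columns(df, lower=0.01, upper=0.99):
    """Hypothetical helper: clip each column to its own [lower, upper]
    quantiles and return a new DataFrame. NaN values stay NaN."""
    out = df.copy()
    for col in out.columns:
        lo = out[col].quantile(lower)  # quantile() ignores NaN by default
        hi = out[col].quantile(upper)
        out[col] = out[col].clip(lower=lo, upper=hi)
    return out

# Illustrative data only (not the asker's dataset).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "Total Sales": rng.integers(1, 10000, 1000).astype(float),
    "Market value": rng.integers(1, 10000, 1000).astype(float),
})
df.loc[::50, "Market value"] = np.nan  # inject some missing values

clipped = winsorize_columns(df)
print(clipped.describe())
```

Because the result is returned and assigned rather than modified in place, this avoids the in-place pitfall from the first answer. Note that quantile interpolates between data points, so the cut-offs can differ slightly from those chosen by scipy's winsorize.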