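I want to winsorize several columns of data in a pandas DataFrame. Each column contains some NaN values, which affect the winsorization, so they need to be excluded. The only way I know to do this is to drop every row that has a NaN in any column, rather than excluding NaN values column by column.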
MWE:
import numpy as np
import pandas as pd
from scipy.stats.mstats import winsorize
# Create Dataframe
N, M, P = 10**5, 4, 10**2
dates = pd.date_range('2001-01-01', periods=N//P, freq='D').repeat(P)
df = pd.DataFrame(np.random.random((N, M)), index=dates)
df.index.names = ['DATE']
df.columns = ['one','two','three','four']
# Now scale them differently so you can see the winsorization
df['four'] = df['four']*(10**5)
df['three'] = df['three']*(10**2)
df['two'] = df['two']*(10**-1)
df['one'] = df['one']*(10**-4)
# Create NaN
df.loc[df.index.get_level_values(0).year == 2002,'three'] = np.nan
df.loc[df.index.get_level_values(0).month == 2,'two'] = np.nan
df.loc[df.index.get_level_values(0).month == 1,'one'] = np.nan
Here is the baseline distribution:
df.quantile([0, 0.01, 0.5, 0.99, 1])
Output:
one two three four
0.00 2.336618e-10 2.294259e-07 0.002437 2.305353
0.01 9.862626e-07 9.742568e-04 0.975807 1003.814520
0.50 4.975859e-05 4.981049e-02 50.290946 50374.548980
0.99 9.897463e-05 9.898590e-02 98.978263 98991.438985
1.00 9.999983e-05 9.999966e-02 99.996793 99999.437779
This is how I am winsorizing:
def using_mstats(s):
    return winsorize(s, limits=[0.01, 0.01])
wins = df.apply(using_mstats, axis=0)
wins.quantile([0, 0.01, 0.25, 0.5, 0.75, 0.99, 1])
Which gives this:
Out[356]:
one two three four
0.00 0.000001 0.001060 1.536882 1003.820149
0.01 0.000001 0.001060 1.536882 1003.820149
0.25 0.000025 0.024975 25.200378 25099.994780
0.50 0.000050 0.049810 50.290946 50374.548980
0.75 0.000075 0.074842 74.794537 75217.343920
0.99 0.000099 0.098986 98.978263 98991.436957
1.00 0.000100 0.100000 99.996793 98991.436957
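Column four is correct because it has no NaN values, but the other columns are not: after winsorizing, the 99th percentile and the max should be the same, and for those columns they are not. The observation counts are identical before and after: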
In [357]: df.count()
Out[357]:
one 90700
two 91600
three 63500
four 100000
dtype: int64
In [358]: wins.count()
Out[358]:
one 90700
two 91600
three 63500
four 100000
dtype: int64
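One plausible reading of the output above, offered as an assumption rather than a confirmed diagnosis of scipy's internals: NumPy sorts NaN to the end of an array, so a routine that replaces the top 1% of sorted positions would mostly touch NaN slots instead of the largest real values. The toy array below is hypothetical and only illustrates that sort order.

```python
import numpy as np

# Hypothetical toy column: five real values and three NaNs.
values = np.array([0.3, np.nan, 0.9, 0.1, np.nan, 0.5, np.nan, 0.7])

# argsort places NaN entries after every finite value.
order = np.argsort(values)
sorted_values = values[order]

# The last n_nan sorted slots are the NaNs, so the "top" of the sorted
# array is not where the largest real observations live.
n_nan = int(np.isnan(values).sum())
tail = sorted_values[-n_nan:]
largest_real = np.nanmax(values)

print(sorted_values)
print("tail is all NaN:", bool(np.isnan(tail).all()))
print("largest real value:", largest_real)
```

If this is what happens, it would also be consistent with the lower bound being too aggressive in columns one, two and three, since the 1% cutoff would be counted against the full column length rather than the number of non-NaN values.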
This is the only fix I have found. It works, but at the cost of losing a lot of my data:
wins2 = df.loc[df.notnull().all(axis=1)].apply(using_mstats, axis=0)
wins2.quantile([0, 0.01, 0.25, 0.5, 0.75, 0.99, 1])
Output:
Out[360]:
one two three four
0.00 9.686203e-07 0.000928 0.965702 1005.209503
0.01 9.686203e-07 0.000928 0.965702 1005.209503
0.25 2.486052e-05 0.024829 25.204032 25210.837443
0.50 4.980946e-05 0.049894 50.299004 50622.227179
0.75 7.492750e-05 0.075059 74.837900 75299.906415
0.99 9.895563e-05 0.099014 98.972310 99014.311761
1.00 9.895563e-05 0.099014 98.972310 99014.311761
In [361]: wins2.count()
Out[361]:
one 51700
two 51700
three 51700
four 51700
dtype: int64
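How can I winsorize only the non-NaN data in each column, column by column, while keeping the shape of the data (i.e. without removing rows)?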
Answer 0 (score: 3)
As so often happens, simply creating the MWE helped clarify things. I need to use clip() in combination with quantile(), as follows:
df2 = df.clip(lower=df.quantile(0.01), upper=df.quantile(0.99), axis=1)
df2.quantile([0, 0.01, 0.25, 0.5, 0.75, 0.99, 1])
Output:
one two three four
0.00 9.862626e-07 0.000974 0.975807 1003.814520
0.01 9.862666e-07 0.000974 0.975816 1003.820092
0.25 2.485043e-05 0.024975 25.200378 25099.994780
0.50 4.975859e-05 0.049810 50.290946 50374.548980
0.75 7.486737e-05 0.074842 74.794537 75217.343920
0.99 9.897462e-05 0.098986 98.978245 98991.436977
1.00 9.897463e-05 0.098986 98.978263 98991.438985
In [384]: df2.count()
Out[384]:
one 90700
two 91600
three 63500
four 100000
dtype: int64
These numbers differ from the earlier outputs because every non-missing (non-NaN) value in each column is kept.
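The clip() plus quantile() approach can be wrapped in a small helper and checked on a toy frame. This is a minimal sketch: the name `clip_to_quantiles`, its parameters, and the `demo` frame are hypothetical and not part of the original post.

```python
import numpy as np
import pandas as pd


def clip_to_quantiles(frame, lower_q=0.01, upper_q=0.99):
    """Clip each column to its own quantiles; NaN entries stay NaN."""
    # DataFrame.quantile skips NaN by default, so each bound is computed
    # from that column's non-missing values only.
    lower = frame.quantile(lower_q)
    upper = frame.quantile(upper_q)
    # axis=1 aligns the per-column bounds with the column labels.
    return frame.clip(lower=lower, upper=upper, axis=1)


# Hypothetical demo frame with a different NaN pattern in each column.
rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.random((1000, 2)), columns=["a", "b"])
demo.loc[demo.index[:300], "a"] = np.nan
demo.loc[demo.index[500:600], "b"] = np.nan

clipped = clip_to_quantiles(demo)

# Shape and NaN pattern are unchanged; only the tails are pulled in.
print(clipped.shape == demo.shape)
print(clipped.isna().equals(demo.isna()))
```

Note that quantile() interpolates linearly by default, so the bounds can differ slightly from the order statistics that a rank-based winsorization would use.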