Counting windows of NaNs (and their sizes) in DataFrame columns

Time: 2019-08-30 16:12:55

Tags: python python-3.x dataframe nan

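I have huge DataFrames (with sizes in the millions and tens of thousands) and many missing (NaN) values in the columns. I need the fastest possible way to count, for each column, the windows of consecutive NaNs and the size of each window (my own code is too slow).

Something like this. Starting from here: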

import pandas as pd
import numpy as np
df = pd.DataFrame({'a': [1, 2, np.nan, np.nan, 3, 3, np.nan, 4, np.nan, np.nan],
                   'b': [np.nan, 2, 1, 1, 3, 3, np.nan, np.nan, 2, np.nan],
                   'c': [np.nan, 2, 1, np.nan, 3, 3, np.nan, np.nan, 2, 8]})

df
Out[65]: 
     a    b    c
0  1.0  NaN  NaN
1  2.0  2.0  2.0
2  NaN  1.0  1.0
3  NaN  1.0  NaN
4  3.0  3.0  3.0
5  3.0  3.0  3.0
6  NaN  NaN  NaN
7  4.0  NaN  NaN
8  NaN  2.0  2.0
9  NaN  NaN  8.0

to here (one row per NaN window, holding that window's size):

result
Out[61]: 
    a  b  c
 0  2  1  1
 1  1  2  1
 2  2  1  2

2 answers:

Answer 0 (score: 0)

Here is one approach:

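import pandas as pd
import numpy as np

df = pd.DataFrame({'a': [1, 2, np.nan, np.nan, 3, 3, np.nan, 4, np.nan, np.nan],
                   'b': [np.nan, 2, 1, 1, 3, 3, np.nan, np.nan, 2, np.nan],
                   'c': [np.nan, 2, 1, np.nan, 3, 3, np.nan, np.nan, 2, 8]})

# Boolean mask: True wherever the original value is NaN
df_n = df.isnull()

pr = {}
for column_name in df_n.columns:
    is_nan = df_n[column_name]
    # First row of each window: NaN here, but not NaN in the previous row.
    # fill_value=False keeps the shifted Series boolean, so ~ behaves as logical NOT.
    fst = df_n.index[is_nan & ~is_nan.shift(1, fill_value=False)]
    # Last row of each window: NaN here, but not NaN in the next row
    lst = df_n.index[is_nan & ~is_nan.shift(-1, fill_value=False)]
    # Window size = last - first + 1 (relies on the default 0..n-1 integer index)
    pr[column_name] = [j - i + 1 for i, j in zip(fst, lst)]

# Note: this constructor needs every column to have the same number of windows
df_new = pd.DataFrame(pr)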

Output:

    a   b   c
0   2   1   1
1   1   2   1
2   2   1   2
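
Since the question asks for speed, the same start/end idea can also be sketched directly on NumPy arrays, which avoids building shifted Series for every column. This is only a sketch under assumptions: the helper name `nan_run_lengths` is hypothetical (not part of pandas or NumPy), all columns are assumed to be numeric, and no timing has been measured here.

```python
import numpy as np
import pandas as pd


def nan_run_lengths(values):
    """Hypothetical helper: lengths of consecutive-NaN runs in a 1-D array."""
    mask = np.isnan(np.asarray(values, dtype=float))
    # Pad with False on both sides so every run has a rising and a falling edge.
    padded = np.concatenate(([False], mask, [False]))
    edges = np.diff(padded.astype(np.int8))
    starts = np.flatnonzero(edges == 1)   # position where a run begins
    ends = np.flatnonzero(edges == -1)    # position just past where it ends
    return ends - starts


df = pd.DataFrame({'a': [1, 2, np.nan, np.nan, 3, 3, np.nan, 4, np.nan, np.nan],
                   'b': [np.nan, 2, 1, 1, 3, 3, np.nan, np.nan, 2, np.nan],
                   'c': [np.nan, 2, 1, np.nan, 3, 3, np.nan, np.nan, 2, 8]})

# Wrapping each result in a Series lets columns with different numbers of
# windows coexist; shorter columns are then padded with NaN.
result = pd.DataFrame({col: pd.Series(nan_run_lengths(df[col].to_numpy()))
                       for col in df.columns})
print(result)
```

For the sample frame this reproduces the table above. Whether it is actually faster on a frame with millions of rows would need to be checked with a benchmark on the real data.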

Answer 1 (score: 0)

Try this (shown for column a only; the other columns work the same way):

>>> df = df.assign(a_count_sum=0)
>>> df.loc[np.isnan(df["a"]), "a_count_sum"] = df.groupby(np.isnan(df.a)).cumcount() + 1
>>> df
     a    b    c  a_count_sum
0  1.0  NaN  NaN            0
1  2.0  2.0  2.0            0
2  NaN  1.0  1.0            1
3  NaN  1.0  NaN            2
4  3.0  3.0  3.0            0
5  3.0  3.0  3.0            0
6  NaN  NaN  NaN            3
7  4.0  NaN  NaN            0
8  NaN  2.0  2.0            4
9  NaN  NaN  8.0            5
>>> res_1 = df["a_count_sum"][((df["a_count_sum"].shift(-1) == 0) | (np.isnan(df["a_count_sum"].shift(-1)))) & (df["a_count_sum"]!=0)]
>>> res_1
3    2
6    3
9    5
Name: a_count_sum, dtype: int64
>>> res_2 = (-res_1.shift(1).fillna(0)).astype(np.int64)
>>> res_2
3    0
6   -2
9   -3
Name: a_count_sum, dtype: int64
>>> res=res_1+res_2
>>> res
3    2
6    1
9    2
Name: a_count_sum, dtype: int64
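
The same counting idea can be extended to all columns without a helper column. The sketch below swaps in a different but common technique: a cumulative sum of the non-NaN flags labels each run of NaNs, and a groupby then counts the rows per label. The function name `nan_window_sizes` is hypothetical, and this is one possible way to generalise the steps above rather than the answer's own method.

```python
import numpy as np
import pandas as pd


def nan_window_sizes(s):
    """Hypothetical helper: sizes of consecutive-NaN windows in a Series."""
    is_nan = s.isna()
    # The counter only increases on non-NaN rows, so all NaNs that belong
    # to the same window share one label.
    run_id = (~is_nan).cumsum()
    labels = run_id[is_nan]
    # Count rows per label; groupby sorts labels, which keeps window order.
    return labels.groupby(labels).size().reset_index(drop=True)


df = pd.DataFrame({'a': [1, 2, np.nan, np.nan, 3, 3, np.nan, 4, np.nan, np.nan],
                   'b': [np.nan, 2, 1, 1, 3, 3, np.nan, np.nan, 2, np.nan],
                   'c': [np.nan, 2, 1, np.nan, 3, 3, np.nan, np.nan, 2, 8]})

# One Series per column, aligned side by side on the window number.
res_all = pd.concat({col: nan_window_sizes(df[col]) for col in df.columns}, axis=1)
print(res_all)
```

For column a this gives the same 2, 1, 2 as `res` above. If columns have different numbers of windows, `pd.concat` pads the shorter ones with NaN.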