Conditional NaN filling

Time: 2014-07-20 13:28:50

Tags: python-2.7 pandas null

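I have a large DataFrame with 11 columns in which I want to replace NaN values with zero, unless every value in another set of columns is NaN; non-null values should be converted to integers. I'm doing it as follows, but with only 8,000 observations it takes a very long time to finish (although it does finish correctly). I estimate it took nearly 20 minutes: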

lt = ['lost_time_a', 'lost_time_b', 'lost_time_c', 'lost_time_d', 'lost_time_e', 'lost_time_f', 'lost_time_g',
      'lost_time_h', 'lost_time_i', 'lost_time_j', 'ttl']
ht = ['hour1', 'hour2', 'hour3', 'hour4', 'hour5', 'hour6', 'hour7', 'hour8', 'hour9', 'hour10', 'hour11',
      'hour12', 'hour13', 'hour14', 'hour15']

for row in FinalDF.index:
    if not all([pd.isnull(FinalDF.loc[row, col]) for col in ht]):
        for Col_ in lt:
            val = FinalDF.loc[row, Col_]
            if pd.isnull(val):
                FinalDF.loc[row, Col_] = 0
            else:
                FinalDF.loc[row, Col_] = int(val)

Any help is appreciated.

Here is some test data:

import pandas as pd
import numpy as np
from numpy import nan as NA
FinalDF = pd.DataFrame({'hour1' : [NA, NA, NA, 70, 60],
                   'hour2' : [100, 50, NA, 120, 100],
                   'hour3' : [120, 80, NA, 130, 100],
                   'hour4' : [140, 90, NA, 120, 70],
                   'hour5' : [130, 200, NA, NA, NA],
                   'hour6' : [NA, NA, NA, 70, 60],
                   'hour7' : [100, 50, NA, 120, 100],
                   'hour8' : [120, 80, NA, 130, 100],
                   'hour9' : [140, 90, NA, 120, 70,],
                   'hour10' :[130, 200, NA, NA, NA],
                   'hour11' : [NA, NA, NA, 70, 60],
                   'hour12' : [100, 50, NA, 120, 100],
                   'hour13' : [120, 80, NA, 130, 100],
                   'hour14' : [140, 90, NA, 120, 70],
                   'hour15' : [130, 200, NA, NA, NA],
                   'lost_time_a' : [NA, NA, NA, NA, NA],
                   'lost_time_b' : [NA, 1.0, NA, NA, 4.1],
                   'lost_time_c' : [NA, NA, NA, NA, 10.1],
                   'lost_time_d' : [1, 2.3, NA, NA, 1],
                   'lost_time_e' : [NA, NA, NA, NA, NA],
                   'lost_time_f' : [NA, 1.0, NA, NA, 4.1],
                   'lost_time_g' : [NA, NA, NA, NA, 10.1],
                   'lost_time_h' : [1, 2.3, NA, NA, 1],
                   'lost_time_i' : [NA, NA, NA, NA, NA],
                   'lost_time_j' : [NA, 1.0, NA, NA, 4.1],
                   'ttl'         : [NA, NA, NA, NA, NA]})

Partial output (lost-time variables):

Out[18]:
   lost_time_a  lost_time_b  lost_time_c  lost_time_d  lost_time_e
0            0            0            0            1            0
1            0            1            0            2            0
2          NaN          NaN          NaN          NaN          NaN
3            0            0            0            0            0
4            0            4           10            1            0

2 answers:

Answer 0 (score: 2)

I think this produces the same result as your code:

def fix(df, ht, lt):
    df = df.copy()
    to_fix = ~df[ht].isnull().all(axis=1), lt
    df.loc[to_fix] = df.loc[to_fix].fillna(0).astype(int)
    return df

(Obviously, you can drop the copy if you're happy with modifying the frame in place.)
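For what it's worth, dropping the copy might look roughly like the sketch below. The name `fix_inplace` and the small `demo` frame are made up for illustration and are not part of the answer; the row mask is written as `notnull().any(axis=1)`, which selects the same rows as `~isnull().all(axis=1)`.

```python
import numpy as np
import pandas as pd


def fix_inplace(df, ht, lt):
    """Hypothetical in-place variant of fix(): mutates df and returns None."""
    # Rows where at least one of the ht columns holds a value.
    rows = df[ht].notnull().any(axis=1)
    # On those rows only, fill the gaps in lt with 0 and truncate the rest.
    df.loc[rows, lt] = df.loc[rows, lt].fillna(0).astype(int)


# Tiny made-up frame, not the asker's data.
demo = pd.DataFrame({
    'hour1': [np.nan, 70.0, np.nan],
    'hour2': [100.0, np.nan, np.nan],
    'lost_time_a': [np.nan, 2.3, np.nan],
    'ttl': [4.1, np.nan, np.nan],
})
fix_inplace(demo, ['hour1', 'hour2'], ['lost_time_a', 'ttl'])
print(demo[['lost_time_a', 'ttl']])
```

The last row stays NaN because both hour columns are missing there. The lt columns therefore remain floating point, since a plain integer column cannot hold NaN.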

>>> df.iloc[:,-5:]
   lost_time_g  lost_time_h  lost_time_i  lost_time_j  ttl
0          NaN          1.0          NaN          NaN  NaN
1          NaN          2.3          NaN          1.0  NaN
2          NaN          NaN          NaN          NaN  NaN
3          NaN          NaN          NaN          NaN  NaN
4         10.1          1.0          NaN          4.1  NaN
>>> fix(df, ht, lt).iloc[:, -5:]
   lost_time_g  lost_time_h  lost_time_i  lost_time_j  ttl
0            0            1            0            0    0
1            0            2            0            1    0
2          NaN          NaN          NaN          NaN  NaN
3            0            0            0            0    0
4           10            1            0            4    0
>>> from pandas.util.testing import assert_frame_equal
>>> assert_frame_equal(orig(df, ht, lt), fix(df, ht, lt))
>>>
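The transcript above calls `orig`, which is not shown. Presumably it is the question's loop wrapped in a function that works on a copy. Here is a self-contained sketch of that check under that assumption, on a small made-up frame. It imports `assert_frame_equal` from `pandas.testing`, which is where newer pandas releases expose it.

```python
import numpy as np
import pandas as pd
from pandas.testing import assert_frame_equal


def orig(df, ht, lt):
    """Assumed wrapper around the question's row-by-row loop."""
    df = df.copy()
    for row in df.index:
        if not all(pd.isnull(df.loc[row, col]) for col in ht):
            for col in lt:
                val = df.loc[row, col]
                df.loc[row, col] = 0 if pd.isnull(val) else int(val)
    return df


def fix(df, ht, lt):
    """The vectorized version from this answer."""
    df = df.copy()
    to_fix = ~df[ht].isnull().all(axis=1), lt
    df.loc[to_fix] = df.loc[to_fix].fillna(0).astype(int)
    return df


# Small made-up sample in the shape of the question's data.
df = pd.DataFrame({
    'hour1': [np.nan, np.nan, 70.0],
    'hour2': [100.0, np.nan, 120.0],
    'lost_time_a': [np.nan, np.nan, 4.1],
    'lost_time_b': [1.0, np.nan, np.nan],
})
ht = ['hour1', 'hour2']
lt = ['lost_time_a', 'lost_time_b']

# Raises AssertionError if the two results differ; silent otherwise.
assert_frame_equal(orig(df, ht, lt), fix(df, ht, lt))
```

Both functions work on a copy, so `df` itself is left unchanged and can be passed to each in turn.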

Answer 1 (score: 1)

Untested, but I think this does what you want? cond is a boolean Series that is True when all of the columns in ht are null.

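# cond does not depend on c, so compute it once outside the loop.
cond = pd.isnull(FinalDF[ht]).all(axis=1)
for c in lt:
    # Leave rows where every ht column is null untouched; elsewhere fill NaN with 0 and truncate to int.
    FinalDF[c] = np.where(cond, FinalDF[c], FinalDF[c].fillna(0).astype(int))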
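If the per-column loop is a concern, the same idea can probably be applied to all of the lt columns in one `np.where` call by reshaping the row condition so it broadcasts across columns. This is only a sketch on a small made-up frame (`demo` is not the asker's data), keeping the convention that rows where every ht column is null are left untouched.

```python
import numpy as np
import pandas as pd

# Tiny made-up frame, not the asker's data.
demo = pd.DataFrame({
    'hour1': [np.nan, 60.0, np.nan],
    'hour2': [50.0, np.nan, np.nan],
    'lost_time_a': [np.nan, 10.1, np.nan],
    'ttl': [1.0, np.nan, np.nan],
})
ht = ['hour1', 'hour2']
lt = ['lost_time_a', 'ttl']

# True for rows where every ht column is missing; those rows are left alone.
cond = demo[ht].isnull().all(axis=1)
# Candidate values: NaN replaced by 0, everything truncated to whole numbers.
filled = demo[lt].fillna(0).astype(int)
# Reshape cond to a column vector so it broadcasts across all lt columns.
demo[lt] = np.where(cond.to_numpy()[:, None],
                    demo[lt].to_numpy(),
                    filled.to_numpy())
print(demo[lt])
```

Mixing the integer and float arrays makes `np.where` return floats, which is what allows the untouched rows to keep their NaN values.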