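I have a large DataFrame with a set of 11 columns in which I want to replace NaN values with zero and convert the non-null values to integers, but only in rows where the values in another set of columns are not all NaN. I am doing it as shown below, but even with only 8,000 observations it takes a very long time to finish (although it does produce the correct result). I estimate it took nearly 20 minutes: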
lt = ['lost_time_a', 'lost_time_b', 'lost_time_c', 'lost_time_d', 'lost_time_e', 'lost_time_f', 'lost_time_g',
'lost_time_h', 'lost_time_i', 'lost_time_j', 'ttl']
ht = ['hour1', 'hour2', 'hour3', 'hour4', 'hour5', 'hour6', 'hour7', 'hour8', 'hour9', 'hour10', 'hour11',
'hour12', 'hour13', 'hour14', 'hour15']
for row in FinalDF.index:
    if not all([pd.isnull(FinalDF.loc[row, col]) for col in ht]):
        for Col_ in lt:
            val = FinalDF.loc[row, Col_]
            if pd.isnull(val):
                FinalDF.loc[row, Col_] = 0
            else:
                FinalDF.loc[row, Col_] = int(val)
Any help is appreciated.
Here is some test data:
import pandas as pd
import numpy as np
from numpy import nan as NA
FinalDF = pd.DataFrame({'hour1' : [NA, NA, NA, 70, 60],
'hour2' : [100, 50, NA, 120, 100],
'hour3' : [120, 80, NA, 130, 100],
'hour4' : [140, 90, NA, 120, 70],
'hour5' : [130, 200, NA, NA, NA],
'hour6' : [NA, NA, NA, 70, 60],
'hour7' : [100, 50, NA, 120, 100],
'hour8' : [120, 80, NA, 130, 100],
'hour9' : [140, 90, NA, 120, 70,],
'hour10' :[130, 200, NA, NA, NA],
'hour11' : [NA, NA, NA, 70, 60],
'hour12' : [100, 50, NA, 120, 100],
'hour13' : [120, 80, NA, 130, 100],
'hour14' : [140, 90, NA, 120, 70],
'hour15' : [130, 200, NA, NA, NA],
'lost_time_a' : [NA, NA, NA, NA, NA],
'lost_time_b' : [NA, 1.0, NA, NA, 4.1],
'lost_time_c' : [NA, NA, NA, NA, 10.1],
'lost_time_d' : [1, 2.3, NA, NA, 1],
'lost_time_e' : [NA, NA, NA, NA, NA],
'lost_time_f' : [NA, 1.0, NA, NA, 4.1],
'lost_time_g' : [NA, NA, NA, NA, 10.1],
'lost_time_h' : [1, 2.3, NA, NA, 1],
'lost_time_i' : [NA, NA, NA, NA, NA],
'lost_time_j' : [NA, 1.0, NA, NA, 4.1],
'ttl' : [NA, NA, NA, NA, NA]})
Partial output (the lost-time variables):
Out[18]:
lost_time_a lost_time_b lost_time_c lost_time_d lost_time_e
0 0 0 0 1 0
1 0 1 0 2 0
2 NaN NaN NaN NaN NaN
3 0 0 0 0 0
4 0 4 10 1 0
Answer 0 (score: 2)
I think this produces the same result as your code:
def fix(df, ht, lt):
    df = df.copy()
    # rows where the ht columns are not all null, restricted to the lt columns
    to_fix = ~df[ht].isnull().all(axis=1), lt
    df.loc[to_fix] = df.loc[to_fix].fillna(0).astype(int)
    return df
(Obviously, you can drop the copy if you are happy with modifying the DataFrame in place.)
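One caveat worth keeping in mind: rows whose hour columns are all NaN keep their NaN values, and a plain integer column cannot hold NaN, so the lost-time columns cannot become ordinary integer columns. If real integers are wanted, one option is pandas' nullable `Int64` dtype. The following is only a minimal sketch on a small made-up frame (the data and the names `fixed` and `as_int` are illustrative, not from the question):

```python
import numpy as np
import pandas as pd

# Small illustrative frame: row 2 has every hour column missing.
df = pd.DataFrame({
    'hour1': [np.nan, 70.0, np.nan],
    'hour2': [100.0, 120.0, np.nan],
    'lost_time_a': [np.nan, 2.3, np.nan],
    'lost_time_b': [1.0, np.nan, np.nan],
})
ht = ['hour1', 'hour2']
lt = ['lost_time_a', 'lost_time_b']

# Same idea as fix(): fill and truncate only rows with at least one hour value.
to_fix = ~df[ht].isnull().all(axis=1)
fixed = df.copy()
fixed.loc[to_fix, lt] = fixed.loc[to_fix, lt].fillna(0).astype(int)

# Row 2 still holds NaN, so convert to the nullable integer dtype;
# the remaining NaN values become <NA>.
as_int = fixed[lt].astype('Int64')
print(as_int)
```

The conversion to `Int64` only succeeds here because every remaining non-null value is already a whole number after the truncation step.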
>>> df.iloc[:,-5:]
lost_time_g lost_time_h lost_time_i lost_time_j ttl
0 NaN 1.0 NaN NaN NaN
1 NaN 2.3 NaN 1.0 NaN
2 NaN NaN NaN NaN NaN
3 NaN NaN NaN NaN NaN
4 10.1 1.0 NaN 4.1 NaN
>>> fix(df, ht, lt).iloc[:, -5:]
lost_time_g lost_time_h lost_time_i lost_time_j ttl
0 0 1 0 0 0
1 0 2 0 1 0
2 NaN NaN NaN NaN NaN
3 0 0 0 0 0
4 10 1 0 4 0
>>> from pandas.testing import assert_frame_equal
>>> assert_frame_equal(orig(df, ht, lt), fix(df, ht, lt))
>>>
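The `orig` function used in the comparison is not shown in this answer. Presumably it is the loop from the question wrapped in a function; under that assumption, a self-contained version of the check could look like the sketch below (the small frame is made up for illustration):

```python
import numpy as np
import pandas as pd
from pandas.testing import assert_frame_equal


def orig(df, ht, lt):
    """Assumed wrapper around the question's row-by-row loop."""
    df = df.copy()
    for row in df.index:
        if not all(pd.isnull(df.loc[row, col]) for col in ht):
            for col in lt:
                val = df.loc[row, col]
                df.loc[row, col] = 0 if pd.isnull(val) else int(val)
    return df


def fix(df, ht, lt):
    """Vectorized version from this answer."""
    df = df.copy()
    to_fix = ~df[ht].isnull().all(axis=1), lt
    df.loc[to_fix] = df.loc[to_fix].fillna(0).astype(int)
    return df


# Illustrative data: row 2 has every hour column missing and must stay NaN.
df = pd.DataFrame({
    'hour1': [np.nan, 70.0, np.nan],
    'hour2': [100.0, 120.0, np.nan],
    'lost_time_a': [np.nan, 2.3, np.nan],
    'lost_time_b': [1.0, np.nan, np.nan],
})
ht = ['hour1', 'hour2']
lt = ['lost_time_a', 'lost_time_b']

slow = orig(df, ht, lt)
fast = fix(df, ht, lt)

# Raises AssertionError if the two results differ; silent otherwise.
assert_frame_equal(slow, fast)
```

The loop touches one cell at a time through `.loc`, while `fix` handles the whole block in a single assignment, which is where the speed difference comes from.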
Answer 1 (score: 1)
Untested, but I think this does what you want? cond is a boolean Series that is True for the rows where all of the columns in ht are null.
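# cond does not depend on c, so compute it once outside the loop
cond = pd.isnull(FinalDF[ht]).all(axis=1)
for c in lt:
    # where cond is True (all ht null) keep the column as is; elsewhere fill NaN with 0 and truncate
    FinalDF[c] = np.where(cond, FinalDF[c], FinalDF[c].fillna(0).astype(int))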
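The same `np.where` idea can also be applied to all of the lt columns at once instead of one column per iteration. This is only a sketch on a small made-up frame; the names `all_null` and `block` are illustrative:

```python
import numpy as np
import pandas as pd

FinalDF = pd.DataFrame({
    'hour1': [np.nan, 70.0, np.nan],
    'hour2': [100.0, 120.0, np.nan],
    'lost_time_a': [np.nan, 2.3, np.nan],
    'lost_time_b': [1.0, np.nan, np.nan],
})
ht = ['hour1', 'hour2']
lt = ['lost_time_a', 'lost_time_b']

# True for rows where every ht column is null, reshaped to a column
# vector so that it broadcasts across all lt columns.
all_null = FinalDF[ht].isnull().all(axis=1).to_numpy()[:, None]

block = FinalDF[lt]
# Keep all-null rows unchanged; elsewhere fill NaN with 0 and truncate to int.
FinalDF[lt] = np.where(
    all_null,
    block.to_numpy(),
    block.fillna(0).astype(int).to_numpy(),
)
print(FinalDF[lt])
```

Because the untouched rows still contain NaN, the result is a float array, so the truncated values are stored as whole-number floats.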