I have a data frame like this:
datecol1 datecol2
2005-02-22 EmployeeNotFound
2010-02-21 2010-02-22
EmployeeNotFound EmployeeNotFound
EmployeeNotFound 2010-02-22
Both columns have dtype object.
I want to compare the two columns and get the maximum date in each row.
So the expected result is:
datecol1 datecol2 ExpectedResult
2005-02-22 EmployeeNotFound 2005-02-22
2010-02-21 2010-02-22 2010-02-22
EmployeeNotFound EmployeeNotFound EmployeeNotFound
EmployeeNotFound 2010-02-22 2010-02-22
The dtype of ExpectedResult should again be object.
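For reproducibility, the sample frame can be built as in the sketch below. It assumes the values are stored as plain strings; `dtype=object` is passed explicitly so the columns report the object dtype regardless of the pandas version's default string handling.

```python
import pandas as pd

# Sample data copied from the question; every value is a plain string.
df = pd.DataFrame(
    {
        "datecol1": ["2005-02-22", "2010-02-21", "EmployeeNotFound", "EmployeeNotFound"],
        "datecol2": ["EmployeeNotFound", "2010-02-22", "EmployeeNotFound", "2010-02-22"],
    },
    dtype=object,  # force object dtype explicitly
)

print(df.dtypes)
```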
Answer 0 (score: 2)
Convert the columns to datetime, take the max along axis 1, and finally convert back to strings and replace NaT:
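import pandas as pd

cols = ['datecol1', 'datecol2']
# non-date strings such as 'EmployeeNotFound' become NaT
df[cols] = df[cols].apply(pd.to_datetime, errors='coerce')
# row-wise maximum; NaT values are skipped
df['ExpectedResult'] = df[cols].max(axis=1)
# back to strings, restoring the placeholder
df = df.astype(str).replace('NaT', 'EmployeeNotFound')
# alternative solution
# df = df.astype(str).mask(df.isnull(), 'EmployeeNotFound')
print(df)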
datecol1 datecol2 ExpectedResult
0 2005-02-22 EmployeeNotFound 2005-02-22
1 2010-02-21 2010-02-22 2010-02-22
2 EmployeeNotFound EmployeeNotFound EmployeeNotFound
3 EmployeeNotFound 2010-02-22 2010-02-22
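The same steps can also be wrapped in a helper that leaves the input columns untouched. This is only a sketch: the function name `row_max_date` and its `placeholder` parameter are made up for illustration, and it assumes every date is an ISO `YYYY-MM-DD` string.

```python
import pandas as pd


def row_max_date(df, cols, placeholder="EmployeeNotFound"):
    """Return the latest date per row as a string, or the placeholder if no date parses."""
    # Parse each column; anything not matching YYYY-MM-DD (assumed format) becomes NaT.
    parsed = df[cols].apply(pd.to_datetime, format="%Y-%m-%d", errors="coerce")
    # max() skips NaT, so a row stays NaT only when every column failed to parse.
    latest = parsed.max(axis=1)
    # Format back to strings; missing results are filled with the placeholder.
    return latest.dt.strftime("%Y-%m-%d").fillna(placeholder)


df = pd.DataFrame(
    {
        "datecol1": ["2005-02-22", "2010-02-21", "EmployeeNotFound", "EmployeeNotFound"],
        "datecol2": ["EmployeeNotFound", "2010-02-22", "EmployeeNotFound", "2010-02-22"],
    },
    dtype=object,
)
df["ExpectedResult"] = row_max_date(df, ["datecol1", "datecol2"])
print(df)
```

Because the parsed values live in a temporary frame, datecol1 and datecol2 keep their original strings and no round trip through `astype(str)` is needed.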
Answer 1 (score: 0)
You can also use numpy, since numpy functions are often faster.
import numpy as np
cond = df['datecol1'] != 'EmployeeNotFound'
df['ExpectedResult'] = np.where(cond, df['datecol1'], df['datecol2'])
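First, all valid values of datecol1 are filled in; the remaining rows are then filled from the second column, datecol2. Note that when both columns hold a valid date, this keeps the datecol1 value even if datecol2 is later, as in row 1 of the sample data.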
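To get the later of two valid dates with `np.where`, the condition can compare parsed dates rather than only checking for the placeholder. A minimal sketch, assuming ISO `YYYY-MM-DD` strings; the names `d1`, `d2` and `use_first` are illustrative and not part of the answer above.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "datecol1": ["2005-02-22", "2010-02-21", "EmployeeNotFound", "EmployeeNotFound"],
        "datecol2": ["EmployeeNotFound", "2010-02-22", "EmployeeNotFound", "2010-02-22"],
    },
    dtype=object,
)

# Parsed copies used only for the comparison; non-dates become NaT.
d1 = pd.to_datetime(df["datecol1"], format="%Y-%m-%d", errors="coerce")
d2 = pd.to_datetime(df["datecol2"], format="%Y-%m-%d", errors="coerce")

# Take datecol1 when datecol2 is not a date, or when datecol1 is at least as late.
# Comparisons involving NaT are False, so a missing datecol1 falls through to datecol2.
use_first = d2.isna() | (d1 >= d2)
df["ExpectedResult"] = np.where(use_first, df["datecol1"], df["datecol2"])
print(df)
```

The original string values are selected, so ExpectedResult stays an object column and rows with no date in either column keep EmployeeNotFound.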