Question

I have a dataframe with a column of birthdays in which regular dates are mixed with Excel serial dates, like this:

09/01/2020 12:00:00 AM
05/15/1985 12:00:00 AM
06/07/2013 12:00:00 AM
33233
26299
29428

I tried the solution in this answer, but every date in Excel serial format was blanked out, while the dates in the regular format were kept.

Here is my code:

import pandas as pd
import xlrd
import numpy as np
from numpy import *
from numpy.core import *
import os
import datetime
from datetime import datetime, timedelta
import glob

def from_excel_ordinal(ordinal, _epoch0=datetime(1899, 12, 31)):
    if ordinal >= 60:
        ordinal -= 1  # Excel leap year bug, 1900 is not a leap year!
    return (_epoch0 + timedelta(days=ordinal)).replace(microsecond=0)

path = 'C:\\Input'
os.chdir(path)
filelist = glob.glob('*BLAH*.xlsx')  
filename = os.fsdecode(filelist[0])
df = pd.read_excel(filename, sheet_name = 'Blah Blah') 
m = df['Birthday'].astype(str).str.isdigit()
df.loc[m, 'Birthday'] = df.loc[m, 'Birthday'].astype(int).apply(from_excel_ordinal)
df['Birthday'] = pd.to_datetime(df['Birthday'], errors = 'coerce')

I'm not sure where this goes wrong, because the code shouldn't be wiping out the birthdays like this.

Answer 1

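All of the dates can't be parsed in the same manner.
Load the dataframe.
Cast the dates column as a str if it isn't one already.
Use Boolean Indexing to select the different date types:
- Assume regular dates contain a /
- Assume Excel serial dates do not contain a /
Fix each dataframe separately based on its date type.
Concat the dataframes back together.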

import pandas as pd
from datetime import datetime

# load data
df = pd.DataFrame({'dates': ['09/01/2020', '05/15/1985', '06/07/2013', '33233', '26299', '29428']})

# display(df)

        dates
0  09/01/2020
1  05/15/1985
2  06/07/2013
3       33233
4       26299
5       29428

# set the column type as a str if it isn't already
df.dates = df.dates.astype('str')

# create a date mask based on the string containing a /
date_mask = df.dates.str.contains('/')

# split the dates out for excel
df_excel = df[~date_mask].copy()

# split the regular dates out
df_reg = df[date_mask].copy()

# convert reg dates to datetime
df_reg.dates = pd.to_datetime(df_reg.dates)

# convert excel dates to datetime; the column needs to be cast as ints
# Excel's 1900 date system counts a nonexistent 1900-02-29, so day 0 is effectively 1899-12-30
df_excel.dates = pd.to_timedelta(df_excel.dates.astype(int), unit='D') + datetime(1899, 12, 30)

# combine the dataframes
df = pd.concat([df_reg, df_excel])

# display(df)

       dates
0 2020-09-01
1 1985-05-15
2 2013-06-07
3 1990-12-26
4 1972-01-01
5 1980-07-26
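
As a variation on the steps above, here is a minimal sketch that converts both kinds of rows with pd.to_datetime and restores the original row order. It assumes that any all-digit string is an Excel serial date from the default 1900 date system and that the regular dates follow month/day/year; the names s, is_serial, regular and serial are only illustrative.

```python
import pandas as pd

# Sample data: regular dates mixed with Excel serial dates
df = pd.DataFrame({'dates': ['09/01/2020', '05/15/1985', '06/07/2013',
                             '33233', '26299', '29428']})

s = df['dates'].astype(str)

# Assumption: rows made up only of digits are Excel serial dates
is_serial = s.str.isdigit()

# Parse the regular dates with an explicit month/day/year format
regular = pd.to_datetime(s[~is_serial], format='%m/%d/%Y')

# Assumption: 1900 date system, whose effective day 0 is 1899-12-30
serial = pd.to_datetime(s[is_serial].astype(int), unit='D', origin='1899-12-30')

# Recombine the two pieces and restore the original row order
df['dates'] = pd.concat([regular, serial]).sort_index()
print(df)
```

If the real column also carries a time part such as 12:00:00 AM, the format string would need to be adjusted to match.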

Answer 2

pd.to_timedelta(dates_in_excel_serial_format, unit='D') + pd.Timestamp(1899, 12, 30)

(1899-12-30 is the effective day 0 of Excel's 1900 date system.)

Demo:

> dates_in_excel_serial_format = [29428]
> pd.to_timedelta(dates_in_excel_serial_format, unit='D') + pd.Timestamp(1899, 12, 30)
< DatetimeIndex(['1980-07-26'], dtype='datetime64[ns]', freq=None)
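
A quick way to sanity-check any serial-to-date conversion is to redo the arithmetic with the standard library alone. The sketch below assumes the workbook uses Excel's default 1900 date system; EXCEL_EPOCH and excel_serial_to_datetime are hypothetical names chosen for illustration, not part of pandas or xlrd.

```python
from datetime import datetime, timedelta

# Assumption: Excel's default 1900 date system.
# Serial 1 is 1900-01-01, and Excel also counts a nonexistent 1900-02-29
# (serial 60), so for serials above 60 the effective day 0 is 1899-12-30.
EXCEL_EPOCH = datetime(1899, 12, 30)

def excel_serial_to_datetime(serial):
    """Convert an Excel serial date (valid for serial > 60) to a datetime."""
    return EXCEL_EPOCH + timedelta(days=serial)

for serial in (33233, 26299, 29428):
    print(serial, excel_serial_to_datetime(serial).date())
```

A well-known reference point is serial 36526, which Excel displays as 2000-01-01; a conversion that returns anything else for it is using the wrong origin.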

How to convert a column with Excel serial dates and regular dates to pandas datetime?
