Question

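I'm having some issues with dates in compressed xlsx files. These files are loaded into a sqlite database and then exported as .csv. Each file is about 40,000 rows per day. The problem is that pd.to_datetime doesn't seem to work on these objects (I think the Excel-formatted dates are the cause; plain .csv files work fine with this command). That's actually fine, since I don't need them in datetime format.

What I want is a column called ShortDate in the format %m/%d/%Y. How can I do this on a datetime object (the format coming from Excel is mm/dd/yyyy hh:mm:ss)? I will then create a new column called RosterID, which combines the EmployeeID field and the ShortDate field into a unique ID.

I'm very new to pandas and currently only use it to process .csv files (renaming and selecting certain columns, creating unique IDs to use in Tableau filters, etc.).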

rep = pd.read_csv(r'C:\Users\Desktop\test.csv.gz', dtype = 'str', compression = 'gzip', usecols = ['etc','etc2'])
print('Read successfully.')
rep['Total']=1
rep['UniqueID']= rep['EmployeeID'] + rep['InteractionID']
rep['ShortDate'] = ??? #what do I do here to get what I am looking for?
rep['RosterID']= rep['EmployeeID'] + rep['ShortDate'] # this is my goal
print('Modified successfully.')

Here is some raw data from the .csv. The column names are:

InteractionID, Created Date, EmployeeID, Repeat Date
07927,04/01/2014 14:05:10,912a,04/01/2014 14:50:03
02158,04/01/2014 13:44:05,172r,04/04/2014 17:47:29
44279,04/01/2014 17:28:36,217y,04/07/2014 22:06:19

Answer 1

You can apply a post-processing step that first converts the strings to datetimes and then applies a lambda to keep only the date portion:

In [29]:

df['Created Date'] = pd.to_datetime(df['Created Date']).apply(lambda x: x.date())
df['Repeat Date'] = pd.to_datetime(df['Repeat Date']).apply(lambda x: x.date())
df


Out[29]:
   InteractionID Created Date EmployeeID Repeat Date
0           7927   2014-04-01       912a  2014-04-01
1           2158   2014-04-01       172r  2014-04-04
2          44279   2014-04-01       217y  2014-04-07
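The question mentions that pd.to_datetime did not seem to work on the Excel-sourced values. One thing that may help, as a minimal sketch rather than a confirmed fix, is passing an explicit format and errors="coerce" so that unparseable values become NaT and can be inspected. The "not a date" value below is made up for illustration; the other values come from the question's sample data.

```python
import pandas as pd

# Sample rows from the question; the third date is a hypothetical bad value
# added only to show what errors="coerce" does.
df = pd.DataFrame({
    "EmployeeID": ["912a", "172r", "217y"],
    "Repeat Date": [
        "04/01/2014 14:50:03",
        "04/04/2014 17:47:29",
        "not a date",
    ],
})

# An explicit format avoids month/day guessing; values that do not match
# the format are turned into NaT instead of raising an error.
parsed = pd.to_datetime(
    df["Repeat Date"], format="%m/%d/%Y %H:%M:%S", errors="coerce"
)

# Count the rows that failed to parse so they can be checked by hand.
n_bad = parsed.isna().sum()
print(parsed)
print("unparseable rows:", n_bad)
```

If n_bad is zero on the real data, the format assumption holds and the date-only conversion can be applied to the parsed column.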

EDIT

Looking at this again: if your version of pandas is 0.15.0 or later, you can use dt.date to access just the date component:

In [18]:

df['just_date'] = df['Repeat Date'].dt.date
df

Out[18]:
   InteractionID        Created Date EmployeeID         Repeat Date  \
0           7927 2014-04-01 14:05:10       912a 2014-04-01 14:50:03
1           2158 2014-04-01 13:44:05       172r 2014-04-04 17:47:29
2          44279 2014-04-01 17:28:36       217y 2014-04-07 22:06:19

    just_date
0  2014-04-01
1  2014-04-04
2  2014-04-07

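You can also now use dt.strftime rather than apply to get the result you want. The example uses '%m%d%Y'; pass '%m/%d/%Y' instead if you want to keep the slashes: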

In [28]:

df['short_date'] = df['Repeat Date'].dt.strftime('%m%d%Y')
df

Out[28]:
   InteractionID        Created Date EmployeeID         Repeat Date  \
0           7927 2014-04-01 14:05:10       912a 2014-04-01 14:50:03
1           2158 2014-04-01 13:44:05       172r 2014-04-04 17:47:29
2          44279 2014-04-01 17:28:36       217y 2014-04-07 22:06:19

    just_date short_date
0  2014-04-01   04012014
1  2014-04-04   04042014
2  2014-04-07   04072014

So generating the roster IDs is now a trivial exercise: concatenate the EmployeeID column with the new short_date column:

In [30]:

df['Roster ID'] = df['EmployeeID'] + df['short_date']
df

Out[30]:
   InteractionID        Created Date EmployeeID         Repeat Date  \
0           7927 2014-04-01 14:05:10       912a 2014-04-01 14:50:03
1           2158 2014-04-01 13:44:05       172r 2014-04-04 17:47:29
2          44279 2014-04-01 17:28:36       217y 2014-04-07 22:06:19

    just_date short_date     Roster ID
0  2014-04-01   04012014  912a04012014
1  2014-04-04   04042014  172r04042014
2  2014-04-07   04072014  217y04072014
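Putting the steps of this answer together, here is a self-contained sketch using the question's sample rows. It assumes ShortDate should be derived from Created Date (the question does not say which date column to use) and reads everything as strings, as the question's read_csv call does, so leading zeros in InteractionID survive.

```python
import io
import pandas as pd

# The sample rows from the question, inlined so the example runs on its own.
raw = io.StringIO(
    "InteractionID,Created Date,EmployeeID,Repeat Date\n"
    "07927,04/01/2014 14:05:10,912a,04/01/2014 14:50:03\n"
    "02158,04/01/2014 13:44:05,172r,04/04/2014 17:47:29\n"
    "44279,04/01/2014 17:28:36,217y,04/07/2014 22:06:19\n"
)

# dtype=str keeps every column as text, so "07927" is not turned into 7927.
rep = pd.read_csv(raw, dtype=str)

# Parse with an explicit format, then format back to a date-only string.
# Assumption: ShortDate is based on "Created Date".
created = pd.to_datetime(rep["Created Date"], format="%m/%d/%Y %H:%M:%S")
rep["ShortDate"] = created.dt.strftime("%m/%d/%Y")

# Both columns are strings, so + concatenates them element-wise.
rep["RosterID"] = rep["EmployeeID"] + rep["ShortDate"]

print(rep[["InteractionID", "EmployeeID", "ShortDate", "RosterID"]])
```

The source columns are left untouched, so the full timestamps remain available for other uses.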

Answer 2

Create a new column, then apply a simple function to it using datetime and a lambda.

In [14]: df['Short Date']= pd.to_datetime(df['Created Date'])

In [15]: df
Out[15]: 
   InteractionID    Created Date EmployeeID     Repeat Date  \
0           7927  4/1/2014 14:05       912a  4/1/2014 14:50   
1           2158  4/1/2014 13:44       172r  4/4/2014 17:47   
2          44279  4/1/2014 17:28       217y  4/7/2014 22:06   

           Short Date  
0 2014-04-01 14:05:00  
1 2014-04-01 13:44:00  
2 2014-04-01 17:28:00  

In [16]: df['Short Date'] = df['Short Date'].apply(lambda x:x.date().strftime('%m%d%y'))

In [17]: df
Out[17]: 
   InteractionID    Created Date EmployeeID     Repeat Date Short Date  
0           7927  4/1/2014 14:05       912a  4/1/2014 14:50     040114   
1           2158  4/1/2014 13:44       172r  4/4/2014 17:47     040114   
2          44279  4/1/2014 17:28       217y  4/7/2014 22:06     040114

Then just concatenate the two columns. Convert the Short Date column to strings to avoid an error when concatenating strings with integers.

In [32]: df['Roster ID'] = df['EmployeeID'] + df['Short Date'].map(str)

In [33]: df
Out[33]: 
   InteractionID    Created Date EmployeeID     Repeat Date Short Date  \
0           7927  4/1/2014 14:05       912a  4/1/2014 14:50     040114   
1           2158  4/1/2014 13:44       172r  4/4/2014 17:47     040114   
2          44279  4/1/2014 17:28       217y  4/7/2014 22:06     040114   

    Roster ID  
0  912a040114  
1  172r040114  
2  217y040114
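The string conversion matters most when one of the columns really is numeric. In the outputs above InteractionID was parsed as an integer, so building something like the question's UniqueID needs an explicit cast. A small sketch, using astype(str) as an alternative to map(str):

```python
import pandas as pd

# InteractionID as integers (as it appears in the outputs above),
# EmployeeID as strings.
df = pd.DataFrame({
    "InteractionID": [7927, 2158, 44279],
    "EmployeeID": ["912a", "172r", "217y"],
})

# Cast the integer column to strings before concatenating with +.
df["UniqueID"] = df["EmployeeID"] + df["InteractionID"].astype(str)

print(df)
```

Note that the cast cannot bring back leading zeros already lost during parsing; reading the file with dtype=str, as the question does, avoids that.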

Answer 3

You can also do this with just the standard library (in any format you want: '%m/%d/%Y', '%m-%d-%Y', or another order/format):

In [118]:

import time
df['Created Date'] = df['Created Date'].apply(lambda x: time.strftime('%m/%d/%Y', time.strptime(x, '%m/%d/%Y %H:%M:%S')))

In [120]:

print(df)
   InteractionID Created Date EmployeeID          Repeat Date
0           7927   04/01/2014       912a  04/01/2014 14:50:03
1           2158   04/01/2014       172r  04/04/2014 17:47:29
2          44279   04/01/2014       217y  04/07/2014 22:06:19
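The same standard-library idea can be written with datetime.strptime and wrapped in a small helper that tolerates bad values. This is only a sketch: the helper name to_short_date and the "bad value" entry are made up for illustration.

```python
from datetime import datetime


def to_short_date(value, in_fmt="%m/%d/%Y %H:%M:%S", out_fmt="%m/%d/%Y"):
    """Reformat a timestamp string; return None if it does not match in_fmt."""
    try:
        return datetime.strptime(value, in_fmt).strftime(out_fmt)
    except (TypeError, ValueError):
        # TypeError: value is not a string (e.g. None);
        # ValueError: the string does not match in_fmt.
        return None


# Two values from the question plus one hypothetical malformed entry.
created = ["04/01/2014 14:05:10", "04/01/2014 13:44:05", "bad value"]
short = [to_short_date(v) for v in created]
print(short)
```

The helper can be passed directly to a column's apply, for example df['Created Date'].apply(to_short_date).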

Strip time from an object date in pandas
