Python / Pandas DataFrame and datetime.timedelta: problem with the first row

Asked: 2013-12-26 19:14:55

Tags: python datetime pandas

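REVISED: After looking into this, it seems my first problem is that I build the DataFrame with a for loop and df.append.

Using a print statement to inspect each row before it is appended (see the "Rows to append" section below), I assign the variables to the keys of a dict and append the dict to what starts out as an empty DataFrame.

The first append seems to change the type of the first entry in the last column.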

  for line in dat.readlines():
      # ... xx (IN time) and yy (OUT time) are parsed from line here ...
      if xx is not None and yy is not None and xx < yy:
          zz = yy - xx
          print (yy, xx, zz)
          parms['datein'] = xx
          parms['dateout'] = yy
          parms['time'] = zz
          df = df.append(parms, ignore_index=True)
          yy = None
          xx = None

At this point the DataFrame has an odd entry in the first row of the last column, shown below:

                datein             dateout            time
0  2013-11-01 06:10:00 2013-11-01 12:06:00  21360000000000
1  2013-11-01 12:51:00 2013-11-01 14:53:00         2:02:00
2  2013-11-04 06:02:00 2013-11-04 14:04:00         8:02:00
3  2013-11-05 05:56:00 2013-11-05 12:11:00         6:15:00

So this is now a DataFrame.append problem.
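
The odd value looks like the same duration expressed as an integer count of nanoseconds, which would be consistent with a timedelta64[ns] entry losing its type. A quick check in plain Python, as a sketch (the variable names are illustrative, not from the original code):

```python
from datetime import timedelta

odd_value = 21360000000000       # first 'time' entry as displayed in the frame
expected = timedelta(0, 21360)   # the timedelta that was printed before appending

# Interpret the integer as nanoseconds (1 microsecond = 1000 nanoseconds)
as_timedelta = timedelta(microseconds=odd_value // 1000)

print(as_timedelta)              # 5:56:00
print(as_timedelta == expected)  # True
```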

Trying to add up the datetime.timedelta values in the column raises an error:

TypeError: unsupported operand type(s) for +: 'long' and 'datetime.timedelta'
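
The error arises because one element is an integer while the rest are timedeltas. As a possible stopgap, the values could be normalized before summing. The sketch below assumes the bare integer is a nanosecond count; `as_timedelta` is a hypothetical helper, not part of pandas:

```python
from datetime import timedelta

def as_timedelta(value):
    """Return value as a datetime.timedelta.

    Assumption: a bare integer is a duration in nanoseconds.
    """
    if isinstance(value, timedelta):
        return value
    return timedelta(microseconds=value / 1000.0)

# Column contents as in the frame above: the first entry degraded to an integer
mixed = [21360000000000, timedelta(0, 7320), timedelta(0, 28920)]

# Give sum() a timedelta start value; the default start of 0 cannot be added to a timedelta
total = sum((as_timedelta(v) for v in mixed), timedelta())
print(total)  # 16:00:00
```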

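Displaying the DataFrame itself shows the problem in the first row, which is not present in the rows that were appended to it.

Rows to append: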

(datetime.datetime(2013, 11, 1, 12, 6), datetime.datetime(2013, 11, 1, 6, 10),datetime.timedelta(0, 21360))
(datetime.datetime(2013, 11, 1, 14, 53), datetime.datetime(2013, 11, 1, 12, 51),datetime.timedelta(0, 7320))
(datetime.datetime(2013, 11, 4, 14, 4), datetime.datetime(2013, 11, 4, 6, 2),datetime.timedelta(0, 28920))
(datetime.datetime(2013, 11, 5, 12, 11), datetime.datetime(2013, 11, 5, 5, 56),datetime.timedelta(0, 22500))
(datetime.datetime(2013, 11, 5, 14, 42), datetime.datetime(2013, 11, 5, 12, 38),datetime.timedelta(0, 7440))

The DataFrame is built from a dat file in a for loop, set up with the following commands:

parms = dict.fromkeys(keys)
df = pd.DataFrame(columns=keys)
df = df.append(parms,ignore_index=True)

DataFrame output:

                           time
dateout                            
2013-11-01 12:06:00  21360000000000
2013-11-01 14:53:00         2:02:00
2013-11-04 14:04:00         8:02:00
2013-11-05 12:11:00         6:15:00
2013-11-05 14:42:00         2:04:00

I am using df.groupby(df.index.date).sum(), but the first row in the frame seems to be throwing it off. Any ideas why the first row shows the odd 'long' value?

Revision 2

Dat file:

IN    11/01/2013        14:32
OUT   11/01/2013        18:32
IN    11/01/2013        18:58
OUT   11/01/2013        20:57
IN    11/04/2013        14:33
OUT   11/04/2013        18:30
IN    11/04/2013        18:57
OUT   11/04/2013        23:01
IN    11/05/2013        14:29
OUT   11/05/2013        18:31
IN    11/05/2013        18:58
OUT   11/05/2013        23:01
IN    11/06/2013        14:30
OUT   11/06/2013        18:31
IN    11/06/2013        18:57
OUT   11/06/2013        23:00
IN    11/07/2013        14:30
OUT   11/07/2013        18:31
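
For reference, records in this layout can also be parsed by splitting on whitespace rather than slicing fixed columns. This is only a sketch that assumes every line has exactly three whitespace-separated fields; `parse_line` is a hypothetical helper:

```python
from datetime import datetime

def parse_line(line):
    """Split one record into (direction, timestamp)."""
    direction, date_text, time_text = line.split()
    stamp = datetime.strptime(date_text + " " + time_text, "%m/%d/%Y %H:%M")
    return direction, stamp

sample = [
    "IN    11/01/2013        14:32",
    "OUT   11/01/2013        18:32",
]
(_, clock_in), (_, clock_out) = [parse_line(text) for text in sample]
print(clock_out - clock_in)  # 4:00:00
```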

Code:

import numpy as np
import pandas as pd
import struct
from datetime import datetime 
keys = ['datein','dateout','time']
parms = dict.fromkeys(keys)
df = pd.DataFrame(columns=keys)
dat = open(datfile,'r')
xx = yy = None   ## <-- no pending IN/OUT pair yet; avoids a NameError on the first line
for line in dat.readlines():

    opt, date, mgr, tim = line[:3], line[6:16], line[18:22], line[24:29]
    f = datetime.combine(datetime.strptime(date, '%m/%d/%Y'),datetime.strptime(tim, '%H:%M').time())
    if opt == 'IN ' and f > datetime(2013, 11, 1) and f < datetime(2013, 12, 1):
        xx = f

    if opt == 'OUT' and f > datetime(2013, 11, 1) and f < datetime(2013, 12, 1):
        yy = f 

    if xx is not None and yy is not None and xx < yy:
        zz=yy-xx
        print (yy, xx, zz)   ## <-- Check lines before appending dataframe, output below
        parms['datein'] = xx
        parms['dateout'] = yy
        parms['time'] = zz
        df = df.append(parms,ignore_index=True)

        if len(df.index) == 1:
            print df.dtypes   ## <---- This shows 'time' as timedelta64

        if len(df.index) == 2:
            print df.dtypes   ## after the next row is appended, 'time' shows 'object' and the first row loses its type.
        yy=None   ## <-- reset before next loop
        xx=None   ## <-- reset before next loop

print df.dtypes
print df

Output of the print before each row is appended to df:

(datetime.datetime(2013, 11, 4, 14, 4), datetime.datetime(2013, 11, 4, 6, 2),datetime.timedelta(0, 28920))
(datetime.datetime(2013, 11, 5, 12, 11), datetime.datetime(2013, 11, 5, 5, 56), datetime.timedelta(0, 22500))
(datetime.datetime(2013, 11, 5, 14, 42), datetime.datetime(2013, 11, 5, 12, 38), datetime.timedelta(0, 7440))
(datetime.datetime(2013, 11, 6, 12, 7), datetime.datetime(2013, 11, 6, 5, 49), datetime.timedelta(0, 22680))
(datetime.datetime(2013, 11, 6, 14, 37), datetime.datetime(2013, 11, 6, 12, 24), datetime.timedelta(0, 7980))
(datetime.datetime(2013, 11, 7, 14, 7), datetime.datetime(2013, 11, 7, 6, 8), datetime.timedelta(0, 28740))
(datetime.datetime(2013, 11, 8, 11, 58), datetime.datetime(2013, 11, 8, 5, 53), datetime.timedelta(0, 21900))
(datetime.datetime(2013, 11, 8, 14, 10), datetime.datetime(2013, 11, 8, 12, 21), datetime.timedelta(0, 6540))
(datetime.datetime(2013, 11, 11, 12, 16), datetime.datetime(2013, 11, 11, 6, 6), datetime.timedelta(0, 22200))
(datetime.datetime(2013, 11, 11, 14, 31), datetime.datetime(2013, 11, 11, 12, 49), datetime.timedelta(0, 6120))

And the dtypes:

datein     datetime64[ns]
dateout    datetime64[ns]
time               object
dtype: object

The DataFrame, showing that the first row is off:

                datein             dateout            time
0  2013-11-01 06:10:00 2013-11-01 12:06:00  21360000000000
1  2013-11-01 12:51:00 2013-11-01 14:53:00         2:02:00
2  2013-11-04 06:02:00 2013-11-04 14:04:00         8:02:00
3  2013-11-05 05:56:00 2013-11-05 12:11:00         6:15:00
4  2013-11-05 12:38:00 2013-11-05 14:42:00         2:04:00
5  2013-11-06 05:49:00 2013-11-06 12:07:00         6:18:00
6  2013-11-06 12:24:00 2013-11-06 14:37:00         2:13:00
7  2013-11-07 06:08:00 2013-11-07 14:07:00         7:59:00
8  2013-11-08 05:53:00 2013-11-08 11:58:00         6:05:00
9  2013-11-08 12:21:00 2013-11-08 14:10:00         1:49:00
10 2013-11-11 06:06:00 2013-11-11 12:16:00         6:10:00
11 2013-11-11 12:49:00 2013-11-11 14:31:00         1:42:00
12 2013-11-12 06:04:00 2013-11-12 12:24:00         6:20:00
13 2013-11-12 12:40:00 2013-11-12 12:59:00         0:19:00
14 2013-11-13 06:04:00 2013-11-13 12:19:00         6:15:00
15 2013-11-13 12:42:00 2013-11-13 14:35:00         1:53:00
16 2013-11-14 06:05:00 2013-11-14 12:22:00         6:17:00

2 answers:

Answer 0: (score: 0)

This uses pandas 0.13 (currently 0.13rc1). Timedelta support in earlier versions was only minimal.

In [23]: df
Out[23]: 
                date1               date2       td
0 2013-11-01 12:06:00 2013-11-01 06:10:00 05:56:00
1 2013-11-01 14:53:00 2013-11-01 12:51:00 02:02:00
2 2013-11-04 14:04:00 2013-11-04 06:02:00 08:02:00
3 2013-11-05 12:11:00 2013-11-05 05:56:00 06:15:00
4 2013-11-05 14:42:00 2013-11-05 12:38:00 02:04:00

[5 rows x 3 columns]

In [24]: df2 = df.set_index('date1')

In [25]: df2
Out[25]: 
                                  date2       td
date1                                           
2013-11-01 12:06:00 2013-11-01 06:10:00 05:56:00
2013-11-01 14:53:00 2013-11-01 12:51:00 02:02:00
2013-11-04 14:04:00 2013-11-04 06:02:00 08:02:00
2013-11-05 12:11:00 2013-11-05 05:56:00 06:15:00
2013-11-05 14:42:00 2013-11-05 12:38:00 02:04:00

[5 rows x 2 columns]

In [26]: pd.to_timedelta(df2.groupby(df2.index.date)['td'].sum())
Out[26]: 
2013-11-01   07:58:00
2013-11-04   08:02:00
2013-11-05   08:19:00
Name: td, dtype: timedelta64[ns]

Even this is not fully supported (I had to convert the output back to timedeltas). This is currently an open issue with groupby that will be fixed in 0.14, see here: https://github.com/pydata/pandas/issues/5724
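
In later pandas releases, where groupby handles timedelta64 columns natively, the extra pd.to_timedelta conversion should not be needed. A sketch under that assumption, rebuilding the frame above by hand and grouping on the calendar day of date1:

```python
import pandas as pd

df = pd.DataFrame({
    "date1": pd.to_datetime([
        "2013-11-01 12:06", "2013-11-01 14:53", "2013-11-04 14:04",
        "2013-11-05 12:11", "2013-11-05 14:42",
    ]),
    "td": pd.to_timedelta(
        ["05:56:00", "02:02:00", "08:02:00", "06:15:00", "02:04:00"]
    ),
})

# Group by the day part of date1 and total the durations per day
daily = df.groupby(df["date1"].dt.normalize())["td"].sum()
print(daily)
```

The three daily totals should match the 07:58:00, 08:02:00 and 08:19:00 shown in Out[26].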

Answer 1: (score: 0)

This is a bug. Below is an example shorter than the one in the original question.

Setup:

import datetime
import pandas as pd
parms = {'d':  datetime.datetime(2013, 11, 5, 5, 56), 't':datetime.timedelta(0, 22500)}
df = pd.DataFrame(columns=list('dt'))
df = df.append(parms, ignore_index=True)

Code that shows the bug:

>>> df.append(parms, ignore_index=True)
                    d               t
0 2013-11-05 05:56:00  22500000000000
1 2013-11-05 05:56:00         6:15:00
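
One way to sidestep row-by-row appends altogether is to collect the rows as plain dicts and build the frame once at the end. DataFrame.append was later deprecated and removed in pandas 2.0, so this pattern is also the one that works on current releases. A minimal sketch, using two IN/OUT pairs taken from the question's printed output:

```python
import datetime
import pandas as pd

pairs = [
    (datetime.datetime(2013, 11, 1, 6, 10), datetime.datetime(2013, 11, 1, 12, 6)),
    (datetime.datetime(2013, 11, 1, 12, 51), datetime.datetime(2013, 11, 1, 14, 53)),
]

rows = []  # collect plain dicts first instead of appending to the frame
for xx, yy in pairs:
    rows.append({"datein": xx, "dateout": yy, "time": yy - xx})

# Build the frame in one step; 'time' should be inferred as a timedelta dtype
df = pd.DataFrame(rows, columns=["datein", "dateout", "time"])
print(df.dtypes)
print(df["time"].sum())
```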