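REVISED: After looking into this further, my first problem seems to be how I build the DataFrame: a for loop that calls df.append.
I use a print statement to show the "rows to append" section below, then assign the variables to a dict and append that dict to a DataFrame that starts out empty.
The first row appended seems to have the type of its entry in the last column changed.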
for line in dat.readlines():
    ...
    if xx is not None and yy is not None and xx < yy:
        zz = yy - xx
        print (yy, xx, zz)
        parms['datein'] = xx
        parms['dateout'] = yy
        parms['time'] = zz
        df = df.append(parms, ignore_index=True)
        yy = None
        xx = None
At this point the DataFrame has an odd entry in the first row of the last column.
See below:
datein dateout time
0 2013-11-01 06:10:00 2013-11-01 12:06:00 21360000000000
1 2013-11-01 12:51:00 2013-11-01 14:53:00 2:02:00
2 2013-11-04 06:02:00 2013-11-04 14:04:00 8:02:00
3 2013-11-05 05:56:00 2013-11-05 12:11:00 6:15:00
So this is now a DataFrame.append question.
Trying to add up the datetime.timedelta values in the column raises an error:
TypeError: unsupported operand type(s) for +: 'long' and 'datetime.timedelta'
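For illustration, the same kind of error can be reproduced without pandas. This is only a sketch: the two values are copied from the first two entries of the 'time' column shown above, and on Python 3 the message names 'int' rather than 'long'.

```python
import datetime

# First entry as it appears in the frame (a plain integer),
# second entry as a real timedelta.
values = [21360000000000, datetime.timedelta(0, 7320)]

try:
    total = sum(values)
    message = None
except TypeError as exc:
    # Adding an integer to a timedelta is not defined.
    message = str(exc)

print(message)
```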
Displaying the DataFrame itself shows a problem in the first row that is not present in the rows being appended.
Rows to append:
(datetime.datetime(2013, 11, 1, 12, 6), datetime.datetime(2013, 11, 1, 6, 10),datetime.timedelta(0, 21360))
(datetime.datetime(2013, 11, 1, 14, 53), datetime.datetime(2013, 11, 1, 12, 51),datetime.timedelta(0, 7320))
(datetime.datetime(2013, 11, 4, 14, 4), datetime.datetime(2013, 11, 4, 6, 2),datetime.timedelta(0, 28920))
(datetime.datetime(2013, 11, 5, 12, 11), datetime.datetime(2013, 11, 5, 5, 56),datetime.timedelta(0, 22500))
(datetime.datetime(2013, 11, 5, 14, 42), datetime.datetime(2013, 11, 5, 12, 38),datetime.timedelta(0, 7440))
The DataFrame is built from the dat file in a for loop, using the following:
parms = dict.fromkeys(keys)
df = pd.DataFrame(columns=keys)
df = df.append(parms,ignore_index=True)
DataFrame output:
time
dateout
2013-11-01 12:06:00 21360000000000
2013-11-01 14:53:00 2:02:00
2013-11-04 14:04:00 8:02:00
2013-11-05 12:11:00 6:15:00
2013-11-05 14:42:00 2:04:00
I am using
df.groupby(df.index.date).sum()
but the first row in the frame seems to be throwing it off.
Any ideas why the first row shows that odd 'long' value?
Revision 2
Dat file:
IN 11/01/2013 14:32
OUT 11/01/2013 18:32
IN 11/01/2013 18:58
OUT 11/01/2013 20:57
IN 11/04/2013 14:33
OUT 11/04/2013 18:30
IN 11/04/2013 18:57
OUT 11/04/2013 23:01
IN 11/05/2013 14:29
OUT 11/05/2013 18:31
IN 11/05/2013 18:58
OUT 11/05/2013 23:01
IN 11/06/2013 14:30
OUT 11/06/2013 18:31
IN 11/06/2013 18:57
OUT 11/06/2013 23:00
IN 11/07/2013 14:30
OUT 11/07/2013 18:31
Code:
import numpy as np
import pandas as pd
import struct
from datetime import datetime

keys = ['datein', 'dateout', 'time']
parms = dict.fromkeys(keys)
df = pd.DataFrame(columns=keys)
xx = None  ## <-- last IN timestamp seen, not yet paired
yy = None  ## <-- last OUT timestamp seen, not yet paired
dat = open(datfile, 'r')
for line in dat.readlines():
    opt, date, mgr, tim = line[:3], line[6:16], line[18:22], line[24:29]
    f = datetime.combine(datetime.strptime(date, '%m/%d/%Y'), datetime.strptime(tim, '%H:%M').time())
    if opt == 'IN ' and f > datetime(2013, 11, 1) and f < datetime(2013, 12, 1):
        xx = f
    if opt == 'OUT' and f > datetime(2013, 11, 1) and f < datetime(2013, 12, 1):
        yy = f
    if xx is not None and yy is not None and xx < yy:
        zz = yy - xx
        print (yy, xx, zz)  ## <-- Check lines before appending dataframe, output below
        parms['datein'] = xx
        parms['dateout'] = yy
        parms['time'] = zz
        df = df.append(parms, ignore_index=True)
        if len(df.index) == 1:
            print(df.dtypes)  ## <---- This shows 'time' as timedelta64
        if len(df.index) == 2:
            print(df.dtypes)  ## after next line appended 'time' shows 'object' and first line loses type.
        yy = None  ## <-- reset before next loop
        xx = None  ## <-- reset before next loop
print(df.dtypes)
print(df)
Output of the print of each row before it is appended to df:
(datetime.datetime(2013, 11, 4, 14, 4), datetime.datetime(2013, 11, 4, 6, 2),datetime.timedelta(0, 28920))
(datetime.datetime(2013, 11, 5, 12, 11), datetime.datetime(2013, 11, 5, 5, 56), datetime.timedelta(0, 22500))
(datetime.datetime(2013, 11, 5, 14, 42), datetime.datetime(2013, 11, 5, 12, 38), datetime.timedelta(0, 7440))
(datetime.datetime(2013, 11, 6, 12, 7), datetime.datetime(2013, 11, 6, 5, 49), datetime.timedelta(0, 22680))
(datetime.datetime(2013, 11, 6, 14, 37), datetime.datetime(2013, 11, 6, 12, 24), datetime.timedelta(0, 7980))
(datetime.datetime(2013, 11, 7, 14, 7), datetime.datetime(2013, 11, 7, 6, 8), datetime.timedelta(0, 28740))
(datetime.datetime(2013, 11, 8, 11, 58), datetime.datetime(2013, 11, 8, 5, 53), datetime.timedelta(0, 21900))
(datetime.datetime(2013, 11, 8, 14, 10), datetime.datetime(2013, 11, 8, 12, 21), datetime.timedelta(0, 6540))
(datetime.datetime(2013, 11, 11, 12, 16), datetime.datetime(2013, 11, 11, 6, 6), datetime.timedelta(0, 22200))
(datetime.datetime(2013, 11, 11, 14, 31), datetime.datetime(2013, 11, 11, 12, 49), datetime.timedelta(0, 6120))
And the dtypes:
datein datetime64[ns]
dateout datetime64[ns]
time object
dtype: object
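When a column reports dtype object like this, one way to find the odd entry is to list the Python type of every element. The following is a minimal sketch, not taken from my actual script: the Series `time_col` is a hypothetical stand-in built from the first two values of the 'time' column.

```python
import datetime
import pandas as pd

# Hypothetical stand-in for the 'time' column: one integer, one timedelta.
time_col = pd.Series(
    [21360000000000, datetime.timedelta(0, 7320)],
    dtype=object,
)

# Name of the Python type held in each row.
kinds = time_col.map(lambda v: type(v).__name__).tolist()
print(kinds)
```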
The DataFrame, showing that the first row is off:
datein dateout time
0 2013-11-01 06:10:00 2013-11-01 12:06:00 21360000000000
1 2013-11-01 12:51:00 2013-11-01 14:53:00 2:02:00
2 2013-11-04 06:02:00 2013-11-04 14:04:00 8:02:00
3 2013-11-05 05:56:00 2013-11-05 12:11:00 6:15:00
4 2013-11-05 12:38:00 2013-11-05 14:42:00 2:04:00
5 2013-11-06 05:49:00 2013-11-06 12:07:00 6:18:00
6 2013-11-06 12:24:00 2013-11-06 14:37:00 2:13:00
7 2013-11-07 06:08:00 2013-11-07 14:07:00 7:59:00
8 2013-11-08 05:53:00 2013-11-08 11:58:00 6:05:00
9 2013-11-08 12:21:00 2013-11-08 14:10:00 1:49:00
10 2013-11-11 06:06:00 2013-11-11 12:16:00 6:10:00
11 2013-11-11 12:49:00 2013-11-11 14:31:00 1:42:00
12 2013-11-12 06:04:00 2013-11-12 12:24:00 6:20:00
13 2013-11-12 12:40:00 2013-11-12 12:59:00 0:19:00
14 2013-11-13 06:04:00 2013-11-13 12:19:00 6:15:00
15 2013-11-13 12:42:00 2013-11-13 14:35:00 1:53:00
16 2013-11-14 06:05:00 2013-11-14 12:22:00 6:17:00
Answer 0 (score: 0)
This uses pandas 0.13 (0.13rc1 at the time of writing). Timedelta support in earlier versions was only minimal.
In [23]: df
Out[23]:
date1 date2 td
0 2013-11-01 12:06:00 2013-11-01 06:10:00 05:56:00
1 2013-11-01 14:53:00 2013-11-01 12:51:00 02:02:00
2 2013-11-04 14:04:00 2013-11-04 06:02:00 08:02:00
3 2013-11-05 12:11:00 2013-11-05 05:56:00 06:15:00
4 2013-11-05 14:42:00 2013-11-05 12:38:00 02:04:00
[5 rows x 3 columns]
In [24]: df2 = df.set_index('date1')
In [25]: df2
Out[25]:
date2 td
date1
2013-11-01 12:06:00 2013-11-01 06:10:00 05:56:00
2013-11-01 14:53:00 2013-11-01 12:51:00 02:02:00
2013-11-04 14:04:00 2013-11-04 06:02:00 08:02:00
2013-11-05 12:11:00 2013-11-05 05:56:00 06:15:00
2013-11-05 14:42:00 2013-11-05 12:38:00 02:04:00
[5 rows x 2 columns]
In [26]: pd.to_timedelta(df2.groupby(df2.index.date)['td'].sum())
Out[26]:
2013-11-01 07:58:00
2013-11-04 08:02:00
2013-11-05 08:19:00
Name: td, dtype: timedelta64[ns]
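Even this is not fully supported, since I have to convert the output back to timedeltas. This is currently an open issue for groupby and should be fixed in 0.14, see here: https://github.com/pydata/pandas/issues/5724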
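The session above can be condensed into a self-contained sketch. It rebuilds the same five rows by hand (the construction of `df` is my assumption about how the frame was made) and keeps the pd.to_timedelta wrapper, which should be harmless on versions where the grouped sum already comes back as timedeltas.

```python
import pandas as pd

# The five rows from the session above; 'td' is computed as date1 - date2.
date1 = pd.to_datetime([
    "2013-11-01 12:06", "2013-11-01 14:53", "2013-11-04 14:04",
    "2013-11-05 12:11", "2013-11-05 14:42",
])
date2 = pd.to_datetime([
    "2013-11-01 06:10", "2013-11-01 12:51", "2013-11-04 06:02",
    "2013-11-05 05:56", "2013-11-05 12:38",
])
df = pd.DataFrame({"date1": date1, "date2": date2})
df["td"] = df["date1"] - df["date2"]

# Index by date1, group by calendar day, and sum the timedeltas.
df2 = df.set_index("date1")
daily = pd.to_timedelta(df2.groupby(df2.index.date)["td"].sum())
print(daily)
```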
Answer 1 (score: 0)
This is a bug. Below is a shorter example than the one in the original question.
Setup:
import datetime
import pandas as pd
parms = {'d': datetime.datetime(2013, 11, 5, 5, 56), 't':datetime.timedelta(0, 22500)}
df = pd.DataFrame(columns=list('dt'))
df = df.append(parms, ignore_index=True)
The code that goes wrong:
>>> df.append(parms, ignore_index=True)
d t
0 2013-11-05 05:56:00 22500000000000
1 2013-11-05 05:56:00 6:15:00
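A possible way around it, sketched here as a suggestion rather than a confirmed fix for this bug, is to avoid appending row by row: collect the dicts in a list and build the DataFrame once. The `rows` list is introduced only for illustration. Note also that DataFrame.append was removed in pandas 2.0, so on recent versions some form of this is needed anyway.

```python
import datetime
import pandas as pd

parms = {'d': datetime.datetime(2013, 11, 5, 5, 56),
         't': datetime.timedelta(0, 22500)}

# Collect one dict per row instead of calling df.append in a loop.
rows = [dict(parms), dict(parms)]

# Build the frame in a single step so pandas infers each column's dtype once.
df = pd.DataFrame(rows, columns=['d', 't'])

print(df.dtypes)
print(df['t'].sum())
```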