I'm importing data from a .json file into a pandas DataFrame, and the result is a bit broken:
>> print df
summary response_date
8.0 {u'$date': u'2009-02-19T10:54:00.000+0000'}
11.0 {u'$date': u'2009-02-24T11:23:45.000+0000'}
14.0 {u'$date': u'2009-03-03T17:55:07.000+0000'}
16.0 {u'$date': u'2009-03-10T12:23:04.000+0000'}
19.0 {u'$date': u'2009-03-17T17:19:55.000+0000'}
13.0 {u'$date': u'2009-03-25T15:10:52.000+0000'}
22.0 {u'$date': u'2009-04-02T16:57:31.000+0100'}
15.0 {u'$date': u'2009-04-08T22:29:09.000+0100'}
20.0 {u'$date': u'2009-04-16T18:14:20.000+0100'}
13.0 {u'$date': u'2009-04-29T10:47:06.000+0100'}
15.0 {u'$date': u'2009-05-06T13:45:45.000+0100'}
20.0 {u'$date': u'2009-05-26T10:41:52.000+0100'}
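How can I get rid of the '$date' wrapper and the rest of the clutter, and create a normal column containing the date and time? To convert the ISO 8601 format I normally use: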
df.response_date = pd.to_datetime(df.response_date)
Update 1
summary response_date closed_date open_date
24.0 2011-10-15T00:00:00.000+0100 NaN NaN
24.0 2011-11-24T09:00:00.000+0000 NaN NaN
19.0 2011-10-01T09:00:00.000+0100 NaN NaN
25.0 2011-10-29T09:00:00.000+0100 NaN NaN
19.0 2011-10-08T09:00:00.000+0100 NaN NaN
-1.0 2011-11-09T17:20:00.000+0000 {u'$date': u'2011-11-16T15:20:00.000+0000'} {u'$date': u'2011-11-09T15:20:00.000+0000'}
-1.0 2011-11-16T17:20:00.000+0000 {u'$date': u'2011-11-23T15:20:00.000+0000'} {u'$date': u'2011-11-16T15:20:00.000+0000'}
-1.0 2011-11-23T17:20:00.000+0000 {u'$date': u'2011-11-30T15:20:00.000+0000'} {u'$date': u'2011-11-23T15:20:00.000+0000'}
-1.0 2011-11-30T17:20:00.000+0000 {u'$date': u'2011-12-07T15:20:00.000+0000'} {u'$date': u'2011-11-30T15:20:00.000+0000'}
So,
>> df.response_date = pd.DataFrame(df.response_date.values.tolist())
works perfectly, but the other columns contain NaN values, and filling them with "-1" doesn't help.
>> print type(df.ix[0,'scheduleClosedAt'])
<type 'int'>
Update 2
Why doesn't this (masking) approach work?
>> df.reset_index(inplace=True)
>> indx_nan_closed = df.closed_date.isnull()
>> df[~indx_nan_closed].closed_date = pd.DataFrame(df[~indx_nan_closed].closed_date.values.tolist())
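This line is equivalent to the one above, but with a boolean mask, so I expected it to apply the method only to the non-NaN values. Instead, my DataFrame "df" stays unchanged, which is strange.
Any ideas?
Answer 0 (score: 2)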
If the type of the values in response_date is dict, you can use the DataFrame constructor after converting the column to a list with values and tolist:
print (type(df.loc[0,'response_date']))
<class 'dict'>
df.response_date = pd.DataFrame(df.response_date.values.tolist())
df.response_date = pd.to_datetime(df.response_date)
print (df)
summary response_date
0 8.0 2009-02-19 10:54:00
1 11.0 2009-02-24 11:23:45
2 14.0 2009-03-03 17:55:07
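The question's data mixes +0000 and +0100 offsets, and recent pandas versions are stricter about parsing mixed offsets in one call. A minimal sketch, assuming the values really are dicts with a '$date' key and that normalising everything to UTC is acceptable (the sample rows below are taken from the question; utc=True is my addition, not part of the answer above):

```python
import pandas as pd

df = pd.DataFrame({
    'summary': [13.0, 22.0],
    'response_date': [{'$date': '2009-03-25T15:10:52.000+0000'},
                      {'$date': '2009-04-02T16:57:31.000+0100'}],
})

# Expand the dicts into a one-column frame and pick the '$date' column.
extracted = pd.DataFrame(df['response_date'].tolist(), index=df.index)['$date']

# utc=True converts every timestamp to UTC, so mixed offsets
# end up in a single timezone-aware datetime column.
df['response_date'] = pd.to_datetime(extracted, utc=True)
print(df)
```

With utc=True the +0100 rows are shifted by one hour, so the second row becomes 15:57:31 UTC.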
If the type is str, use split and strip:
print (type(df.loc[0,'response_date']))
<class 'str'>
df.response_date = df.response_date.str.split().str[1].str.strip("'u}")
df.response_date = pd.to_datetime(df.response_date)
print (df)
summary response_date
0 8.0 2009-02-19 10:54:00
1 11.0 2009-02-24 11:23:45
2 14.0 2009-03-03 17:55:07
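The split/strip chain depends on the exact spacing and quoting of the stringified dict. As an alternative sketch for the string case, assuming every value contains one ISO 8601 timestamp with a numeric offset, a regular expression with str.extract pulls the timestamp out directly (the pattern and the variable names are illustrative, not from the answer above):

```python
import pandas as pd

s = pd.Series(["{u'$date': u'2009-02-19T10:54:00.000+0000'}",
               "{u'$date': u'2009-04-02T16:57:31.000+0100'}"])

# One capture group: date, 'T', time with optional fraction, numeric offset.
pattern = r"(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(?:\.\d+)?[+-]\d{4})"

# expand=False returns a Series because there is a single capture group.
iso = s.str.extract(pattern, expand=False)

# Normalise to UTC so rows with different offsets share one dtype.
parsed = pd.to_datetime(iso, utc=True)
print(parsed)
```

Rows that do not match the pattern come back as NaN from str.extract and therefore as NaT after to_datetime.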
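Edit, following the comments:
There are 2 possible solutions:
The first is to fill the missing values with an empty dict using fillna:
df.closed_date = df.closed_date.fillna(pd.Series([{}]))
Sample data: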
import numpy as np
import pandas as pd
df = pd.DataFrame({'summary':[19.0, -1.0,-1.0],
'response_date':['2011-10-08T09:00:00.000+0100','2011-11-09T17:20:00.000+0000','2011-11-16T17:20:00.000+0000'],
'closed_date':[np.nan, {u'$date': u'2011-11-16T15:20:00.000+0000'}, {u'$date': u'2011-11-23T15:20:00.000+0000'}]},
columns=['summary','response_date','closed_date'])
print (df)
summary response_date \
0 19.0 2011-10-08T09:00:00.000+0100
1 -1.0 2011-11-09T17:20:00.000+0000
2 -1.0 2011-11-16T17:20:00.000+0000
closed_date
0 NaN
1 {'$date': '2011-11-16T15:20:00.000+0000'}
2 {'$date': '2011-11-23T15:20:00.000+0000'}
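One caveat with fillna(pd.Series([{}])): the Series passed to fillna is aligned on the index, so it only fills the row labelled 0, which happens to be the only missing row in this sample. A hedged sketch of the same idea for NaN in arbitrary rows, using a list comprehension instead of fillna (the sample data and the names dicts and expanded are mine):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'summary': [-1.0, 19.0, -1.0],
                   'closed_date': [{'$date': '2011-11-16T15:20:00.000+0000'},
                                   np.nan,
                                   {'$date': '2011-11-23T15:20:00.000+0000'}]})

# Replace every non-dict entry (NaN here) with an empty dict.
dicts = [v if isinstance(v, dict) else {} for v in df['closed_date']]

# Each dict becomes one row; an empty dict leaves '$date' as NaN.
expanded = pd.DataFrame(dicts, index=df.index)

# NaN becomes NaT; utc=True keeps mixed offsets in one dtype.
df['closed_date'] = pd.to_datetime(expanded['$date'], utc=True)
print(df)
```

This assumes at least one row holds a dict; if the whole column is NaN, expanded has no '$date' column and that case needs separate handling.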
The other is to select only the non-null values with notnull:
a = df.loc[df.closed_date.notnull(), 'closed_date']
print (a)
1 {'$date': '2011-11-16T15:20:00.000+0000'}
2 {'$date': '2011-11-23T15:20:00.000+0000'}
Name: closed_date, dtype: object
df['closed_date'] = pd.DataFrame(a.values.tolist(), index=a.index)
df.closed_date = pd.to_datetime(df.closed_date)
print (df)
summary response_date closed_date
0 19.0 2011-10-08T09:00:00.000+0100 NaT
1 -1.0 2011-11-09T17:20:00.000+0000 2011-11-16 15:20:00
2 -1.0 2011-11-16T17:20:00.000+0000 2011-11-23 15:20:00
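As for why the masking attempt in Update 2 leaves df unchanged: df[~indx_nan_closed] returns a new object, so assigning to its closed_date attribute modifies that temporary copy rather than df (chained assignment). A minimal sketch of a masked write that does reach df, using a single .loc call with the row mask and the column label (the sample data and the lambda are illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'summary': [19.0, -1.0],
                   'closed_date': [np.nan,
                                   {'$date': '2011-11-16T15:20:00.000+0000'}]})

mask = df['closed_date'].notnull()

# One .loc call with both the row mask and the column label
# writes into df itself instead of into a temporary copy.
df.loc[mask, 'closed_date'] = df.loc[mask, 'closed_date'].apply(lambda d: d['$date'])

# The remaining NaN becomes NaT.
df['closed_date'] = pd.to_datetime(df['closed_date'], utc=True)
print(df)
```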