I'm creating a DataFrame in pandas:
import pandas

df_data = dict()
for x in data:
    # One Series per record, indexed by that record's timestamps
    series = pandas.Series(x['value']['values'], index=x['value']['timestamps'])
    df_data[x['_id']] = series
df = pandas.DataFrame(df_data)
data is a list of dicts in the following format:
{u'_id': u'770000000049',
u'value': {u'timestamps': [datetime.datetime(2012, 7, 25, 10, 16, 1, 270000),
datetime.datetime(2012, 7, 25, 10, 18, 29, 745000),
datetime.datetime(2012, 7, 25, 10, 21, 54, 931000),
datetime.datetime(2012, 7, 25, 10, 23, 18, 896000)],
u'values': [204.0, 16.788, 139.2, 116.004]}}
Printing a sample Series gives me:
>>> print df_data['770000000049']
2012-07-25 10:16:01.270000 204.000
2012-07-25 10:18:29.745000 16.788
2012-07-25 10:21:54.931000 139.200
2012-07-25 10:23:18.896000 116.004
as expected. However, printing the same column from the resulting DataFrame gives me:
>>> print df['770000000049']
1992-06-05 15:50:11.527680 NaN
2181-10-17 22:55:34.850625 NaN
2215-08-27 21:41:15.306049 NaN
1936-05-22 00:55:45.848401 NaN
1783-06-08 06:38:26.257076 NaN
2017-03-12 18:30:17.469108 NaN
2209-08-06 03:45:09.779652 NaN
1768-02-06 12:00:22.653272 NaN
1916-07-20 06:51:31.628376 NaN
2086-01-25 18:30:58.261336 NaN
1940-08-26 15:13:33.790568 NaN
1712-12-17 22:48:01.743241 NaN
1803-06-16 16:32:58.309017 NaN
1981-11-05 04:38:27.140059 NaN
2246-05-25 09:09:27.875035 NaN
...
WTF! The data is all wrong. Both the index keys and the values are completely wrong.
What am I doing wrong?
Edit: printing df gives me:
DatetimeIndex: 386 entries, 1992-06-05 15:50:11.527680 to 1774-08-13 02:00:15.237103
Data columns:
770000000006 0 non-null values
770000000009 0 non-null values
770000000010 0 non-null values
770000000011 0 non-null values
770000000012 0 non-null values
770000000013 0 non-null values
770000000018 0 non-null values
770000000020 0 non-null values
770000000021 0 non-null values
770000000022 0 non-null values
770000000024 0 non-null values
770000000029 0 non-null values
770000000030 0 non-null values
770000000032 0 non-null values
770000000034 0 non-null values
770000000049 0 non-null values
dtypes: float64(16)
Completely wrong.
Edit 2:
I've written a module that reproduces this error for me.
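Answer 0: (score: 1)
Edit: It was a bug. I (Wes) fixed it here: https://github.com/pydata/pandas/commit/aea7c4522bd7beffd0df80efee818873110609fa
It turns out it's not a bug:
While pandas does not force you to have a sorted date index, some of these methods may have unexpected or incorrect behavior if the dates are unsorted. So please be careful.
Sorting the dates at the database level solved the problem for me.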
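If sorting at the database level is not an option, the same workaround can probably be applied in pandas itself. The following is only a minimal sketch: the record below is made up (same shape as the data in the question, with the timestamps deliberately shuffled), and it assumes a reasonably recent pandas running on Python 3. Series.sort_index() is a standard pandas method; everything else mirrors the loop from the question.

```python
import datetime
import pandas as pd

# Hypothetical record in the same shape as the question's data,
# with timestamps deliberately out of order.
data = [
    {
        "_id": "770000000049",
        "value": {
            "timestamps": [
                datetime.datetime(2012, 7, 25, 10, 21, 54, 931000),
                datetime.datetime(2012, 7, 25, 10, 16, 1, 270000),
                datetime.datetime(2012, 7, 25, 10, 23, 18, 896000),
                datetime.datetime(2012, 7, 25, 10, 18, 29, 745000),
            ],
            "values": [139.2, 204.0, 116.004, 16.788],
        },
    },
]

df_data = {}
for x in data:
    series = pd.Series(x["value"]["values"], index=x["value"]["timestamps"])
    # Sort by timestamp so each Series enters the DataFrame
    # with a monotonically increasing index.
    df_data[x["_id"]] = series.sort_index()

df = pd.DataFrame(df_data)
print(df["770000000049"])
```

Sorting each Series before building the DataFrame keeps the fix next to the code that needs it, so it does not depend on how the query happens to order its results.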
Answer 1: (score: 0)
I ran the snippet you pasted and it seems to work fine for me. Which version of pandas / numpy are you using? Can you post all (or more) of the data?
In [26]: paste
{u'_id': u'770000000049',
u'value': {u'timestamps': [datetime.datetime(2012, 7, 25, 10, 16, 1, 270000),
datetime.datetime(2012, 7, 25, 10, 18, 29, 745000),
datetime.datetime(2012, 7, 25, 10, 21, 54, 931000),
datetime.datetime(2012, 7, 25, 10, 23, 18, 896000)],
u'values': [204.0, 16.788, 139.2, 116.004]}}
## -- End pasted text --
Out[26]:
{u'_id': u'770000000049',
u'value': {u'timestamps': [datetime.datetime(2012, 7, 25, 10, 16, 1, 270000),
datetime.datetime(2012, 7, 25, 10, 18, 29, 745000),
datetime.datetime(2012, 7, 25, 10, 21, 54, 931000),
datetime.datetime(2012, 7, 25, 10, 23, 18, 896000)],
u'values': [204.0, 16.788, 139.2, 116.004]}}
In [27]: data = [_]
In [28]: paste
df_data = dict()
for x in data:
    series = pandas.Series(x['value']['values'], index=x['value']['timestamps'])
    df_data[x['_id']] = series
df = pandas.DataFrame(df_data)
## -- End pasted text --
In [29]: print df['770000000049']
2012-07-25 10:16:01.270000 204.000
2012-07-25 10:18:29.745000 16.788
2012-07-25 10:21:54.931000 139.200
2012-07-25 10:23:18.896000 116.004
Name: 770000000049
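Since the result seems to differ between installations, it helps to include the installed versions when reporting back. A minimal sketch, assuming both libraries are importable; the __version__ attribute is provided by both pandas and numpy:

```python
import numpy
import pandas

# Print the installed versions so they can be pasted into the question.
print("pandas", pandas.__version__)
print("numpy", numpy.__version__)
```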