Pandas messing up the DataFrame

Time: 2012-08-07 14:50:43

Tags: python numpy pandas

I am creating a DataFrame in pandas:

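import pandas

# Build one Series per item, keyed by the item's _id
df_data = dict()

for x in data:
    series = pandas.Series(x['value']['values'], index=x['value']['timestamps'])

    df_data[x['_id']] = series

# Combine the Series into one DataFrame, aligned on the timestamps
df = pandas.DataFrame(df_data)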

data is a list of items in this format:

{u'_id': u'770000000049',
 u'value': {u'timestamps': [datetime.datetime(2012, 7, 25, 10, 16, 1, 270000),
                            datetime.datetime(2012, 7, 25, 10, 18, 29, 745000),
                            datetime.datetime(2012, 7, 25, 10, 21, 54, 931000),
                            datetime.datetime(2012, 7, 25, 10, 23, 18, 896000)],
            u'values': [204.0, 16.788, 139.2, 116.004]}}

Printing a sample series gives me:

>>> print df_data['770000000049']

2012-07-25 10:16:01.270000    204.000
2012-07-25 10:18:29.745000     16.788
2012-07-25 10:21:54.931000    139.200
2012-07-25 10:23:18.896000    116.004

As expected. However, printing the resulting DataFrame gives me:

>>> print df['770000000049']

1992-06-05 15:50:11.527680   NaN
2181-10-17 22:55:34.850625   NaN
2215-08-27 21:41:15.306049   NaN
1936-05-22 00:55:45.848401   NaN
1783-06-08 06:38:26.257076   NaN
2017-03-12 18:30:17.469108   NaN
2209-08-06 03:45:09.779652   NaN
1768-02-06 12:00:22.653272   NaN
1916-07-20 06:51:31.628376   NaN
2086-01-25 18:30:58.261336   NaN
1940-08-26 15:13:33.790568   NaN
1712-12-17 22:48:01.743241   NaN
1803-06-16 16:32:58.309017   NaN
1981-11-05 04:38:27.140059   NaN
2246-05-25 09:09:27.875035   NaN
...

WTF! The data is all wrong. Both the keys and the values are completely wrong.

What am I doing wrong?

Edit: printing df gives me:

DatetimeIndex: 386 entries, 1992-06-05 15:50:11.527680 to 1774-08-13 02:00:15.237103
Data columns:
770000000006    0  non-null values
770000000009    0  non-null values
770000000010    0  non-null values
770000000011    0  non-null values
770000000012    0  non-null values
770000000013    0  non-null values
770000000018    0  non-null values
770000000020    0  non-null values
770000000021    0  non-null values
770000000022    0  non-null values
770000000024    0  non-null values
770000000029    0  non-null values
770000000030    0  non-null values
770000000032    0  non-null values
770000000034    0  non-null values
770000000049    0  non-null values
dtypes: float64(16)

Completely wrong.

Edit 2

I have written a module that reproduces this error for me.

2 answers:

Answer 0 (score: 1)

Edit: It is a bug. I (Wes) fixed it here: https://github.com/pydata/pandas/commit/aea7c4522bd7beffd0df80efee818873110609fa


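It turns out it's not a bug:

While pandas does not force you to have a sorted date index, some of these methods may have unexpected or incorrect behavior if the dates are unsorted. So please be careful.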

Sorting the dates at the database level solved the problem for me.
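If sorting at the database level is not an option, one possible alternative is to sort each Series by its timestamps before building the DataFrame. The following is only a rough sketch: the sample item reuses the values from the question but with the timestamps deliberately shuffled (an assumption made for illustration), and I have not checked whether this avoids the problem on the affected pandas version.

```python
import datetime

import pandas

# Hypothetical sample item in the question's format, with the
# timestamps deliberately out of order.
data = [
    {
        "_id": "770000000049",
        "value": {
            "timestamps": [
                datetime.datetime(2012, 7, 25, 10, 21, 54, 931000),
                datetime.datetime(2012, 7, 25, 10, 16, 1, 270000),
                datetime.datetime(2012, 7, 25, 10, 23, 18, 896000),
                datetime.datetime(2012, 7, 25, 10, 18, 29, 745000),
            ],
            "values": [139.2, 204.0, 116.004, 16.788],
        },
    }
]

df_data = {}
for x in data:
    series = pandas.Series(x["value"]["values"], index=x["value"]["timestamps"])
    # Sort each Series by its timestamp index before combining.
    df_data[x["_id"]] = series.sort_index()

df = pandas.DataFrame(df_data)
print(df["770000000049"])
```

Sorting per Series keeps each value attached to its own timestamp, so only the row order changes.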

Answer 1 (score: 0)

I ran the snippet you pasted and it seems to work fine for me. Which version of pandas / numpy are you using? Can you post all (or more) of the data?

In [26]: paste
{u'_id': u'770000000049',
 u'value': {u'timestamps': [datetime.datetime(2012, 7, 25, 10, 16, 1, 270000),
                            datetime.datetime(2012, 7, 25, 10, 18, 29, 745000),
                            datetime.datetime(2012, 7, 25, 10, 21, 54, 931000),
                            datetime.datetime(2012, 7, 25, 10, 23, 18, 896000)],
            u'values': [204.0, 16.788, 139.2, 116.004]}}
## -- End pasted text --
Out[26]: 
{u'_id': u'770000000049',
 u'value': {u'timestamps': [datetime.datetime(2012, 7, 25, 10, 16, 1, 270000),
   datetime.datetime(2012, 7, 25, 10, 18, 29, 745000),
   datetime.datetime(2012, 7, 25, 10, 21, 54, 931000),
   datetime.datetime(2012, 7, 25, 10, 23, 18, 896000)],
  u'values': [204.0, 16.788, 139.2, 116.004]}}

In [27]: data = [_]

In [28]: paste
df_data = dict()

for x in data:
    series = pandas.Series(x['value']['values'], index=x['value']['timestamps'])

    df_data[x['_id']] = series

df = pandas.DataFrame(df_data)
## -- End pasted text --

In [29]: print df['770000000049']
2012-07-25 10:16:01.270000    204.000
2012-07-25 10:18:29.745000     16.788
2012-07-25 10:21:54.931000    139.200
2012-07-25 10:23:18.896000    116.004
Name: 770000000049
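
To answer the version question, something like the following could be run in the same session. It is a minimal sketch that only relies on the `__version__` attribute that both packages expose; the `versions` dict is just a name chosen for illustration.

```python
import numpy
import pandas

# Collect the installed versions so they can be pasted into the question.
versions = {
    "pandas": pandas.__version__,
    "numpy": numpy.__version__,
}

for name in sorted(versions):
    print("%s %s" % (name, versions[name]))
```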