Question

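I am using Python to try out some simple time series analysis. I have daily [day] calorie intake data [cal2] for one person. I read the data from a Stata .dta file.

I do the following: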

import pandas as pd
import numpy as np
import matplotlib.pylab as plt
from pandas import read_csv
from matplotlib.pylab import rcParams


d = pd.read_stata('time_series_calories.dta', preserve_dtypes=True, index = 'day', convert_dates=True)
print(d.dtypes)
print(d.shape)
print(d.index)
print(d.head)

plt.plot(d)

This is what the data looks like:

0   2002-01-10  3668.433350
1   2002-01-11  3652.249756
2   2002-01-12  3647.866211
3   2002-01-13  3646.684326
4   2002-01-14  3661.941406
5   2002-01-15  3656.951660

The print statements show the following:

day     datetime64[ns]
cal2           float32
dtype: object
(251, 2)
Int64Index([  0,   1,   2,   3,   4,   5,   6,   7,   8,   9,
            ...
            241, 242, 243, 244, 245, 246, 247, 248, 249, 250],
           dtype='int64', length=251)

This is where the problem lies. The index should be identified as dtype='datetime64[ns]', that is, as a time series. It clearly is not. Why not?

Answer 1

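There is a discrepancy between the code provided, the data, and the types shown. Regardless of the type of cal2, the index = 'day' argument to pd.read_stata() should always make day the index, even if it does not have the desired type.

That said, the problem can be reproduced as follows.

First, create the dataset in Stata: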
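As a minimal sketch of that point: in recent pandas versions the keyword is index_col rather than index (this is an assumption about the installed version; the question's code uses the older spelling). The example writes a throwaway .dta file with the six rows from the post, so the temporary path and the rebuilt frame are illustrative only.

```python
import os
import tempfile

import pandas as pd

# Rebuild the six rows shown in the post (values copied from the question).
df = pd.DataFrame({
    "day": pd.date_range("2002-01-10", periods=6, freq="D"),
    "cal2": [3668.433350, 3652.249756, 3647.866211,
             3646.684326, 3661.941406, 3656.951660],
})

# Write a temporary .dta file; "td" stores day as a Stata daily date.
path = os.path.join(tempfile.mkdtemp(), "time_series_calories.dta")
df.to_stata(path, write_index=False, convert_dates={"day": "td"})

# index_col asks read_stata to use the converted day column as the index.
d = pd.read_stata(path, convert_dates=True, index_col="day")
print(type(d.index).__name__)
print(d.index.name)
```

If the index still comes back as integers, the keyword was probably ignored or misspelled, and calling set_index() after loading is the fallback.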

clear
input double day float cal2
15350  3668.433
15351   3652.25
15352  3647.866
15353  3646.684
15354 3661.9414
15355  3656.952
end
format %td day

save time_series_calories

describe

Contains data from time_series_calories.dta
  obs:             6                          
 vars:             2                          
 size:            72                          
----------------------------------------------------------------------------------------------------
              storage   display    value
variable name   type    format     label      variable label
----------------------------------------------------------------------------------------------------
day             double  %td                   
cal2            float   %9.0g                 
----------------------------------------------------------------------------------------------------
Sorted by:
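The integers typed into the input block are Stata daily dates, which count days elapsed since 1 January 1960. A small sketch, using only the six values above, shows how they map to the calendar dates that appear once the file is loaded:

```python
import pandas as pd

# Stata %td values copied from the input block above.
stata_days = [15350, 15351, 15352, 15353, 15354, 15355]

# A %td value is the number of days since 1960-01-01.
dates = pd.to_datetime(stata_days, unit="D", origin="1960-01-01")

for raw, converted in zip(stata_days, dates):
    print(raw, converted.date())
```

This is the conversion that convert_dates=True performs for a %td column when the file is read.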

Second, load the data in Pandas:

import pandas as pd
d = pd.read_stata('time_series_calories.dta', preserve_dtypes=True, convert_dates=True)

print(d.head())
         day         cal2
0 2002-01-10  3668.433350
1 2002-01-11  3652.249756
2 2002-01-12  3647.866211
3 2002-01-13  3646.684326
4 2002-01-14  3661.941406
5 2002-01-15  3656.951660

print(d.dtypes)
day     datetime64[ns]
cal2           float32
dtype: object

print(d.shape)
(6, 2)

print(d.index)
Int64Index([0, 1, 2, 3, 4, 5], dtype='int64')

To change the index as desired, use DataFrame.set_index():

d = d.set_index('day')

print(d.head())

                   cal2
day                    
2002-01-10  3668.433350
2002-01-11  3652.249756
2002-01-12  3647.866211
2002-01-13  3646.684326
2002-01-14  3661.941406
2002-01-15  3656.951660

print(d.index)
DatetimeIndex(['2002-01-10', '2002-01-11', '2002-01-12', '2002-01-13',
               '2002-01-14', '2002-01-15'],
              dtype='datetime64[ns]', name='day', freq=None)
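Once the index is a DatetimeIndex, the usual time series operations become available. The sketch below rebuilds the six rows by hand instead of reading the .dta file, so it runs on its own; the 3-day bin width is an arbitrary choice for illustration.

```python
import pandas as pd

# Same six rows as above, built directly so the example is self-contained.
d = pd.DataFrame({
    "day": pd.date_range("2002-01-10", periods=6, freq="D"),
    "cal2": [3668.433350, 3652.249756, 3647.866211,
             3646.684326, 3661.941406, 3656.951660],
}).set_index("day")

# Label-based slicing with date strings; both endpoints are included.
subset = d.loc["2002-01-11":"2002-01-13"]
print(subset)

# Resampling requires a datetime-like index; average over 3-day bins.
three_day_mean = d["cal2"].resample("3D").mean()
print(three_day_mean)
```

Neither call works with the default integer index, which is why the data was not behaving like a time series before set_index().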

If day is stored as a string in the Stata dataset, you can do the following:

d['day'] = pd.to_datetime(d.day)
d = d.set_index('day')
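A self-contained sketch of the string case follows. The frame is hypothetical (three rows with day typed as text), and the ISO-style date format is an assumption; adjust the format string to match how the dates are actually written.

```python
import pandas as pd

# Hypothetical frame where day arrived as text rather than a Stata date.
d = pd.DataFrame({
    "day": ["2002-01-10", "2002-01-11", "2002-01-12"],
    "cal2": [3668.433350, 3652.249756, 3647.866211],
})

# An explicit format avoids ambiguous parsing of day/month order.
d["day"] = pd.to_datetime(d["day"], format="%Y-%m-%d")
d = d.set_index("day")

print(type(d.index).__name__)
```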
