我正在使用python来尝试一些简单的时间序列分析。我每天[day]
有一个人[cal2]
的卡路里摄入数据。我从Stata .dta文件中获取数据。
我执行以下操作:
import pandas as pd
import numpy as np
import matplotlib.pylab as plt
from pandas import read_csv
from matplotlib.pylab import rcParams
d = pd.read_stata('time_series_calories.dta', preserve_dtypes=True, index = 'day', convert_dates=True)
print(d.dtypes)
print(d.shape)
print(d.index)
print(d.head)
plt.plot(d)
这就是数据的样子:
0 2002-01-10 3668.433350
1 2002-01-11 3652.249756
2 2002-01-12 3647.866211
3 2002-01-13 3646.684326
4 2002-01-14 3661.941406
5 2002-01-15 3656.951660
印刷品揭示了以下内容:
day datetime64[ns]
cal2 float32
dtype: object
(251, 2)
Int64Index([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9,
...
241, 242, 243, 244, 245, 246, 247, 248, 249, 250],
dtype='int64', length=251)
这就是问题所在。数据应标识为dtype='datatime64[ns]'
- 即标识为时间序列。它显然没有。为什么不呢?
答案 0 :(得分:1)
所提供的代码,数据和所示的类型之间存在差异。
这是因为与cal2
的类型无关,index = 'day'
参数
pd.read_stata()
中的值应始终呈现day
的索引,即使不是
所需的类型。
话虽如此,问题可以重现如下。
首先,在Stata中创建数据集:
clear
input double day float cal2
15350 3668.433
15351 3652.25
15352 3647.866
15353 3646.684
15354 3661.9414
15355 3656.952
end
format %td day
save time_series_calories
describe
Contains data from time_series_calories.dta
obs: 6
vars: 2
size: 72
----------------------------------------------------------------------------------------------------
storage display value
variable name type format label variable label
----------------------------------------------------------------------------------------------------
day double %td
cal2 float %9.0g
----------------------------------------------------------------------------------------------------
Sorted by:
第二,在Pandas中加载数据:
import pandas as pd
d = pd.read_stata('time_series_calories.dta', preserve_dtypes=True, convert_dates=True)
print(d.head)
day cal2
0 2002-01-10 3668.433350
1 2002-01-11 3652.249756
2 2002-01-12 3647.866211
3 2002-01-13 3646.684326
4 2002-01-14 3661.941406
5 2002-01-15 3656.951660
print(d.dtypes)
day datetime64[ns]
cal2 float32
dtype: object
print(d.shape)
(6, 2)
print(d.index)
Int64Index([0, 1, 2, 3, 4, 5], dtype='int64')
要根据需要更改索引,可以使用pd.set_index()
:
d = d.set_index('day')
print(d.head)
cal2
day
2002-01-10 3668.433350
2002-01-11 3652.249756
2002-01-12 3647.866211
2002-01-13 3646.684326
2002-01-14 3661.941406
2002-01-15 3656.951660
print(d.index)
DatetimeIndex(['2002-01-10', '2002-01-11', '2002-01-12', '2002-01-13',
'2002-01-14', '2002-01-15'],
dtype='datetime64[ns]', name='day', freq=None)
如果day
是Stata数据集中的字符串,则可以执行以下操作:
d['day'] = pd.to_datetime(d.day)
d = d.set_index('day')