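Is there a way to estimate the periodicity of a pandas time series? In R, xts
objects have a method named periodicity
that serves exactly this purpose. Is there an equivalent already implemented in pandas?
For example, can we infer the frequency of a time series that has no frequency specified?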
import pandas.io.data as web
aapl = web.get_data_yahoo("AAPL")
aapl.index
<class 'pandas.tseries.index.DatetimeIndex'>
[2010-01-04 00:00:00, ..., 2013-12-19 00:00:00]
Length: 999, Freq: None, Timezone: None
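The frequency of this series can reasonably be approximated as daily.
Update
I thought it might help to show the source code of R's periodicity method.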
function (x, ...)
{
    if (timeBased(x) || !is.xts(x))
        x <- try.xts(x, error = "'x' needs to be timeBased or xtsible")
    p <- median(diff(.index(x)))
    if (is.na(p))
        stop("can not calculate periodicity of 1 observation")
    units <- "days"
    scale <- "yearly"
    label <- "year"
    if (p < 60) {
        units <- "secs"
        scale <- "seconds"
        label <- "second"
    }
    else if (p < 3600) {
        units <- "mins"
        scale <- "minute"
        label <- "minute"
        p <- p/60L
    }
    else if (p < 86400) {
        units <- "hours"
        scale <- "hourly"
        label <- "hour"
    }
    else if (p == 86400) {
        scale <- "daily"
        label <- "day"
    }
    else if (p <= 604800) {
        scale <- "weekly"
        label <- "week"
    }
    else if (p <= 2678400) {
        scale <- "monthly"
        label <- "month"
    }
    else if (p <= 7948800) {
        scale <- "quarterly"
        label <- "quarter"
    }
    structure(list(difftime = structure(p, units = units, class = "difftime"),
        frequency = p, start = start(x), end = end(x), units = units,
        scale = scale, label = label), class = "periodicity")
}
I think this line is the key, and I don't fully understand it:
p <- median(diff(.index(x)))
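As a rough illustration of what that line computes, here is a minimal Python sketch. It assumes a synthetic business-day index built with pd.bdate_range as a stand-in for the AAPL data, so nothing is downloaded, and the variable names are made up for the example. diff yields the gaps between consecutive timestamps and median picks the typical gap, so the occasional three-day weekend gap does not shift the result the way it shifts the mean.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the AAPL index: 20 business days, weekends skipped.
idx = pd.bdate_range("2010-01-04", periods=20)

# Gaps between consecutive timestamps (timedelta64 values).
gaps = np.diff(idx.values)

# Express the gaps in seconds, the unit the R thresholds are written in.
gap_seconds = gaps / np.timedelta64(1, "s")

median_gap = float(np.median(gap_seconds))  # typical gap
mean_gap = float(np.mean(gap_seconds))      # pulled upward by weekend gaps

print(median_gap)
print(mean_gap)
```

With this index the median gap is exactly one day (86400 seconds), which is the value the R code compares against its thresholds.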
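Answer 0: (score: 5)
This time series skips weekends (and holidays), so it does not really have a daily frequency to begin with. You can, however, use asfreq
to upsample it to a time series with a daily frequency: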
aapl = aapl.asfreq('D', method='ffill')
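This propagates the last observed value forward to the dates that have no value.
Note that pandas also has a business-day frequency, so you can also upsample to business days with: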
aapl = aapl.asfreq('B', method='ffill')
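To see what the forward fill does without downloading anything, here is a small sketch on a made-up three-point series (Thursday, Friday, Monday). The series and its values are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical series: Thursday, Friday, then the following Monday.
idx = pd.to_datetime(["2010-01-07", "2010-01-08", "2010-01-11"])
s = pd.Series([1.0, 2.0, 3.0], index=idx)

# Calendar-day frequency: Saturday and Sunday are created and take Friday's value.
daily = s.asfreq("D", method="ffill")

# Business-day frequency: no new dates are needed here, so nothing is filled.
business = s.asfreq("B", method="ffill")

print(daily)
print(business)
```

With 'D' the weekend rows appear and repeat Friday's value, while with 'B' the series keeps its three original rows.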
If you want to automate inferring the median frequency in days, you could do this:
import pandas as pd
import numpy as np
import pandas.io.data as web
aapl = web.get_data_yahoo("AAPL")
f = np.median(np.diff(aapl.index.values))
days = f.astype('timedelta64[D]').item().days
aapl = aapl.asfreq('{}D'.format(days), method='ffill')
print(aapl)
This code needs testing, but it may come close to the R code you posted:
import pandas as pd
import numpy as np
import pandas.io.data as web

def infer_freq(ts):
    # Median gap between consecutive timestamps, in whole seconds
    med = np.median(np.diff(ts.index.values))
    seconds = int(med.astype('timedelta64[s]').item().total_seconds())
    # Map the median gap to a pandas offset alias
    if seconds < 60:
        freq = '{}s'.format(seconds)
    elif seconds < 3600:
        freq = '{}T'.format(seconds//60)
    elif seconds < 86400:
        freq = '{}H'.format(seconds//3600)
    elif seconds < 604800:
        freq = '{}D'.format(seconds//86400)
    elif seconds < 2678400:
        freq = '{}W'.format(seconds//604800)
    elif seconds < 7948800:
        freq = '{}M'.format(seconds//2678400)
    else:
        freq = '{}Q'.format(seconds//7948800)
    # Resample to the inferred frequency, forward-filling the gaps
    return ts.asfreq(freq, method='ffill')

aapl = web.get_data_yahoo("AAPL")
print(infer_freq(aapl))
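Since that function is untested, a variant that only classifies the median gap may be easier to check. The sketch below copies the thresholds from the R source in the question; the function name periodicity_scale and the synthetic indexes are hypothetical, and it returns a label rather than resampling anything.

```python
import numpy as np
import pandas as pd

def periodicity_scale(index):
    """Return a scale label for a DatetimeIndex from its median gap.

    Hypothetical helper: the thresholds follow the R periodicity source
    quoted in the question; this is not a pandas API.
    """
    gaps = np.diff(index.values) / np.timedelta64(1, "s")  # gaps in seconds
    if len(gaps) == 0:
        raise ValueError("cannot calculate periodicity of 1 observation")
    p = float(np.median(gaps))
    if p < 60:
        return "seconds"
    elif p < 3600:
        return "minute"
    elif p < 86400:
        return "hourly"
    elif p == 86400:
        return "daily"
    elif p <= 604800:
        return "weekly"
    elif p <= 2678400:
        return "monthly"
    elif p <= 7948800:
        return "quarterly"
    return "yearly"

# Synthetic indexes, used instead of downloaded data.
print(periodicity_scale(pd.bdate_range("2010-01-04", periods=20)))
print(periodicity_scale(pd.date_range("2010-01-01", periods=12, freq="MS")))
```

Because the decision rests on the median, a business-day index is still labelled as daily even though its weekend gaps are three days long.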
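Answer 1: (score: 3)
I don't know about a frequency as such; the only meaningful measure I can come up with is the mean time delta, for example in days: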
>>> import numpy as np
>>> idx = aapl.index.values
>>> (np.roll(idx, -1) - idx)[:-1].mean()/np.timedelta64(1, 'D')
1.4478957915831596
Or in hours:
>>> (np.roll(idx, -1) - idx)[:-1].mean()/np.timedelta64(1, 'h')
34.749498997995836
The same as a more pandorable expression, kudos to @DSM:
>>> aapl.index.to_series().diff().mean() / (60*60*10**9)
34.749498997995993
Sure enough, the median is 24 hours, since most days are in the list:
>>> aapl.index.to_series().diff().median() / (60*60*10**9)
24.0
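In recent pandas versions the mean and median of a timedelta Series come back as Timedelta objects, so dividing by a raw nanosecond count may no longer give a plain float. A unit-safe sketch is to divide by a Timedelta instead. It uses a synthetic business-day index as an assumed stand-in for aapl.index, so the mean differs from the AAPL figure above.

```python
import pandas as pd

# Synthetic stand-in for aapl.index: 20 business days, weekends skipped.
idx = pd.bdate_range("2010-01-04", periods=20)

# Gaps between consecutive timestamps; the first entry is NaT and is ignored.
gaps = idx.to_series().diff()

# Timedelta divided by Timedelta gives a plain float in the chosen unit.
one_hour = pd.Timedelta(hours=1)
mean_hours = gaps.mean() / one_hour
median_hours = gaps.median() / one_hour

print(mean_hours)
print(median_hours)
```

The median comes out at 24 hours here as well, while the mean is larger because of the weekend gaps.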