I have data recorded in the following format:
Input
name year value
Afghanistan 1800 68
Albania 1800 23
Algeria 1800 54
Afghanistan 1801 59
Albania 1801 38
Algeria 1801 72
---
Afghanistan 2040 142
Albania 2040 165
Algeria 2040 120
I want to resample the records at a one-month frequency for every year from 1800 to 2040, using exactly the format shown below:
Expected output
name year value
Afghanistan Jan 1800 5.6667
Afghanistan Feb 1800 11.3333
Afghanistan Mar 1800 17.0000
Afghanistan Apr 1800 22.6667
Afghanistan May 1800 28.3333
Afghanistan Jun 1800 34.0000
Afghanistan Jul 1800 39.6667
Afghanistan Aug 1800 45.3333
Afghanistan Sep 1800 51.0000
Afghanistan Oct 1800 56.6667
Afghanistan Nov 1800 62.3333
Afghanistan Dec 1800 68.0000
Albania Jan 1800 1.9167
Albania Feb 1800 3.8333
Albania Mar 1800 5.7500
Albania Apr 1800 7.6667
Albania May 1800 9.5833
Albania Jun 1800 11.5000
Albania Jul 1800 13.4167
Albania Aug 1800 15.3333
Albania Sep 1800 17.2500
Albania Oct 1800 19.1667
Albania Nov 1800 21.0833
Albania Dec 1800 23.0000
Algeria Jan 1800 4.5000
Algeria Feb 1800 9.0000
Algeria Mar 1800 13.5000
Algeria Apr 1800 18.0000
Algeria May 1800 22.5000
Algeria Jun 1800 27.0000
Algeria Jul 1800 31.5000
Algeria Aug 1800 36.0000
Algeria Sep 1800 40.5000
Algeria Oct 1800 45.0000
Algeria Nov 1800 49.5000
Algeria Dec 1800 54.0000
I want my data to look like the above for every year (i.e. 1800-2040). The value column is interpolated. NB: my model accepts abbreviated month names.
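The first-year figures in the expected output are consistent with a straight line from 0 up to the annual value reached in December, that is value * m / 12 for month m. A quick arithmetic check, as a sketch only: the helper name monthly_ramp is made up for illustration, and how later years should be joined up is not shown in the question, so this covers the first year only.

```python
def monthly_ramp(annual_value, months=12):
    """Values on a straight line from 0 to annual_value, one per month."""
    return [round(annual_value * m / months, 4) for m in range(1, months + 1)]


# Annual values for 1800, taken from the input table above.
afghanistan_1800 = monthly_ramp(68)
albania_1800 = monthly_ramp(23)
algeria_1800 = monthly_ramp(54)

print(afghanistan_1800[:3])  # Jan, Feb, Mar for Afghanistan
print(albania_1800[:3])      # Jan, Feb, Mar for Albania
print(algeria_1800[:3])      # Jan, Feb, Mar for Algeria
```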
My closest attempt is below, but it did not produce the expected result.
data['year'] = pd.to_datetime(data.year, format='%Y')
data.head(3)
name year value
Afghanistan 1800-01-01 00:00:00 68
Albania 1800-01-01 00:00:00 23
Algeria 1800-01-01 00:00:00 54
resampled = (data.groupby(['name']).apply(lambda x: x.set_index('year').resample('M').interpolate()))
resampled.head(3)
name year name value
Afghanistan 1800-01-31 00:00:00 NaN NaN
1800-02-28 00:00:00 NaN NaN
1800-03-31 00:00:00 NaN NaN
Any ideas would be a great help.
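One likely reason the attempt returns only NaN (an assumption, since the pandas version used is not stated): resample('M') labels each bin with the month end, so after upsampling none of the new labels coincides with the original 1 January timestamps and nothing is left to interpolate between. A minimal pandas sketch that sidesteps the date aliases by counting months as integers is given here. The helper to_monthly, the MONTHS list, and the zero starting point in the December before the first year are illustrative assumptions, chosen so that the first year reproduces the ramp in the expected output.

```python
import numpy as np
import pandas as pd

MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]

# Small stand-in for the real table (the first two years from the question).
data = pd.DataFrame({
    "name": ["Afghanistan", "Albania", "Algeria"] * 2,
    "year": [1800] * 3 + [1801] * 3,
    "value": [68, 23, 54, 59, 38, 72],
})


def to_monthly(name, group):
    """Linearly interpolate one country's annual values onto a monthly grid."""
    first, last = int(group["year"].min()), int(group["year"].max())
    # Count months from January of year 0, so December of year y is y*12 + 11.
    anchors = pd.Series(group["value"].to_numpy(dtype=float),
                        index=group["year"].to_numpy() * 12 + 11)
    # ASSUMPTION: the series starts from 0 in the December before the first year.
    anchors.loc[first * 12 - 1] = 0.0
    grid = np.arange(first * 12 - 1, last * 12 + 12)
    # Reindex to every month, fill the gaps linearly, drop the artificial zero.
    monthly = anchors.reindex(grid).interpolate(method="linear").iloc[1:]
    labels = [f"{MONTHS[m % 12]} {m // 12}" for m in monthly.index.to_numpy()]
    return pd.DataFrame({"name": name, "year": labels,
                         "value": monthly.to_numpy().round(4)})


result = pd.concat(
    [to_monthly(name, group) for name, group in data.groupby("name")],
    ignore_index=True,
)
print(result.head(12).to_string(index=False))
```

Each annual value is treated here as the December observation, so every later year is a straight line from the previous December to the current one.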
Answer 0: (score: 2)
Here is a tidyverse approach; the interpolation part also requires the zoo package.
library(dplyr)
library(tidyr)
library(zoo)
df <- data.frame(country = rep(c("Afghanistan", "Algeria"), each = 3),
                 year = rep(seq(1800,1802), times = 2),
                 value = rep(seq(3), times = 2),
                 stringsAsFactors = FALSE)
df2 <- df %>%
  # make a grid of all year/month combinations within the years in df
  tidyr::expand(year, month = seq(12)) %>%
  # join that to the original data frame (on year) to add back country and value
  left_join(., df) %>%
  # put the result in chronological order
  arrange(country, year, month) %>%
  # group by country so the interpolation stays within those sets
  group_by(country) %>%
  # make a version of value that is NA except for Dec, then use na.approx to replace
  # the NAs with linearly interpolated values
  mutate(value_i = ifelse(month == 12, value, NA),
         value_i = zoo::na.approx(value_i, na.rm = FALSE))
Note that the resulting column value_i is NA until the first valid observation (December of the first year). Here is what the tail of df2 looks like.
> tail(df2)
# A tibble: 6 x 5
# Groups: country [1]
year month country value value_i
<int> <int> <chr> <int> <dbl>
1 1802 7 Algeria 3 2.58
2 1802 8 Algeria 3 2.67
3 1802 9 Algeria 3 2.75
4 1802 10 Algeria 3 2.83
5 1802 11 Algeria 3 2.92
6 1802 12 Algeria 3 3
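If you want to replace those leading NAs, you would have to extrapolate, which na.spline from zoo can do (note that it fits a cubic spline rather than a straight line). And if you would rather have the observed values in January instead of December, which gives trailing rather than leading NAs, just change the relevant bit of the second-to-last line to month == 1.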
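For readers working in pandas, as in the question, the same December-anchored idea can be sketched as follows. This is a rough translation under assumptions, not this answer's own code: the function name monthly_interp and its anchor_month parameter are made up for illustration, merge(how="cross") needs pandas 1.2 or later, and limit_area="inside" is used so that NaN outside the observed range stays NaN, similar to na.rm = FALSE above.

```python
import numpy as np
import pandas as pd

# Same toy data as the R example: two countries, three years, values 1..3.
df = pd.DataFrame({
    "country": np.repeat(["Afghanistan", "Algeria"], 3),
    "year": np.tile([1800, 1801, 1802], 2),
    "value": np.tile([1, 2, 3], 2),
})


def monthly_interp(df, anchor_month=12):
    """Expand annual rows to months and interpolate linearly within each country."""
    # Cross join: every (country, year) row is repeated for months 1..12.
    months = pd.DataFrame({"month": range(1, 13)})
    grid = df.merge(months, how="cross")
    grid = grid.sort_values(["country", "year", "month"]).reset_index(drop=True)
    # Keep the observed value only in the anchor month; everything else is NaN.
    grid["value_i"] = grid["value"].where(grid["month"] == anchor_month)
    # Fill only the NaN that lie between two observations of the same country.
    grid["value_i"] = grid.groupby("country")["value_i"].transform(
        lambda s: s.interpolate(method="linear", limit_area="inside")
    )
    return grid


dec = monthly_interp(df)
print(dec.tail(6))
```

Calling monthly_interp(df, anchor_month=1) anchors the observations in January instead, which moves the NaN from the start of each country's series to the end.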
Answer 1: (score: 2)
Apart from the imputeTS package for interpolation and extrapolation, I use only base R in this solution.
res <- do.call(rbind, by(dat, dat$name, function(x) {
  ## expanding years to year-months
  ex <- do.call(rbind, lapply(1:nrow(x), function(i) {
    yr <- x$year[i]
    data.frame(name=x$name[1],
               year=seq.Date(as.Date(ISOdate(yr, 1, 1)),
                             as.Date(ISOdate(yr, 12, 31)), "month"),
               value=x$value[i])
  }))
  ## set values to NA except 01-01s
  ex[!grepl("01-01", ex$year), "value"] <- NA
  transform(ex,
            ## impute values linearly
            value=imputeTS::na_interpolation(ex$value),
            ## format dates for desired output
            year=strftime(ex$year, format="%b-%Y")
  )
}))
res[c(1:3, 13:15, 133:135, 145:147, 265:268, 277:279), ] ## sample rows
# name year value
# A.1 A Jan-1800 71.00000
# A.2 A Feb-1800 73.08333
# A.3 A Mar-1800 75.16667
# A.13 A Jan-1801 96.00000
# A.14 A Feb-1801 93.75000
# A.15 A Mar-1801 91.50000
# B.1 B Jan-1800 87.00000
# B.2 B Feb-1800 83.08333
# B.3 B Mar-1800 79.16667
# B.13 B Jan-1801 40.00000
# B.14 B Feb-1801 40.50000
# B.15 B Mar-1801 41.00000
# C.1 C Jan-1800 47.00000
# C.2 C Feb-1800 49.00000
# C.3 C Mar-1800 51.00000
# C.4 C Apr-1800 53.00000
# C.13 C Jan-1801 71.00000
# C.14 C Feb-1801 72.83333
# C.15 C Mar-1801 74.66667
Data
set.seed(42)
dat <- transform(expand.grid(name=LETTERS[1:3],
                             year=1800:1810),
                 value=sample(23:120, 33, replace=TRUE))
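A January-anchored version of the same idea in pandas, the library used in the question, might look like the sketch below. It is an illustration under assumptions rather than a port of the R code: the 1800 and 1801 values are copied from the January rows of the sample output above, the "MS" (month start) frequency is used so that the monthly grid lines up with 1 January, and the month abbreviations produced by strftime depend on the locale.

```python
import pandas as pd

# January values for 1800 and 1801, copied from the sample rows printed above.
dat = pd.DataFrame({
    "name": ["A", "B", "C"] * 2,
    "year": [1800] * 3 + [1801] * 3,
    "value": [71, 87, 47, 96, 40, 71],
})

frames = []
for name, group in dat.groupby("name"):
    first, last = int(group["year"].min()), int(group["year"].max())
    # One timestamp per month, from January of the first year to December of the last.
    months = pd.date_range(start=f"{first}-01-01", end=f"{last}-12-01", freq="MS")
    # Look up each month's annual value, then keep it only in January.
    by_year = group.set_index("year")["value"]
    values = pd.Series(by_year.reindex(months.year).to_numpy(dtype=float), index=months)
    values = values.where(months.month == 1)
    # Linear fill between Januaries; by default pandas also carries the last
    # observed value forward into the trailing months of the final year.
    values = values.interpolate(method="linear")
    frames.append(pd.DataFrame({"name": name,
                                "year": months.strftime("%b-%Y"),
                                "value": values.to_numpy()}))

res = pd.concat(frames, ignore_index=True)
print(res.head(3))
```

With these inputs the February 1800 values for A and B agree with the sample rows above; the months after January 1801 simply repeat the 1801 value because this toy table has no 1802 observation to interpolate towards.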