Resampling data to monthly frequency in R or Python

Time: 2020-03-07 17:59:51

Tags: python r pandas data.table

I have data recorded in the following format:

Input

name             year            value 
Afghanistan      1800            68
Albania          1800            23
Algeria          1800            54

Afghanistan      1801            59
Albania          1801            38
Algeria          1801            72

---
Afghanistan      2040            142
Albania          2040            165
Algeria          2040            120

I want to resample all the records from 1800 through 2040 to a 1-month frequency, using exactly the format shown below:

Expected output

name             year            value 
Afghanistan      Jan 1800        5.6667  
Afghanistan      Feb 1800        11.3333    
Afghanistan      Mar 1800        17.0000    
Afghanistan      Apr 1800        22.6667    
Afghanistan      May 1800        28.3333    
Afghanistan      Jun 1800        34.0000    
Afghanistan      Jul 1800        39.6667    
Afghanistan      Aug 1800        45.3333    
Afghanistan      Sep 1800        51.0000
Afghanistan      Oct 1800        56.6667
Afghanistan      Nov 1800        62.3333
Afghanistan      Dec 1800        68.0000      
Albania          Jan 1800        1.9167
Albania          Feb 1800        3.8333
Albania          Mar 1800        5.7500
Albania          Apr 1800        7.6667
Albania          May 1800        9.5833
Albania          Jun 1800        11.5000
Albania          Jul 1800        13.4167
Albania          Aug 1800        15.3333
Albania          Sep 1800        17.2500
Albania          Oct 1800        19.1667
Albania          Nov 1800        21.0833
Albania          Dec 1800        23.0000
Algeria          Jan 1800        4.5000
Algeria          Feb 1800        9.0000
Algeria          Mar 1800        13.5000
Algeria          Apr 1800        18.0000
Algeria          May 1800        22.5000
Algeria          Jun 1800        27.0000
Algeria          Jul 1800        31.5000
Algeria          Aug 1800        36.0000
Algeria          Sep 1800        40.5000
Algeria          Oct 1800        45.0000
Algeria          Nov 1800        49.5000
Algeria          Dec 1800        54.000

I want my data to look like the above for every year (i.e. 1800-2040). The value column is interpolated. NB: my model accepts abbreviated month names.
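For reference, the first-year rows in the expected output appear to follow a simple ramp, value * m / 12 for month m = 1..12, so December carries the full annual value. The sketch below only checks that arithmetic; the helper name first_year_ramp is made up for illustration, and how later years should connect to the previous December is not shown in the question.

```python
MONTH_ABBR = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
              "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]


def first_year_ramp(annual_value):
    """Return 12 monthly values rising linearly to the annual value in December."""
    return [annual_value * m / 12 for m in range(1, 13)]


# Annual values for 1800, copied from the input table above
first_year = {"Afghanistan": 68, "Albania": 23, "Algeria": 54}

for name, value in first_year.items():
    for label, monthly in zip(MONTH_ABBR, first_year_ramp(value)):
        print(f"{name:<16} {label} 1800        {monthly:.4f}")
```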

My closest attempt is below, but it does not produce the expected result.

data['year'] = pd.to_datetime(data.year, format='%Y')
data.head(3)   
    name                year                value
Afghanistan         1800-01-01 00:00:00     68
Albania             1800-01-01 00:00:00     23
Algeria             1800-01-01 00:00:00     54  

resampled = (data.groupby(['name']).apply(lambda x: x.set_index('year').resample('M').interpolate()))  

resampled.head(3)

name        year                 name  value                   
Afghanistan 1800-01-31 00:00:00  NaN    NaN
            1800-02-28 00:00:00  NaN    NaN
            1800-03-31 00:00:00  NaN    NaN

Your ideas would save me here.
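A likely reason the attempt above returns only NaN: pd.to_datetime(..., format='%Y') stamps every observation on 1 January, while resample('M') targets month-end dates, and interpolating a resampled object first reindexes to those target dates. No month-end date coincides with 1 January, so there is nothing left to interpolate between. The sketch below reproduces that mismatch on a two-point series and shows that month-start stamps avoid it; the variable names are illustrative, and the result is anchored on January rather than on December as in the expected output.

```python
import pandas as pd

# Two annual observations stamped on 1 January, as in the attempt above
s = pd.Series([68.0, 59.0],
              index=pd.to_datetime(["1800-01-01", "1801-01-01"]))

# Month-end stamps never match 1 January, so reindexing keeps no data
month_ends = pd.to_datetime(["1800-01-31", "1800-02-28", "1800-03-31"])
on_month_ends = s.reindex(month_ends).interpolate()

# Month-start stamps include both observations, so interpolation has anchors
month_starts = pd.to_datetime(
    [f"1800-{m:02d}-01" for m in range(1, 13)] + ["1801-01-01"]
)
on_month_starts = s.reindex(month_starts).interpolate()

print(on_month_ends)           # every value is NaN
print(on_month_starts.head(3))
```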

2 Answers:

Answer 0: (score: 2)

Here is a tidyverse approach; the interpolation step also needs the zoo package.

library(dplyr)
library(tidyr)
library(zoo)

df <- data.frame(country = rep(c("Afghanistan", "Algeria"), each = 3),
                 year = rep(seq(1800,1802), times = 2),
                 value = rep(seq(3), times = 2),
                 stringsAsFactors = FALSE)

df2 <- df %>%
    # make a grid of all year/month possibilities within the years in df
    tidyr::expand(year, month = seq(12)) %>%
    # join that to the original data frame (on year) to add back country and value
    left_join(., df) %>%
    # put the result in chronological order
    arrange(country, year, month) %>%
    # group by country so the interpolation stays within those sets
    group_by(country) %>%
    # make a version of value that is NA except for Dec, then use na.approx to replace
    # the NAs with linearly interpolated values
    mutate(value_i = ifelse(month == 12, value, NA),
           value_i = zoo::na.approx(value_i, na.rm = FALSE))

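Note that the resulting column value_i is NA until the first valid observation (December of the first year). That is why the tail of df2 is shown here.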

> tail(df2)
# A tibble: 6 x 5
# Groups:   country [1]
   year month country value value_i
  <int> <int> <chr>   <int>   <dbl>
1  1802     7 Algeria     3    2.58
2  1802     8 Algeria     3    2.67
3  1802     9 Algeria     3    2.75
4  1802    10 Algeria     3    2.83
5  1802    11 Algeria     3    2.92
6  1802    12 Algeria     3    3 

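If you want to replace those leading NAs, you would have to extrapolate, which you can do with na.spline from zoo instead (note that it fits a cubic spline rather than a straight line). And if you would rather have the observed values in January instead of December, which gives trailing rather than leading NAs, just change the relevant bit of the second-to-last line to month == 1.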
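For readers working in pandas, the same steps (expand to a year/month grid, keep the annual value only in December, interpolate linearly within each country) can be sketched roughly as follows. This is an illustrative counterpart under the same toy data, not a translation checked against the R output beyond the tail values shown above; the name grid is an assumption, and merge(how="cross") needs pandas 1.2 or later.

```python
import numpy as np
import pandas as pd

# Same toy data as the R example: two countries, three years, values 1..3
df = pd.DataFrame({
    "country": np.repeat(["Afghanistan", "Algeria"], 3),
    "year": np.tile([1800, 1801, 1802], 2),
    "value": np.tile([1, 2, 3], 2),
})

# One row per country/year/month
grid = df.merge(pd.DataFrame({"month": range(1, 13)}), how="cross")
grid = grid.sort_values(["country", "year", "month"]).reset_index(drop=True)

# Keep the annual value only in December; every other month becomes NaN
grid["value_i"] = grid["value"].where(grid["month"] == 12).astype(float)

# Interpolate linearly within each country; NaNs before the first
# December have no left anchor and are left as NaN
grid["value_i"] = grid.groupby("country")["value_i"].transform(
    lambda s: s.interpolate(limit_area="inside")
)

print(grid.tail(6))
```

As in the R version, the first eleven months of each country stay missing because nothing precedes the first December observation.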

Answer 1: (score: 2)

Apart from the imputeTS package, which handles the interpolation and extrapolation, this solution uses only base R.

res <- do.call(rbind, by(dat, dat$name, function(x) {
  ## expanding years to year-months
  ex <- do.call(rbind, lapply(1:nrow(x), function(i) {
    yr <- x$year[i]
    data.frame(name=x$name[1],
               year=seq.Date(as.Date(ISOdate(yr, 1, 1)), 
                             as.Date(ISOdate(yr, 12, 31)), "month"),
               value=x$value[i])
  }))
  ## set values to NA except 01-01s
  ex[!grepl("01-01", ex$year), "value"] <- NA
  transform(ex,
            ## impute values linearly
            value=imputeTS::na_interpolation(ex$value),
            ## format dates for desired output
            year=strftime(ex$year, format="%b-%Y")
            )
}))

Result

res[c(1:3, 13:15, 133:135, 145:147, 265:268, 277:279), ]  ## sample rows
#      name     year    value
# A.1     A Jan-1800 71.00000
# A.2     A Feb-1800 73.08333
# A.3     A Mar-1800 75.16667
# A.13    A Jan-1801 96.00000
# A.14    A Feb-1801 93.75000
# A.15    A Mar-1801 91.50000
# B.1     B Jan-1800 87.00000
# B.2     B Feb-1800 83.08333
# B.3     B Mar-1800 79.16667
# B.13    B Jan-1801 40.00000
# B.14    B Feb-1801 40.50000
# B.15    B Mar-1801 41.00000
# C.1     C Jan-1800 47.00000
# C.2     C Feb-1800 49.00000
# C.3     C Mar-1800 51.00000
# C.4     C Apr-1800 53.00000
# C.13    C Jan-1801 71.00000
# C.14    C Feb-1801 72.83333
# C.15    C Mar-1801 74.66667

Data

set.seed(42)
dat <- transform(expand.grid(name=LETTERS[1:3],
                             year=1800:1810),
                 value=sample(23:120, 33, replace=TRUE))
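
A rough Python counterpart of this answer, for comparison: each annual value is placed on January, the months in between are filled linearly, and the dates are formatted like "Jan-1800". It is a sketch that assumes consecutive years per name and uses np.interp in place of imputeTS; np.interp holds the last value constant after the final January, which is how the months of the last year get filled here. The function name expand_to_months is made up for illustration, and the sample values are the January rows for 1800 and 1801 from the result printed above.

```python
import numpy as np
import pandas as pd

MONTH_ABBR = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
              "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]

# January values for 1800 and 1801, copied from the result shown above
dat = pd.DataFrame({
    "name": ["A", "B", "C"] * 2,
    "year": [1800] * 3 + [1801] * 3,
    "value": [71, 87, 47, 96, 40, 71],
})


def expand_to_months(group):
    """Expand one name's annual rows to monthly rows, anchored on January."""
    group = group.sort_values("year")
    years = group["year"].to_numpy()
    first_year = int(years[0])
    # Month axis counted from January of the first year (0 = Jan)
    jan_pos = (years - first_year) * 12
    pos = np.arange(0, jan_pos[-1] + 12)
    # Linear interpolation between Januaries; constant after the last one
    value = np.interp(pos, jan_pos, group["value"].to_numpy(dtype=float))
    labels = [f"{MONTH_ABBR[p % 12]}-{first_year + p // 12}" for p in pos.tolist()]
    return pd.DataFrame({"name": group["name"].iloc[0],
                         "year": labels,
                         "value": value})


res = pd.concat([expand_to_months(g) for _, g in dat.groupby("name")],
                ignore_index=True)
print(res.head(3))
```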