How do I join a dataset with the mean over a date range from another dataset in R?

Asked: 2017-11-15 17:05:37

Tags: r date merge dataset

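I want to join two datasets by adding a new column named Average. This column is the mean of Duration over the days from Date - diff to Date. I have two datasets. The first is called data and looks like this: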

       Date   Weight   diff   Loc.nr  
2013-01-24     1040       7        2
2013-01-31     1000       7        2
2013-01-19      500       4        9
2013-01-23     1040       4        9
2013-01-28      415       5        9
2013-01-31      650       3        9

The other is called Rain.duration. Its Duration column holds the number of hours it rained on that day. This dataset looks like this:

      Date  Duration
2013-01-14       4.5
2013-01-15       0.0
2013-01-16       6.9
2013-01-17       0.0
2013-01-18       1.8
2013-01-19       2.1
2013-01-20       0.0
2013-01-21       0.0
2013-01-22       4.3
2013-01-23       0.0
2013-01-24       7.5
2013-01-25       4.7
2013-01-26       0.0
2013-01-27       0.7
2013-01-28       5.0
2013-01-29       0.0
2013-01-30       3.1
2013-01-31       2.8

I wrote the following code to do this:

for(i in 1:nrow(data)) {
  for(j in 1:nrow(Rain.duration)) {
    if(data$Date[i] == Rain.duration$Date[j]) {
      average <- as.array(Rain.duration$Duration[(j-(data$diff[i])):j])

      j <- nrow(Rain.duration)
    }
  }
  data$Average[i] <- mean(average)
}

The problem with this code is that, because of the size of my datasets, it takes 3 days to run. Is there a faster way?

My expected result is:

       Date   Weight   diff   Loc.nr   Average
2013-01-24     1040       7        2      1.96
2013-01-31     1000       7        2      2.98
2013-01-19      500       4        9      2.16
2013-01-23     1040       4        9      1.28
2013-01-28      415       5        9      2.98
2013-01-31      650       3        9      2.73

2 Answers:

Answer 0: (score: 0)

Here is a dplyr solution:

library(dplyr)

# add row number as a new column just to make it easier to read
weather_with_rows  <- Rain.duration %>%
    mutate(Rownum = row_number())

# write function to filter by row number, then return the average duration
getavgduration  <- function(mydate, mydiff) {

    myrow = weather_with_rows %>%
         filter(Date == mydate) %>%
         pull(Rownum)

    mystartrow = myrow -mydiff

    myduration = weather_with_rows %>%
        filter(
              Rownum <= myrow
            , Rownum >= mystartrow
        )

    mean(myduration$Duration)

}

# get the average duration for each Date/diff pair
#    (first() keeps the arguments scalar if a pair occurs more than once)
averages  <- data %>%
    group_by(Date, diff) %>%
    summarize(Average = getavgduration(first(Date), first(diff))) %>%
    ungroup()


# join this back into the original data frame
#    this step might not be necessary 
#    and might be a big drag on performance, 
#    depending on the size of your real data
data_with_avg_duration  <- data %>%
    left_join(averages, by = c('Date','diff'))
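
The row-number idea above can also be vectorized with prefix sums. The following is only a minimal base R sketch, not part of the original answer. It assumes that Rain.duration has exactly one row per day in ascending date order, that every Date in data occurs in Rain.duration, and that no window starts before the first row. The names cs, end_row, and start_row are illustrative.

```r
# Sample data copied from the question
data <- data.frame(
  Date   = as.Date(c("2013-01-24", "2013-01-31", "2013-01-19",
                     "2013-01-23", "2013-01-28", "2013-01-31")),
  Weight = c(1040, 1000, 500, 1040, 415, 650),
  diff   = c(7, 7, 4, 4, 5, 3),
  Loc.nr = c(2, 2, 9, 9, 9, 9)
)
Rain.duration <- data.frame(
  Date     = seq(as.Date("2013-01-14"), as.Date("2013-01-31"), by = "day"),
  Duration = c(4.5, 0.0, 6.9, 0.0, 1.8, 2.1, 0.0, 0.0, 4.3,
               0.0, 7.5, 4.7, 0.0, 0.7, 5.0, 0.0, 3.1, 2.8)
)

# Prefix sums with a leading 0, so that the sum of rows start..end
# equals cs[end + 1] - cs[start]
cs <- c(0, cumsum(Rain.duration$Duration))

# Row position of each window end and window start
# (assumption: consecutive daily rows, so "diff days back" is "diff rows back")
end_row   <- match(data$Date, Rain.duration$Date)
start_row <- end_row - data$diff

# Each window covers diff + 1 days, both endpoints included
data$Average <- (cs[end_row + 1] - cs[start_row]) / (data$diff + 1)
```

This avoids one filter per row, but like the dplyr version it depends on row positions, so it gives wrong results if days are missing or the rows are unsorted.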

Answer 1: (score: 0)

This old question has no accepted answer yet, so I feel obliged to post an alternative solution that aggregates in a non-equi join.

The OP has asked to compute the average duration of rain, from the table of daily rain hours in Rain.duration, for each date interval given in data.

library(data.table)
# make sure Date columns are of class Date
setDT(data)[, Date := as.Date(Date)]
setDT(Rain.duration)[, Date := as.Date(Date)]
# aggregate in a non-equi join and assign the result to a new column
data[,  Average := Rain.duration[data[, .(upper = Date, lower = Date - diff)], 
            on = .(Date <= upper, Date >= lower), 
            mean(Duration), by  = .EACHI]$V1][]
         Date Weight diff Loc.nr  Average
1: 2013-01-24   1040    7      2 1.962500
2: 2013-01-31   1000    7      2 2.975000
3: 2013-01-19    500    4      9 2.160000
4: 2013-01-23   1040    4      9 1.280000
5: 2013-01-28    415    5      9 2.983333
6: 2013-01-31    650    3      9 2.725000

The key part is

Rain.duration[data[, .(upper = Date, lower = Date - diff)], 
              on = .(Date <= upper, Date >= lower), 
              mean(Duration), by  = .EACHI]
         Date       Date       V1
1: 2013-01-24 2013-01-17 1.962500
2: 2013-01-31 2013-01-24 2.975000
3: 2013-01-19 2013-01-15 2.160000
4: 2013-01-23 2013-01-19 1.280000
5: 2013-01-28 2013-01-23 2.983333
6: 2013-01-31 2013-01-28 2.725000

The non-equi join uses the date ranges derived from data:

data[, .(upper = Date, lower = Date - diff)]
        upper      lower
1: 2013-01-24 2013-01-17
2: 2013-01-31 2013-01-24
3: 2013-01-19 2013-01-15
4: 2013-01-23 2013-01-19
5: 2013-01-28 2013-01-23
6: 2013-01-31 2013-01-28

by = .EACHI requests that the aggregate mean(Duration) be computed on the fly for each date interval, which avoids creating and copying temporary subsets.

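Note that this solution gives the correct answer even if Rain.duration has gaps or is unordered, because it relies only on Date, unlike the other solutions, which use row numbers.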
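
The point about gaps and ordering can be checked with a small experiment. The sketch below is illustrative only: it drops one day from the sample data and shuffles the rows (rain_messy, join_avg, and ref_avg are names made up for this example), then compares the non-equi join against a plain, slower date filter.

```r
library(data.table)

# Sample data copied from the question (only the columns needed here)
data <- data.table(
  Date = as.Date(c("2013-01-24", "2013-01-31", "2013-01-19",
                   "2013-01-23", "2013-01-28", "2013-01-31")),
  diff = c(7, 7, 4, 4, 5, 3)
)
Rain.duration <- data.table(
  Date     = seq(as.Date("2013-01-14"), as.Date("2013-01-31"), by = "day"),
  Duration = c(4.5, 0.0, 6.9, 0.0, 1.8, 2.1, 0.0, 0.0, 4.3,
               0.0, 7.5, 4.7, 0.0, 0.7, 5.0, 0.0, 3.1, 2.8)
)

# Remove one day and shuffle the rows to mimic a gap and disorder
set.seed(1)
rain_messy <- Rain.duration[Date != as.Date("2013-01-20")][sample(.N)]

# Same non-equi join as above, applied to the messy table
join_avg <- rain_messy[data[, .(upper = Date, lower = Date - diff)],
                       on = .(Date <= upper, Date >= lower),
                       mean(Duration), by = .EACHI]$V1

# Reference: filter by date for each row of data (simple, but slow on large data)
ref_avg <- vapply(seq_len(nrow(data)), function(i) {
  hi <- data$Date[i]
  lo <- hi - data$diff[i]
  mean(rain_messy$Duration[rain_messy$Date >= lo & rain_messy$Date <= hi])
}, numeric(1))

# TRUE when both approaches agree within numeric tolerance
isTRUE(all.equal(join_avg, ref_avg))
```

With a missing day, each mean is taken over the days actually present in the interval, which may or may not be what is wanted for a given analysis.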