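I want to join two datasets by adding a new column named Average. This column is the mean of Duration over the days from Date - diff to Date. I have two datasets. The first is called data and looks like this: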
Date Weight diff Loc.nr
2013-01-24 1040 7 2
2013-01-31 1000 7 2
2013-01-19 500 4 9
2013-01-23 1040 4 9
2013-01-28 415 5 9
2013-01-31 650 3 9
The other is called Rain.duration. Its Duration column holds the number of hours it rained on that day. This dataset looks like this:
Date Duration
2013-01-14 4.5
2013-01-15 0.0
2013-01-16 6.9
2013-01-17 0.0
2013-01-18 1.8
2013-01-19 2.1
2013-01-20 0.0
2013-01-21 0.0
2013-01-22 4.3
2013-01-23 0.0
2013-01-24 7.5
2013-01-25 4.7
2013-01-26 0.0
2013-01-27 0.7
2013-01-28 5.0
2013-01-29 0.0
2013-01-30 3.1
2013-01-31 2.8
I wrote some code to do this:
for(i in 1:nrow(data)) {
  for(j in 1:nrow(Rain.duration)) {
    if(data$Date[i] == Rain.duration$Date[j]) {
      average <- as.array(Rain.duration$Duration[(j-(data$diff[i])):j])
      j <- nrow(Rain.duration)
    }
  }
  data$Average[i] <- mean(average)
}
The problem with this code is that, because of the size of my dataset, it takes 3 days to run. Is there a faster way?
My expected result is:
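A side note on the loop above, offered as an observation rather than a measured diagnosis: in R, assigning to the loop variable inside a for loop does not change which values the loop visits, so the line j <- nrow(Rain.duration) does not end the inner loop early. A minimal sketch (the vector names are illustrative only):

```r
# Attempt to skip ahead by overwriting the loop variable
visited <- integer(0)
for (j in 1:5) {
  visited <- c(visited, j)
  if (j == 2) j <- 5   # j is reassigned from the sequence on the next iteration
}
visited        # 1 2 3 4 5

# break is what actually leaves the loop
visited_break <- integer(0)
for (j in 1:5) {
  visited_break <- c(visited_break, j)
  if (j == 2) break
}
visited_break  # 1 2
```

So the inner loop scans every row of Rain.duration for every row of data, whether or not a match was already found.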
Date Weight diff Loc.nr Average
2013-01-24 1040 7 2 1.96
2013-01-31 1000 7 2 2.98
2013-01-19 500 4 9 2.16
2013-01-23 1040 4 9 1.28
2013-01-28 415 5 9 2.98
2013-01-31 650 3 9 2.73
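To make the definition concrete, the first expected value can be checked by hand in base R. This is only a sketch: the name rain and the variables below are illustrative, and the values are copied from the Rain.duration table above.

```r
# Daily rain hours from the Rain.duration table, 2013-01-14 through 2013-01-31
rain <- data.frame(
  Date = seq(as.Date("2013-01-14"), as.Date("2013-01-31"), by = "day"),
  Duration = c(4.5, 0.0, 6.9, 0.0, 1.8, 2.1, 0.0, 0.0, 4.3,
               0.0, 7.5, 4.7, 0.0, 0.7, 5.0, 0.0, 3.1, 2.8)
)

# First row of data: Date = 2013-01-24, diff = 7
end_date <- as.Date("2013-01-24")
window <- rain$Duration[rain$Date >= end_date - 7 & rain$Date <= end_date]

length(window)  # 8, because both endpoints are included
mean(window)    # 1.9625, shown as 1.96 in the expected result
```

Note that the window covers diff + 1 days, from 2013-01-17 through 2013-01-24.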
Answer 0 (score: 0)
Here is a dplyr solution:
library(dplyr)
# add row number as a new column just to make it easier to read
weather_with_rows <- Rain.duration %>%
  mutate(Rownum = row_number())
# write function to filter by row number, then return the average duration
getavgduration <- function(mydate, mydiff) {
  # pull() is dplyr's own column extractor; pluck() would need purrr
  myrow <- weather_with_rows %>%
    filter(Date == mydate) %>%
    pull(Rownum)
  mystartrow <- myrow - mydiff
  myduration <- weather_with_rows %>%
    filter(Rownum <= myrow, Rownum >= mystartrow)
  mean(myduration$Duration)
}
# get the average duration for each Date/diff pair
averages <- data %>%
  group_by(Date, diff) %>%
  # first() passes a single value even if a Date/diff pair occurs more than once
  summarize(Average = getavgduration(first(Date), first(diff))) %>%
  ungroup()
# join this back into the original data frame
# this step might not be necessary
# and might be a big drag on performance,
# depending on the size of your real data
data_with_avg_duration <- data %>%
  left_join(averages, by = c('Date', 'diff'))
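Answer 1 (score: 0)
This old question has no accepted answer yet, so I feel obliged to post an alternative solution that aggregates in a non-equi join.
The OP has asked to compute the average duration of rain, taken from the table of daily rain hours Rain.duration, for each date interval given in data.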
library(data.table)
# make sure Date columns are of class Date
setDT(data)[, Date := as.Date(Date)]
setDT(Rain.duration)[, Date := as.Date(Date)]
# aggregate in a non-equi join and assign the result to a new column
data[, Average := Rain.duration[data[, .(upper = Date, lower = Date - diff)],
                                on = .(Date <= upper, Date >= lower),
                                mean(Duration), by = .EACHI]$V1][]
         Date Weight diff Loc.nr  Average
1: 2013-01-24   1040    7      2 1.962500
2: 2013-01-31   1000    7      2 2.975000
3: 2013-01-19    500    4      9 2.160000
4: 2013-01-23   1040    4      9 1.280000
5: 2013-01-28    415    5      9 2.983333
6: 2013-01-31    650    3      9 2.725000
The key part is
Rain.duration[data[, .(upper = Date, lower = Date - diff)],
              on = .(Date <= upper, Date >= lower),
              mean(Duration), by = .EACHI]
         Date       Date       V1
1: 2013-01-24 2013-01-17 1.962500
2: 2013-01-31 2013-01-24 2.975000
3: 2013-01-19 2013-01-15 2.160000
4: 2013-01-23 2013-01-19 1.280000
5: 2013-01-28 2013-01-23 2.983333
6: 2013-01-28 2013-01-23 2.983333
7: 2013-01-31 2013-01-28 2.725000
This is a non-equi join that uses the date ranges derived from data:
data[, .(upper = Date, lower = Date - diff)]
        upper      lower
1: 2013-01-24 2013-01-17
2: 2013-01-31 2013-01-24
3: 2013-01-19 2013-01-15
4: 2013-01-23 2013-01-19
5: 2013-01-28 2013-01-23
6: 2013-01-28 2013-01-23
7: 2013-01-31 2013-01-28
by = .EACHI requests that the aggregate mean(Duration) be computed for each date interval on the fly, which avoids creating and copying temporary subsets.
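To see by = .EACHI in isolation, here is a minimal sketch on toy data. The tables rain and ranges and their values are made up for illustration and are not the OP's data.

```r
library(data.table)

# Toy lookup table (hypothetical values)
rain <- data.table(Date = as.Date("2013-01-01") + 0:4,
                   Duration = c(1, 2, 3, 4, 5))

# Two date intervals to aggregate over
ranges <- data.table(upper = as.Date(c("2013-01-03", "2013-01-05")),
                     lower = as.Date(c("2013-01-01", "2013-01-04")))

# by = .EACHI evaluates mean(Duration) once per row of `ranges`,
# over the rows of `rain` that fall inside that row's interval
res <- rain[ranges, on = .(Date <= upper, Date >= lower),
            mean(Duration), by = .EACHI]
res$V1  # 2.0 4.5
```

The result has one row per interval: the first covers the durations 1, 2, 3 and the second covers 4, 5.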
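Note that this solution gives the correct answer even if Rain.duration has gaps or is unordered, because it relies only on Date rather than on row numbers as the other solutions do.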
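The difference between date-based and row-based windows can be illustrated in base R. This is a sketch only: avg_by_date and avg_by_row are hypothetical helpers written for this illustration (avg_by_row mirrors the indexing used in the question's loop), and the gap is created artificially by dropping one day.

```r
rain <- data.frame(
  Date = seq(as.Date("2013-01-14"), as.Date("2013-01-31"), by = "day"),
  Duration = c(4.5, 0.0, 6.9, 0.0, 1.8, 2.1, 0.0, 0.0, 4.3,
               0.0, 7.5, 4.7, 0.0, 0.7, 5.0, 0.0, 3.1, 2.8)
)

# Window selected by date: [end_date - n_days, end_date]
avg_by_date <- function(tbl, end_date, n_days) {
  in_window <- tbl$Date >= end_date - n_days & tbl$Date <= end_date
  mean(tbl$Duration[in_window])
}

# Window selected by row position, as in the question's loop
avg_by_row <- function(tbl, end_date, n_days) {
  j <- which(tbl$Date == end_date)
  mean(tbl$Duration[(j - n_days):j])
}

end_date <- as.Date("2013-01-28")
reversed <- rain[nrow(rain):1, ]                      # unordered table
gapped   <- rain[rain$Date != as.Date("2013-01-26"), ]  # one day missing

avg_by_date(rain, end_date, 5)      # 2.983333
avg_by_date(reversed, end_date, 5)  # 2.983333, row order does not matter
avg_by_date(gapped, end_date, 5)    # 3.58, mean of the 5 days present in the window
avg_by_row(gapped, end_date, 5)     # 3.7, the row window reaches back to 2013-01-22
```

With the gap, the row-based window silently pulls in a day outside the requested interval, while the date-based window averages only the days that actually fall inside it.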