This is hard to describe (hence the vague title), so I will just give some sample data for what I want. I have two data frames:
set.seed(5)
df = data.frame(ID = sort(rep(c("2006_01", "2006_02", "2006_03", "2006_04"), length.out = 100)),
increment = rep(seq(from = 1, to = 25, by = 1), length.out = 100),
Var1 = rnorm(100))
and
set.seed(23)
df2 = data.frame(ID = sort(rep(c("2006_01", "2006_02", "2006_03", "2006_04"), length.out = 200)),
distance = rep(seq(from = 1, to = 50, by = 1), length.out = 200),
Var2 = rnorm(200))
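Both are essentially transects, where $increment and $distance measure the distance along the transect. The longer one has twice as many measurements because it has a higher resolution, but it represents the same transect. I want to join them based on proportional distance along the transect. I computed a proportion column for each with the following code: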
library(plyr) # provides ddply
df = ddply(df, "ID", transform, proportion = increment/max(increment))
and
df2 = ddply(df2, "ID", transform, proportion = distance/max(distance))
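For readers without plyr, the same per-transect proportion can be sketched in base R. The toy data frame below is hypothetical (it is not the question's data) and is deliberately irregular; ave() applies FUN within each ID and returns a vector aligned with the original rows.

```r
# Hypothetical stand-in for df, with irregular spacing within each transect
toy <- data.frame(
  ID        = c("A", "A", "A", "B", "B"),
  increment = c(1, 2, 4, 5, 10)
)

# Divide each increment by the maximum increment of its own transect
toy$proportion <- toy$increment / ave(toy$increment, toy$ID, FUN = max)
print(toy)
```

The result should match what the ddply(..., transform, ...) calls produce, one proportion per row and scaled separately for each ID.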
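My desired output is a join of the two data frames (keeping Var1 and Var2). Because df2 has the higher resolution, the only way to join it to df is to average it. My real data are not as regular as the sample data, so I need to do this based on the proportion column, so that the mean of the df2 values falling within each increment of df is calculated.
To summarize, I am trying to join datasets with different resolutions by calculating the mean of the higher-resolution dataset over all points that fall within the resolution of the lower-resolution set.
-edit to add the desired output-
The first few rows of the output data frame would look like this:
ID      increment Var1        Var2
2006_01 1         -0.84085548 -0.1207349
2006_01 2          1.38435934  1.353328
2006_01 3         -1.25549186  1.052048
2006_01 4          0.07014277  0.3705596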
Answer 0 (score: 1)
One approach is to use a few of the tidyverse packages.
library(dplyr)
library(tidyr) # nest, unnest
library(purrr) # pmap
Because the two data.frames have different dimensions, I found it convenient to nest the data into a list column.
df2 <- df2 %>%
group_by(ID) %>%
mutate(
proportion = (distance - min(distance)) / diff(range(distance))
) %>%
nest(.key = "dist")
df2
# # A tibble: 4 × 2
# ID dist
# <fctr> <list>
# 1 2006_01 <tibble [50 × 3]>
# 2 2006_02 <tibble [50 × 3]>
# 3 2006_03 <tibble [50 × 3]>
# 4 2006_04 <tibble [50 × 3]>
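Note that the proportion computed here is a min-max rescaling, which differs slightly from the x / max(x) version in the question. A minimal sketch of the difference, using one transect's worth of the sample positions (the variable names are illustrative only):

```r
distance  <- 1:50   # positions of one df2 transect in the sample data
increment <- 1:25   # positions of one df transect in the sample data

# Question's scaling: only the far end of each transect lines up (at 1)
q_dist <- distance / max(distance)
q_incr <- increment / max(increment)

# Min-max scaling: both transects start at 0 and end at 1
a_dist <- (distance - min(distance)) / diff(range(distance))
a_incr <- (increment - min(increment)) / diff(range(increment))

print(rbind(question_dist = range(q_dist), question_incr = range(q_incr),
            minmax_dist   = range(a_dist), minmax_incr   = range(a_incr)))
```

With x / max(x), the first points land at 0.02 and 0.04, so the two transects start at slightly different positions; with min-max scaling both start at exactly 0.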
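# Condensed version of the steps explained below:
# nest df, join to the nested df2, then average Var2 within each bin
df3 <- df %>%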
group_by(ID) %>%
mutate(
proportion = (increment - min(increment)) / diff(range(increment))
) %>%
nest(.key = "incr") %>%
left_join(df2, by = "ID") %>%
mutate(
incr = pmap(list(incr, dist),
function(a, b) {
zz <- tail(a$proportion, n = -1) -
(tail(a$proportion, n = -1) - head(a$proportion, n = -1)) / 2
a$Var2 <- as.numeric(
by(b$Var2, cut(b$proportion, c(-1, zz, 2), labels = FALSE), mean)
)
a
})
)
Now that the relevant pieces are set up in df2, we do the same for df and combine the two side by side:
df3 <- df %>%
group_by(ID) %>%
mutate(
proportion = (increment - min(increment)) / diff(range(increment))
) %>%
nest(.key = "incr") %>%
left_join(df2, by = "ID")
df3
# # A tibble: 4 × 3
# ID incr dist
# <fctr> <list> <list>
# 1 2006_01 <tibble [25 × 3]> <tibble [50 × 3]>
# 2 2006_02 <tibble [25 × 3]> <tibble [50 × 3]>
# 3 2006_03 <tibble [25 × 3]> <tibble [50 × 3]>
# 4 2006_04 <tibble [25 × 3]> <tibble [50 × 3]>
Note that this lets us associate x rows of one data.frame with y rows of the other (here 25 and 50) when joining by ID.
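To see why the nesting helps, here is a small base R sketch with hypothetical miniature tables (df_small and df2_small are made up for illustration). A plain join by ID pairs every coarse row with every fine row of the same ID, whereas keeping one entry per ID leaves the two tables side by side at their own sizes.

```r
# Hypothetical miniature versions of df and df2: 3 and 6 rows per ID
df_small  <- data.frame(ID = rep(c("2006_01", "2006_02"), each = 3),
                        increment = rep(1:3, times = 2))
df2_small <- data.frame(ID = rep(c("2006_01", "2006_02"), each = 6),
                        distance = rep(1:6, times = 2))

# Plain join by ID: every coarse row is paired with every fine row of its ID
flat <- merge(df_small, df2_small, by = "ID")
print(nrow(flat))  # 2 IDs * 3 * 6 = 36 rows

# One entry per ID instead: a rough base R analogue of nesting
incr <- split(df_small, df_small$ID)
dist <- split(df2_small, df2_small$ID)
print(sapply(incr, nrow))  # 3 rows per ID
print(sapply(dist, nrow))  # 6 rows per ID
```

split() is used here only as a rough stand-in for nest(); the answer's list columns carry the same idea inside a single data frame.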
df3 %>%
mutate(
incr = pmap(list(incr, dist),
function(a, b) {
# offset between breaks, ...
breaks <- tail(a$proportion, n = -1) -
(tail(a$proportion, n = -1) - head(a$proportion, n = -1)) / 2
# ... with bookends to ensure 100% membership
breaks <- c(-1, breaks, 2)
a$Var2 <- as.numeric(
by(b$Var2, cut(b$proportion, breaks), mean)
)
a
})
) %>%
select(ID, incr) %>%
unnest() %>%
select(-proportion)
# # A tibble: 100 × 4
# ID increment Var1 Var2
# <fctr> <dbl> <dbl> <dbl>
# 1 2006_01 1 -0.84085548 -0.12073489
# 2 2006_01 2 1.38435934 1.35332759
# 3 2006_01 3 -1.25549186 1.05204780
# 4 2006_01 4 0.07014277 0.37055960
# 5 2006_01 5 1.71144087 0.81060839
# 6 2006_01 6 -0.60290798 -0.41412345
# 7 2006_01 7 -0.47216639 0.09643082
# 8 2006_01 8 -0.63537131 -0.45411977
# 9 2006_01 9 -0.28577363 -0.48124606
# 10 2006_01 10 0.13810822 0.34763251
# # ... with 90 more rows
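The core of the function passed to pmap is the binning step: breaks are placed halfway between neighbouring coarse positions, each fine point is assigned to a bin with cut(), and the fine values are averaged per bin. A minimal base R sketch on made-up numbers (coarse, fine and value are hypothetical, chosen so the result is easy to check by hand):

```r
# Positions along one transect, both already scaled to 0..1
coarse <- c(0, 0.5, 1)                 # low-resolution positions
fine   <- c(0, 0.1, 0.3, 0.6, 0.7, 1)  # high-resolution positions
value  <- c(10, 20, 30, 40, 50, 60)    # measurements at the fine positions

# Midpoints between neighbouring coarse positions; the same quantity as
# tail(x, -1) - (tail(x, -1) - head(x, -1)) / 2 in the answer's code
mid <- (head(coarse, -1) + tail(coarse, -1)) / 2   # 0.25, 0.75

# Bookends below 0 and above 1 so that every fine point falls in some bin
breaks <- c(-1, mid, 2)

# Keep the result as a factor so that all bins stay represented
bin <- cut(fine, breaks)

# One mean per coarse position
means <- as.numeric(tapply(value, bin, mean))
print(data.frame(coarse = coarse, mean_value = means))  # 15, 40, 60
```

Because cut() returns a factor that carries every interval as a level, a bin containing no fine points would show up as NA instead of shortening the result, which keeps the output aligned with the coarse rows.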