Binning a variable based on column values

Time: 2017-08-14 23:16:55

Tags: r

This is hard to describe (hence the vague title), so I'll just give some sample data showing what I want. I have two data frames:

set.seed(5)    
df = data.frame(ID  = sort(rep(c("2006_01", "2006_02", "2006_03", "2006_04"), length.out = 100)), 
            increment = rep(seq(from = 1, to = 25, by = 1), length.out = 100),
            Var1 = rnorm(100))

set.seed(23)
df2 = data.frame(ID  = sort(rep(c("2006_01", "2006_02", "2006_03", "2006_04"), length.out = 200)), 
             distance = rep(seq(from = 1, to = 50, by = 1), length.out = 200), 
             Var2 = rnorm(200))

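Both are essentially transects, where $increment and $distance measure the distance along the transect. The longer one has twice as many measurements because it has a higher resolution, but it represents the same transects. I want to join them based on the proportional distance along the transect. I calculated a proportion column for each with the following code: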

library(plyr) # ddply
df = ddply(df, "ID", transform, proportion = increment/max(increment))

df2 = ddply(df2, "ID", transform, proportion = distance/max(distance))

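The output I want is the two data frames joined (keeping Var1 and Var2). Because df2 has the higher resolution, the only way to join it to df is to bin it, but my real data are not as regular as the sample data, so I need to bin it on the proportion column, computing the mean of the df2 values that fall between each set of proportion values in df.

To summarize, I am trying to join datasets of different resolutions by averaging all points of the higher-resolution dataset that fall within each interval of the lower-resolution one.

-edit: attempt to add the desired output-

The first few rows of the output data frame would look like this:

ID        increment  Var1         Var2
2006_001  1          -0.84085548  -0.1207349
2006_001  2           1.38435934   1.353328
2006_001  3          -1.25549186   1.052048
2006_001  4           0.07014277   0.3705596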

1 answer:

Answer 0 (score: 1)

One approach uses a few of the tidy packages.

library(dplyr)
library(tidyr) # nest, unnest
library(purrr) # pmap

Since the two data.frames have different dimensions, I find it convenient to nest the data into a list column.

df2 <- df2 %>%
  group_by(ID) %>%
  mutate(
    proportion = (distance - min(distance)) / diff(range(distance))
  ) %>%
  nest(.key = "dist")
df2
# # A tibble: 4 × 2
#        ID              dist
#    <fctr>            <list>
# 1 2006_01 <tibble [50 × 3]>
# 2 2006_02 <tibble [50 × 3]>
# 3 2006_03 <tibble [50 × 3]>
# 4 2006_04 <tibble [50 × 3]>
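
Note that the rescaling above is min-max, (x - min(x)) / diff(range(x)), rather than the question's x / max(x), so the first point of each transect lands at exactly 0. A minimal base R sketch of the difference follows; the vectors x and id are made-up values for illustration only, not data from the question.

```r
# Toy distances for illustration (hypothetical values, not from the question).
x  <- c(1, 2, 3, 4, 5)
id <- c("a", "a", "a", "b", "b")

# Question's scaling: divide by the maximum, so the first point is 1/5, not 0.
prop_max <- x / max(x)

# Answer's scaling: min-max, so the values span exactly [0, 1].
prop_minmax <- (x - min(x)) / diff(range(x))

# Per-group min-max scaling in base R with ave(), one group per value of id.
prop_by_id <- ave(x, id, FUN = function(v) (v - min(v)) / diff(range(v)))

print(prop_max)
print(prop_minmax)
print(prop_by_id)
```

With min-max scaling both datasets cover the same [0, 1] range per ID regardless of where their measurements start, which may matter for irregular real data.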

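# Condensed version of the remaining steps; each one is explained below.
df3 <- df %>%
  group_by(ID) %>%
  mutate(
    # rescale increment to [0, 1] within each ID
    proportion = (increment - min(increment)) / diff(range(increment))
  ) %>%
  nest(.key = "incr") %>%
  left_join(df2, by = "ID") %>%
  mutate(
    incr = pmap(list(incr, dist),
                function(a, b) {
                  # midpoints between consecutive proportions of the coarser data
                  zz <- tail(a$proportion, n = -1) -
                    (tail(a$proportion, n = -1) - head(a$proportion, n = -1)) / 2
                  # mean of Var2 per bin; -1 and 2 are bookends so every point is binned
                  a$Var2 <- as.numeric(
                    by(b$Var2, cut(b$proportion, c(-1, zz, 2), labels = FALSE), mean)
                  )
                  a
                })
  )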

Now that the relevant pieces are set up in df2, we do the same for df and combine the two side by side:

df3 <- df %>%
  group_by(ID) %>%
  mutate(
    proportion = (increment - min(increment)) / diff(range(increment))
  ) %>%
  nest(.key = "incr") %>%
  left_join(df2, by = "ID")
df3
# # A tibble: 4 × 3
#        ID              incr              dist
#    <fctr>            <list>            <list>
# 1 2006_01 <tibble [25 × 3]> <tibble [50 × 3]>
# 2 2006_02 <tibble [25 × 3]> <tibble [50 × 3]>
# 3 2006_03 <tibble [25 × 3]> <tibble [50 × 3]>
# 4 2006_04 <tibble [25 × 3]> <tibble [50 × 3]>

Note that this lets us associate x rows of one data.frame with y rows of the other when joining by ID.
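
The same pairing idea can be sketched in base R with split() and Map() instead of nested list columns. This is only an illustrative analogue, not the answer's method; the data frames coarse and fine and their columns are hypothetical.

```r
# Two hypothetical tables with different numbers of rows per ID.
coarse <- data.frame(ID = c("a", "a", "b", "b"),
                     pos = c(0, 1, 0, 1))
fine   <- data.frame(ID = c("a", "a", "a", "b", "b", "b"),
                     pos = c(0, 0.5, 1, 0, 0.5, 1),
                     val = c(1, 2, 3, 4, 5, 6))

# split() yields one data frame per ID, much like a nested list column.
coarse_by_id <- split(coarse, coarse$ID)
fine_by_id   <- split(fine, fine$ID)

# Map() walks both lists in parallel, so the function receives the matching
# pieces for one ID at a time, whatever their row counts are.
ids <- names(coarse_by_id)
sizes <- Map(function(a, b) c(coarse = nrow(a), fine = nrow(b)),
             coarse_by_id[ids], fine_by_id[ids])
print(sizes)
```

Indexing both lists by the same ids vector plays the role of the join by ID: it guarantees that each pair passed to the function belongs to the same transect.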

df3 %>%
  mutate(
    incr = pmap(list(incr, dist),
                function(a, b) {
                  # offset between breaks, ...
                  breaks <- tail(a$proportion, n = -1) -
                    (tail(a$proportion, n = -1) - head(a$proportion, n = -1)) / 2
                  # ... with bookends to ensure 100% membership
                  breaks <- c(-1, breaks, 2)
                  a$Var2 <- as.numeric(
                    by(b$Var2, cut(b$proportion, breaks), mean)
                  )
                  a
                })
  ) %>%
  select(ID, incr) %>%
  unnest() %>%
  select(-proportion)
# # A tibble: 100 × 4
#         ID increment        Var1        Var2
#     <fctr>     <dbl>       <dbl>       <dbl>
# 1  2006_01         1 -0.84085548 -0.12073489
# 2  2006_01         2  1.38435934  1.35332759
# 3  2006_01         3 -1.25549186  1.05204780
# 4  2006_01         4  0.07014277  0.37055960
# 5  2006_01         5  1.71144087  0.81060839
# 6  2006_01         6 -0.60290798 -0.41412345
# 7  2006_01         7 -0.47216639  0.09643082
# 8  2006_01         8 -0.63537131 -0.45411977
# 9  2006_01         9 -0.28577363 -0.48124606
# 10 2006_01        10  0.13810822  0.34763251
# # ... with 90 more rows
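
To make the binning step easier to follow, here is a small worked sketch of the midpoint breaks on made-up numbers. It swaps by() for tapply() for brevity, and the names low, high, val, high2 and val2 are hypothetical stand-ins for the proportion and Var2 columns.

```r
# Hypothetical positions (as proportions) for one transect at two resolutions.
low  <- c(0, 0.5, 1)                  # coarse positions, like df's proportion
high <- c(0, 0.2, 0.4, 0.6, 0.8, 1)   # fine positions, like df2's proportion
val  <- c(10, 20, 30, 40, 50, 60)     # fine-resolution values, like Var2

# Midpoints between consecutive coarse positions ...
mids <- head(low, -1) + diff(low) / 2
# ... with bookends so that every fine point falls into some bin.
breaks <- c(-1, mids, 2)

# One mean per coarse position: bins are (-1, 0.25], (0.25, 0.75], (0.75, 2].
means <- as.numeric(tapply(val, cut(high, breaks), mean))
print(means)

# With irregular data a bin can be empty. Because cut() returns a factor that
# keeps all levels, tapply() yields NA there and the length still matches.
high2 <- c(0, 0.1, 0.9, 1)
val2  <- c(1, 3, 5, 7)
means2 <- as.numeric(tapply(val2, cut(high2, breaks), mean))
print(means2)
```

Each coarse point therefore receives the mean of the fine points nearest to it. Keeping the factor output of cut(), rather than integer codes from labels = FALSE, appears to be the safer choice for irregular data, since empty bins then show up as NA instead of shortening the result.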