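Calculate the mean across several columns of df2, which may vary according to a variable "var1" of df1, and add the value to a new variable in df1

Date: 2019-05-17 18:00:08

Tags: r dplyr tidyverse

I have a data frame df1 that summarizes the depth of different fish at different times.

I also have df2, which summarizes the current intensity at a specific location, recorded every three hours from the surface down to 39 m depth in 8 m layers (m0-7, m8-15, m16-23, m24-31, m32-39). For example: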

df1<-data.frame(Datetime=c("2016-08-01 15:34:07","2016-08-01 16:25:16","2016-08-01 17:29:16","2016-08-01 18:33:16","2016-08-01 20:54:16","2016-08-01 22:48:16"),Site=c("BD","HG","BD","BD","BD","BD"),Ind=c(16,17,19,16,17,16), Depth=c(5.3,24,36.4,42,NA,22.1))
df1$Datetime<-as.POSIXct(df1$Datetime, format="%Y-%m-%d %H:%M:%S",tz="UTC")

> df1
             Datetime Site Ind Depth
1 2016-08-01 15:34:07   BD  16   5.3
2 2016-08-01 16:25:16   HG  17  24.0
3 2016-08-01 17:29:16   BD  19  36.4
4 2016-08-01 18:33:16   BD  16  42.0
5 2016-08-01 20:54:16   BD  17    NA
6 2016-08-01 22:48:16   BD  16  22.1

df2<-data.frame(Datetime=c("2016-08-01 12:00:00","2016-08-01 15:00:00","2016-08-01 18:00:00","2016-08-01 21:00:00","2016-08-02 00:00:00"), Site=c("BD","BD","BD","BD","BD"),var1=c(2.75,4,6.75,2.25,4.3),var2=c(3,4,4.75,3,2.1),var3=c(2.75,4,5.75,2.25,1.4),var4=c(3.25,3,6.5,2.75,3.4),var5=c(3,4,4.75,3,1.7))
df2$Datetime<-as.POSIXct(df2$Datetime, format="%Y-%m-%d %H:%M:%S",tz="UTC")
colnames(df2)<-c("Datetime","Site","m0-7","m8-15","m16-23","m24-31","m32-39")

> df2
             Datetime Site m0-7 m8-15 m16-23 m24-31 m32-39
1 2016-08-01 12:00:00   BD 2.75  3.00   2.75   3.25   3.00
2 2016-08-01 15:00:00   BD 4.00  4.00   4.00   3.00   4.00
3 2016-08-01 18:00:00   BD 6.75  4.75   5.75   6.50   4.75
4 2016-08-01 21:00:00   BD 2.25  3.00   2.25   2.75   3.00
5 2016-08-02 00:00:00   BD 4.30  2.10   1.40   3.40   1.70

I want to create a variable in df1 that reflects the mean current in the depth layers where the fish was NOT located. For example, if the fish was at 20 m depth, which corresponds to layer m16-23, I want to know the mean current across layers m0-7, m8-15, m24-31 and m32-39.

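Note 1: If a fish is deeper than 39 m, I treat it as if it were in the deepest layer (m32-39). Row 4 of df1 is an example.

Note 2: Since the currents are recorded every three hours, each time in df2$Datetime stands for the hour and a half before it and the hour and a half after it. That is, the current intensity given in df2 at 21:00:00 reflects the currents between 19:30:00 and 22:30:00. The same holds for the other times.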

I would expect something like this:

[expected output not preserved in this copy]

Does anyone know how to do it?

3 Answers:

Answer 0: (score: 2)

I would approach this in two steps:

  1. Make a lookup table with avg_speed_elsewhere for each Datetime, Site and depth range in df2.
  2. Join it to df1.

Here is the lookup table:

library(tidyverse)
df2_long <- df2 %>%
  gather(depth_rng, speed, `m0-7`:`m32-39`) %>%
  separate(depth_rng, c("min_depth", "max_depth")) %>%
  mutate_at(vars(matches("depth")), parse_number) %>%
  # EDIT -- added to make deep category cover >39 too
  mutate(max_depth = if_else(max_depth == 39, 10000, max_depth)) %>%
  group_by(Datetime, Site) %>%
  # Avg Speed elsewhere is the sum of all speeds, minus this speed, all divided by 4.
  mutate(avg_speed_elsewhere = (sum(speed) - speed) / 4)

> df2_long
# A tibble: 25 x 6
# Groups:   Datetime, Site [5]
   Datetime            Site  min_depth max_depth speed avg_speed_elsewhere
   <dttm>              <fct>     <dbl>     <dbl> <dbl>               <dbl>
 1 2016-08-18 12:00:00 BD            0         7  2.75                3   
 2 2016-08-18 15:00:00 BD            0         7  4                   3.75
 3 2016-08-18 18:00:00 BD            0         7  6.75                5.44
 4 2016-08-18 21:00:00 BD            0         7  2.25                2.75
 5 2016-08-19 00:00:00 BD            0         7  4.3                 2.15
 6 2016-08-18 12:00:00 BD            8        15  3                   2.94
 7 2016-08-18 15:00:00 BD            8        15  4                   3.75
 8 2016-08-18 18:00:00 BD            8        15  4.75                5.94
 9 2016-08-18 21:00:00 BD            8        15  3                   2.56
10 2016-08-19 00:00:00 BD            8        15  2.1                 2.7 
# ... with 15 more rows
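
As a quick sanity check of the "sum minus self" shortcut used in the lookup table, here is a minimal base-R sketch. It uses the 18:00 row of df2 from the question; the names `speeds`, `avg_elsewhere` and `check` are made up for this illustration. It assumes there are no missing speeds, since a single NA would propagate through sum().

```r
# Speeds of the five layers at 2016-08-01 18:00:00 (taken from df2 in the question)
speeds <- c(`m0-7` = 6.75, `m8-15` = 4.75, `m16-23` = 5.75,
            `m24-31` = 6.5, `m32-39` = 4.75)

# Shortcut: mean of the other layers = (total - own value) / (n - 1)
avg_elsewhere <- (sum(speeds) - speeds) / (length(speeds) - 1)

# Cross-check by explicitly dropping each layer in turn
check <- sapply(seq_along(speeds), function(i) mean(speeds[-i]))
stopifnot(isTRUE(all.equal(unname(avg_elsewhere), check)))

avg_elsewhere[["m0-7"]]  # 5.4375
```

Dividing by `length(speeds) - 1` instead of a hard-coded 4 keeps the shortcut valid if the number of layers changes.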

I expect this will work, but the data you provided do not overlap, so I am not sure:

df1 %>%
  # EDIT - replaced floor_date with round_date
  mutate(Datetime_3hr = lubridate::round_date(Datetime, "3 hour")) %>%
  left_join(df2_long, by = c("Site", "Datetime_3hr" = "Datetime")) %>%
  filter(Depth >= min_depth & Depth < max_depth + 1 | is.na(Depth))
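
To see why rounding rather than flooring fits Note 2 of the question (each current record covers 1.5 hours on either side), here is a base-R sketch that mimics rounding to the nearest 3-hour mark with plain arithmetic. The helpers `to_utc`, `round_3h` and `floor_3h` are hypothetical stand-ins written for this illustration, not lubridate functions, and the sketch assumes all times are in UTC.

```r
step <- 3 * 3600  # three hours in seconds

# Convert seconds since the epoch back to a UTC timestamp
to_utc <- function(secs) as.POSIXct(secs, origin = "1970-01-01", tz = "UTC")

# Nearest 3-hour mark versus the previous 3-hour mark
round_3h <- function(x) to_utc(round(as.numeric(x) / step) * step)
floor_3h <- function(x) to_utc(floor(as.numeric(x) / step) * step)

obs <- as.POSIXct("2016-08-01 20:54:16", tz = "UTC")
fmt <- "%Y-%m-%d %H:%M:%S"

format(round_3h(obs), fmt)  # "2016-08-01 21:00:00"
format(floor_3h(obs), fmt)  # "2016-08-01 18:00:00"
```

The observation at 20:54:16 lies inside the 19:30:00 to 22:30:00 window of the 21:00:00 record, which is what rounding picks; flooring would assign it to the 18:00:00 record instead.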

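Answer 1: (score: 1)

With data.table, you can do a rolling join between the two tables so that the depth variable is associated with the current variables even when the times do not match exactly. A rolling join links each row of one table to the row of the other table with the closest time, depending on the option you choose. With roll = Inf, as used below, each df1 row picks up the most recent df2 record at or before its own time. I changed some of the data so that the dates match.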

library(data.table)

df1<-data.frame(Datetime=c("2016-08-01 15:34:07","2016-08-01 16:25:16","2016-08-01 17:29:16","2016-08-01 18:33:16","2016-08-01 20:54:16","2016-08-01 22:48:16"),Site=c("BD","HG","BD","BD","BD","BD"),Ind=c(16,17,19,16,17,16), Depth=c(5.3,24,36.4,42,NA,22.1))
df1$Datetime<-as.POSIXct(df1$Datetime, format="%Y-%m-%d %H:%M:%S",tz="UTC")

df2<-data.frame(Datetime=c("2016-08-01 12:00:00","2016-08-01 15:00:00","2016-08-01 18:00:00","2016-08-01 21:00:00","2016-08-02 00:00:00"), Site=c("BD","BD","BD","BD","BD"),var1=c(2.75,4,6.75,2.25,4.3),var2=c(3,4,4.75,3,2.1),var3=c(2.75,4,5.75,2.25,1.4),var4=c(3.25,3,6.5,2.75,3.4),var5=c(3,4,4.75,3,1.7))
df2$Datetime<-as.POSIXct(df2$Datetime, format="%Y-%m-%d %H:%M:%S",tz="UTC")
colnames(df2)<-c("Datetime","Site","m0-7","m8-15","m16-23","m24-31","m32-39")

setDT(df1)
setDT(df2)

setkey(df1, Site, Datetime)
setkey(df2, Site, Datetime)

df_merge = df2[df1, roll = Inf]

Then I use dplyr's case_when to compute the mean current at the other depths:

library(dplyr)

df_merge[, current_elsewhere := case_when(
  is.na(Depth) ~ NA_real_,
  # each layer spans 8 m, so the upper bounds are 8, 16, 24 and 32
  Depth < 8 ~ (`m8-15` + `m16-23` + `m24-31` + `m32-39`)/4,
  Depth < 16 ~ (`m0-7` + `m16-23` + `m24-31` + `m32-39`)/4,
  Depth < 24 ~ (`m0-7` + `m8-15` + `m24-31` + `m32-39`)/4,
  Depth < 32 ~ (`m0-7` + `m8-15` + `m16-23` + `m32-39`)/4,
  # everything deeper, including depths beyond 39 m, counts as m32-39
  TRUE ~ (`m0-7` + `m8-15` + `m16-23` + `m24-31`)/4)]

df_merge
              Datetime Site m0-7 m8-15 m16-23 m24-31 m32-39 Ind Depth current_elsewhere
1: 2016-08-01 15:34:07   BD 4.00  4.00   4.00   3.00   4.00  16   5.3            3.7500
2: 2016-08-01 17:29:16   BD 4.00  4.00   4.00   3.00   4.00  19  36.4            3.7500
3: 2016-08-01 18:33:16   BD 6.75  4.75   5.75   6.50   4.75  16  42.0            5.9375
4: 2016-08-01 20:54:16   BD 6.75  4.75   5.75   6.50   4.75  17    NA                NA
5: 2016-08-01 22:48:16   BD 2.25  3.00   2.25   2.75   3.00  16  22.1            2.7500
6: 2016-08-01 16:25:16   HG   NA    NA     NA     NA     NA  17  24.0                NA
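
The choice of the roll option matters for the question's Note 2. The following sketch compares roll = Inf with roll = "nearest" on a toy example; the tables `currents` and `obs` are made up for this illustration and reuse two m0-7 values from the question. It assumes the data.table package is installed.

```r
library(data.table)

# Two current records, three hours apart
currents <- data.table(
  Site = "BD",
  Datetime = as.POSIXct(c("2016-08-01 15:00:00", "2016-08-01 18:00:00"), tz = "UTC"),
  speed = c(4, 6.75)
)

# One fish observation, 31 minutes before the 18:00 record
obs <- data.table(
  Site = "BD",
  Datetime = as.POSIXct("2016-08-01 17:29:16", tz = "UTC")
)

# roll = Inf carries the last earlier record forward
locf <- currents[obs, on = .(Site, Datetime), roll = Inf]

# roll = "nearest" picks the record closest in time, earlier or later
nearest <- currents[obs, on = .(Site, Datetime), roll = "nearest"]

locf$speed     # 4
nearest$speed  # 6.75
```

This is why the observation at 17:29:16 is paired with the 15:00:00 currents in the output above, whereas a nearest-time match would pair it with the 18:00:00 record.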

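Answer 2: (score: 1)

This question poses some interesting challenges:

  1. The OP is asking for a "partial anti-join", i.e., the OP wants to aggregate the current data in df2 where Datetime and Site match but the depth layer does not.
  2. The current data df2 are given as a look-up table in which each value is associated with a depth range (depth layer) and a time range of 3 hours. So, the Depth and Datetime measured in df1 need to be mapped onto the respective ranges.

I tried different approaches but ended up with the one below, which makes no assumptions about the aggregation function. So, mean() can be called directly.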
library(data.table)
library(magrittr)

# reshape df2 from wide to long format
currents <- melt(setDT(df2), id.vars = c("Datetime", "Site"),
                 variable.name = "layer", value.name = "current")

# create columns to join on
labels <- names(df2) %>% stringr::str_subset("^m")
breaks <- c(seq(0, 32, 8), Inf)
setDT(df1)[, layer := cut(Depth, breaks = breaks, labels = labels)]
df1[, current.dt := df2[df1, on = .(Site, Datetime), 
                      roll = "nearest", x.Datetime]]

# "partial anti-join" to compute mean of other layers
currents_other_layers <- 
  currents[df1, on = .(Site, Datetime = current.dt)][
    layer != i.layer, mean(current), by = .(i.Datetime, Site)]

# append result column
df1[currents_other_layers, on = .(Site, Datetime = i.Datetime), current.mean := i.V1]
df1

This reproduces the OP's expected result:

              Datetime Site Ind Depth  layer          current.dt current.mean
1: 2016-08-01 15:34:07   BD  16   5.3   m0-7 2016-08-01 15:00:00       3.7500
2: 2016-08-01 16:25:16   HG  17  24.0 m16-23                <NA>           NA
3: 2016-08-01 17:29:16   BD  19  36.4 m32-39 2016-08-01 18:00:00       5.9375
4: 2016-08-01 18:33:16   BD  16  42.0 m32-39 2016-08-01 18:00:00       5.9375
5: 2016-08-01 20:54:16   BD  17    NA   <NA> 2016-08-01 21:00:00           NA
6: 2016-08-01 22:48:16   BD  16  22.1 m16-23 2016-08-02 00:00:00       2.8750

Explanation

df2 is reshaped from wide to long format. This allows joining and anti-joining on the layer column.

currents

               Datetime Site  layer current
 1: 2016-08-01 12:00:00   BD   m0-7    2.75
 2: 2016-08-01 15:00:00   BD   m0-7    4.00
 3: 2016-08-01 18:00:00   BD   m0-7    6.75
 4: 2016-08-01 21:00:00   BD   m0-7    2.25
 5: 2016-08-02 00:00:00   BD   m0-7    4.30
 6: 2016-08-01 12:00:00   BD  m8-15    3.00
 7: 2016-08-01 15:00:00   BD  m8-15    4.00
 8: 2016-08-01 18:00:00   BD  m8-15    4.75
 9: 2016-08-01 21:00:00   BD  m8-15    3.00
10: 2016-08-02 00:00:00   BD  m8-15    2.10
11: 2016-08-01 12:00:00   BD m16-23    2.75
12: 2016-08-01 15:00:00   BD m16-23    4.00
13: 2016-08-01 18:00:00   BD m16-23    5.75
14: 2016-08-01 21:00:00   BD m16-23    2.25
15: 2016-08-02 00:00:00   BD m16-23    1.40
16: 2016-08-01 12:00:00   BD m24-31    3.25
17: 2016-08-01 15:00:00   BD m24-31    3.00
18: 2016-08-01 18:00:00   BD m24-31    6.50
19: 2016-08-01 21:00:00   BD m24-31    2.75
20: 2016-08-02 00:00:00   BD m24-31    3.40
21: 2016-08-01 12:00:00   BD m32-39    3.00
22: 2016-08-01 15:00:00   BD m32-39    4.00
23: 2016-08-01 18:00:00   BD m32-39    4.75
24: 2016-08-01 21:00:00   BD m32-39    3.00
25: 2016-08-02 00:00:00   BD m32-39    1.70
               Datetime Site  layer current

Now, df1 has to be amended to include columns that correspond to layer and Datetime in currents.

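For Depth, the cut() function is used. The upper bound of the last layer, m32-39, is extended to Inf, so that all depths greater than 32 m are included in this layer, as requested by the OP.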
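
The mapping from depth to layer can be tried in isolation. This base-R sketch applies the same breaks and labels to the depths from the question's df1; the labels are written out by hand here instead of being derived from names(df2), and `depth` is a plain vector made up for the illustration.

```r
labels <- c("m0-7", "m8-15", "m16-23", "m24-31", "m32-39")
breaks <- c(seq(0, 32, 8), Inf)          # 0, 8, 16, 24, 32, Inf

depth <- c(5.3, 24, 36.4, 42, NA, 22.1)  # Depth column of df1
layer <- cut(depth, breaks = breaks, labels = labels)

as.character(layer)
# "m0-7" "m16-23" "m32-39" "m32-39" NA "m16-23"
```

Note that cut() closes intervals on the right by default, so a depth of exactly 24 m lands in m16-23. If depths that fall exactly on a boundary should belong to the deeper layer instead, passing right = FALSE would be one way to get that behaviour.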

For Datetime, a rolling join to the nearest Datetime in df2 is used. This is possible because df2$Datetime denotes the midpoint of the 3-hour time range.

After df1 has been prepared, we can do the "partial anti-join". Unfortunately, data.table's non-equi joins do not accept the != operator. So, we cannot write

currents[df1, on = .(Datetime = current.dt, Site, layer != layer)]

directly but have to use a workaround in which we first select the rows where we expect matches and then do an anti-join:

currents[df1, on = .(Datetime = current.dt, Site)][
    !df1, on = .(Datetime = current.dt, Site, layer)]

               Datetime Site  layer current          i.Datetime Ind Depth i.layer
 1: 2016-08-01 15:00:00   BD  m8-15    4.00 2016-08-01 15:34:07  16   5.3    m0-7
 2: 2016-08-01 15:00:00   BD m16-23    4.00 2016-08-01 15:34:07  16   5.3    m0-7
 3: 2016-08-01 15:00:00   BD m24-31    3.00 2016-08-01 15:34:07  16   5.3    m0-7
 4: 2016-08-01 15:00:00   BD m32-39    4.00 2016-08-01 15:34:07  16   5.3    m0-7
 5: 2016-08-01 18:00:00   BD   m0-7    6.75 2016-08-01 17:29:16  19  36.4  m32-39
 6: 2016-08-01 18:00:00   BD  m8-15    4.75 2016-08-01 17:29:16  19  36.4  m32-39
 7: 2016-08-01 18:00:00   BD m16-23    5.75 2016-08-01 17:29:16  19  36.4  m32-39
 8: 2016-08-01 18:00:00   BD m24-31    6.50 2016-08-01 17:29:16  19  36.4  m32-39
 9: 2016-08-01 18:00:00   BD   m0-7    6.75 2016-08-01 18:33:16  16  42.0  m32-39
10: 2016-08-01 18:00:00   BD  m8-15    4.75 2016-08-01 18:33:16  16  42.0  m32-39
11: 2016-08-01 18:00:00   BD m16-23    5.75 2016-08-01 18:33:16  16  42.0  m32-39
12: 2016-08-01 18:00:00   BD m24-31    6.50 2016-08-01 18:33:16  16  42.0  m32-39
13: 2016-08-01 21:00:00   BD   m0-7    2.25 2016-08-01 20:54:16  17    NA    <NA>
14: 2016-08-01 21:00:00   BD  m8-15    3.00 2016-08-01 20:54:16  17    NA    <NA>
15: 2016-08-01 21:00:00   BD m16-23    2.25 2016-08-01 20:54:16  17    NA    <NA>
16: 2016-08-01 21:00:00   BD m24-31    2.75 2016-08-01 20:54:16  17    NA    <NA>
17: 2016-08-01 21:00:00   BD m32-39    3.00 2016-08-01 20:54:16  17    NA    <NA>
18: 2016-08-02 00:00:00   BD   m0-7    4.30 2016-08-01 22:48:16  16  22.1  m16-23
19: 2016-08-02 00:00:00   BD  m8-15    2.10 2016-08-01 22:48:16  16  22.1  m16-23
20: 2016-08-02 00:00:00   BD m24-31    3.40 2016-08-01 22:48:16  16  22.1  m16-23
21: 2016-08-02 00:00:00   BD m32-39    1.70 2016-08-01 22:48:16  16  22.1  m16-23
22:                <NA>   HG   <NA>      NA 2016-08-01 16:25:16  17  24.0  m16-23
               Datetime Site  layer current          i.Datetime Ind Depth i.layer

This can be aggregated with an arbitrary aggregation function (there is no need to add up single columns selectively by hand):

currents_other_layers <- 
  currents[df1, on = .(Datetime = current.dt, Site)][
    !df1, on = .(Datetime = current.dt, Site, layer)][
      !is.na(Depth), mean(current), by = .(i.Datetime, Site)]

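currents_other_layers

            i.Datetime Site     V1
1: 2016-08-01 15:34:07   BD 3.7500
2: 2016-08-01 17:29:16   BD 5.9375
3: 2016-08-01 18:33:16   BD 5.9375
4: 2016-08-01 22:48:16   BD 2.8750
5: 2016-08-01 16:25:16   HG     NA

This result contains the mean current of all layers other than the observed layer. Note that the grouping is by i.Datetime, which refers to df1$Datetime, and by Site. Rows where Depth is missing in df1 are omitted in order to match the OP's expected result.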
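
The anti-join step can also be looked at on its own. In this sketch, `currents_1800` holds the five layer values recorded at 18:00:00 in the question, and `fish` is a one-row table naming the layer the fish occupies; both names are made up for this illustration, and the data.table package is assumed to be installed.

```r
library(data.table)

# Currents of all five layers at 2016-08-01 18:00:00 (from the question's df2)
currents_1800 <- data.table(
  layer   = c("m0-7", "m8-15", "m16-23", "m24-31", "m32-39"),
  current = c(6.75, 4.75, 5.75, 6.5, 4.75)
)

# The fish was observed in the deepest layer
fish <- data.table(layer = "m32-39")

# Anti-join: keep only the rows of currents_1800 with NO match in fish
others <- currents_1800[!fish, on = .(layer)]

mean(others$current)  # 5.9375
```

Because the anti-join returns ordinary rows, any aggregation function (median(), max() and so on) could be applied in place of mean().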

The final update join appends the result column to df1.
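
For readers unfamiliar with update joins, here is a minimal sketch of the pattern on toy data. The tables `obs` and `res` and the key column `id` are hypothetical and only stand in for df1 and currents_other_layers; the data.table package is assumed to be installed.

```r
library(data.table)

# Target table (stands in for df1)
obs <- data.table(id = c(1, 2, 3), depth = c(5.3, 24, 36.4))

# Aggregated results for some of the ids (stands in for currents_other_layers)
res <- data.table(id = c(1, 3), V1 = c(3.75, 5.9375))

# Update join: modify obs by reference, copying V1 from res where id matches
obs[res, on = .(id), current.mean := i.V1]

obs$current.mean  # 3.7500 NA 5.9375
```

Rows of `obs` without a match in `res` keep NA in the new column, which is how the HG row and the row with missing Depth end up with NA in the final result.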