I have a data frame, df1, that summarises the depth of different fish at different times.
On the other hand, I have df2, which summarises current intensity at a specific location every three hours, from the surface down to 39 m depth, in 8 m layers (m0-7, m8-15, m16-23, m24-31 and m32-39). For example:
df1<-data.frame(Datetime=c("2016-08-01 15:34:07","2016-08-01 16:25:16","2016-08-01 17:29:16","2016-08-01 18:33:16","2016-08-01 20:54:16","2016-08-01 22:48:16"),Site=c("BD","HG","BD","BD","BD","BD"),Ind=c(16,17,19,16,17,16), Depth=c(5.3,24,36.4,42,NA,22.1))
df1$Datetime<-as.POSIXct(df1$Datetime, format="%Y-%m-%d %H:%M:%S",tz="UTC")
> df1
Datetime Site Ind Depth
1 2016-08-01 15:34:07 BD 16 5.3
2 2016-08-01 16:25:16 HG 17 24.0
3 2016-08-01 17:29:16 BD 19 36.4
4 2016-08-01 18:33:16 BD 16 42.0
5 2016-08-01 20:54:16 BD 17 NA
6 2016-08-01 22:48:16 BD 16 22.1
df2<-data.frame(Datetime=c("2016-08-01 12:00:00","2016-08-01 15:00:00","2016-08-01 18:00:00","2016-08-01 21:00:00","2016-08-02 00:00:00"), Site=c("BD","BD","BD","BD","BD"),var1=c(2.75,4,6.75,2.25,4.3),var2=c(3,4,4.75,3,2.1),var3=c(2.75,4,5.75,2.25,1.4),var4=c(3.25,3,6.5,2.75,3.4),var5=c(3,4,4.75,3,1.7))
df2$Datetime<-as.POSIXct(df2$Datetime, format="%Y-%m-%d %H:%M:%S",tz="UTC")
colnames(df2)<-c("Datetime","Site","m0-7","m8-15","m16-23","m24-31","m32-39")
> df2
Datetime Site m0-7 m8-15 m16-23 m24-31 m32-39
1 2016-08-01 12:00:00 BD 2.75 3.00 2.75 3.25 3.00
2 2016-08-01 15:00:00 BD 4.00 4.00 4.00 3.00 4.00
3 2016-08-01 18:00:00 BD 6.75 4.75 5.75 6.50 4.75
4 2016-08-01 21:00:00 BD 2.25 3.00 2.25 2.75 3.00
5 2016-08-02 00:00:00 BD 4.30 2.10 1.40 3.40 1.70
I want to create a variable in df1 that reflects the mean current in the depth layers where the fish was not. For example, if the fish is at 20 m depth, which corresponds to layer m16-23, I want to know the mean current of layers m0-7, m8-15, m24-31 and m32-39.
Note 1: If my fish is deeper than 39 m, I treat it as if it were in the deepest layer (m32-39). Row 4 of df1 is an example.
Note 2: Since currents are recorded every three hours, each time in df2$Datetime represents that time plus or minus one and a half hours. That is, the current intensity in df2 at 21:00:00 reflects the current between 19:30:00 and 22:30:00. The same applies to the other times.
I would expect a new column in df1 holding this mean for each observation.
Does anyone know how to do this?
Answer 0 (score: 2)
I'd approach this in two steps: first build a lookup table from df2, then join it to df1.
Here's the lookup table:
library(tidyverse)
df2_long <- df2 %>%
  gather(depth_rng, speed, `m0-7`:`m32-39`) %>%
  separate(depth_rng, c("min_depth", "max_depth")) %>%
  mutate_at(vars(matches("depth")), parse_number) %>%
  # EDIT -- added to make deep category cover >39 too
  mutate(max_depth = if_else(max_depth == 39, 10000, max_depth)) %>%
  group_by(Datetime, Site) %>%
  # Avg Speed elsewhere is the sum of all speeds, minus this speed, all divided by 4.
  mutate(avg_speed_elsewhere = (sum(speed) - speed) / 4)
> df2_long
# A tibble: 25 x 6
# Groups: Datetime, Site [5]
Datetime Site min_depth max_depth speed avg_speed_elsewhere
<dttm> <fct> <dbl> <dbl> <dbl> <dbl>
1 2016-08-18 12:00:00 BD 0 7 2.75 3
2 2016-08-18 15:00:00 BD 0 7 4 3.75
3 2016-08-18 18:00:00 BD 0 7 6.75 5.44
4 2016-08-18 21:00:00 BD 0 7 2.25 2.75
5 2016-08-19 00:00:00 BD 0 7 4.3 2.15
6 2016-08-18 12:00:00 BD 8 15 3 2.94
7 2016-08-18 15:00:00 BD 8 15 4 3.75
8 2016-08-18 18:00:00 BD 8 15 4.75 5.94
9 2016-08-18 21:00:00 BD 8 15 3 2.56
10 2016-08-19 00:00:00 BD 8 15 2.1 2.7
# ... with 15 more rows
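The "sum of all speeds, minus this speed, divided by 4" shortcut in the lookup table can be checked against an explicit leave-one-out mean. The following is only a minimal base-R sketch using the 18:00:00 row of df2 from the question; the names `speed`, `shortcut` and `explicit` are illustrative and not part of the answer, and it assumes none of the five speeds is NA.

```r
# Current speeds for the five layers at 2016-08-01 18:00:00 (values from df2).
speed <- c(`m0-7` = 6.75, `m8-15` = 4.75, `m16-23` = 5.75,
           `m24-31` = 6.50, `m32-39` = 4.75)

# Shortcut used in the lookup table: (total - own value) / (n - 1).
shortcut <- (sum(speed) - speed) / (length(speed) - 1)

# Explicit version: drop layer i, then average the remaining four layers.
explicit <- sapply(seq_along(speed), function(i) mean(speed[-i]))
names(explicit) <- names(speed)

print(shortcut)
# Both approaches should agree for every layer.
stopifnot(isTRUE(all.equal(shortcut, explicit)))
```

For the m0-7 layer this works out to (28.5 - 6.75) / 4 = 5.4375, which is the 5.44 shown in the printed tibble.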
I expect this will work, but the data you provided doesn't overlap, so I'm not sure:
df1 %>%
  # EDIT - replaced floor_date with round_date
  mutate(Datetime_3hr = lubridate::round_date(Datetime, "3 hour")) %>%
  left_join(df2_long, by = c("Site", "Datetime_3hr" = "Datetime")) %>%
  filter((Depth >= min_depth & Depth < max_depth + 1) | is.na(Depth))
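The edit from floor_date to round_date matters because each df2 timestamp is the centre of its three-hour window. The sketch below illustrates the difference with plain base-R arithmetic on epoch seconds; it is a stand-in for the lubridate calls, not a reimplementation of them, and the names `obs`, `step`, `floored` and `rounded` are illustrative. It assumes UTC timestamps, and its behaviour for times falling exactly on a half-way point (for example 19:30:00) may differ from lubridate.

```r
# Two fish observations from df1, parsed as UTC.
obs <- as.POSIXct(c("2016-08-01 17:29:16", "2016-08-01 20:54:16"),
                  format = "%Y-%m-%d %H:%M:%S", tz = "UTC")

step <- 3 * 60 * 60  # three hours in seconds

# Seconds since the epoch. UTC midnight is a multiple of three hours,
# so the multiples of `step` are 00:00, 03:00, 06:00, ...
secs <- as.numeric(obs)

# Flooring assigns each observation to the previous three-hour mark.
floored <- as.POSIXct(floor(secs / step) * step,
                      origin = "1970-01-01", tz = "UTC")
# Rounding assigns it to the closest three-hour mark.
rounded <- as.POSIXct(round(secs / step) * step,
                      origin = "1970-01-01", tz = "UTC")

print(data.frame(obs, floored, rounded))
```

With flooring, the 17:29:16 observation is paired with the 15:00:00 record; with rounding it is paired with 18:00:00, which matches the plus or minus 1.5 hour windows described in Note 2 of the question.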
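Answer 1 (score: 1)
Using data.table, you can do a rolling join between the two tables to associate the depth variable with the current variables even when the times do not match exactly. A rolling join matches each row of one table to the row of the other whose time is closest, in the sense determined by the roll option you choose. With roll = Inf, as used below, the most recent earlier record is carried forward. I changed the data a bit so that the dates match.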
library(data.table)
df1<-data.frame(Datetime=c("2016-08-01 15:34:07","2016-08-01 16:25:16","2016-08-01 17:29:16","2016-08-01 18:33:16","2016-08-01 20:54:16","2016-08-01 22:48:16"),Site=c("BD","HG","BD","BD","BD","BD"),Ind=c(16,17,19,16,17,16), Depth=c(5.3,24,36.4,42,NA,22.1))
df1$Datetime<-as.POSIXct(df1$Datetime, format="%Y-%m-%d %H:%M:%S",tz="UTC")
df2<-data.frame(Datetime=c("2016-08-01 12:00:00","2016-08-01 15:00:00","2016-08-01 18:00:00","2016-08-01 21:00:00","2016-08-02 00:00:00"), Site=c("BD","BD","BD","BD","BD"),var1=c(2.75,4,6.75,2.25,4.3),var2=c(3,4,4.75,3,2.1),var3=c(2.75,4,5.75,2.25,1.4),var4=c(3.25,3,6.5,2.75,3.4),var5=c(3,4,4.75,3,1.7))
df2$Datetime<-as.POSIXct(df2$Datetime, format="%Y-%m-%d %H:%M:%S",tz="UTC")
colnames(df2)<-c("Datetime","Site","m0-7","m8-15","m16-23","m24-31","m32-39")
setDT(df1)
setDT(df2)
setkey(df1, Site, Datetime)
setkey(df2, Site, Datetime)
df_merge = df2[df1, roll = Inf]
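Then I use dplyr's case_when to compute the mean current in the other layers:
library(dplyr)
df_merge[, current_elsewhere := case_when(
  is.na(Depth) ~ NA_real_,
  # 8 m layers: m0-7 covers depths in [0, 8), m8-15 covers [8, 16), and so on
  Depth < 8 ~ (`m8-15` + `m16-23` + `m24-31` + `m32-39`)/4,
  Depth < 16 ~ (`m0-7` + `m16-23` + `m24-31` + `m32-39`)/4,
  Depth < 24 ~ (`m0-7` + `m8-15` + `m24-31` + `m32-39`)/4,
  Depth < 32 ~ (`m0-7` + `m8-15` + `m16-23` + `m32-39`)/4,
  # anything deeper, including depths beyond 39 m, counts as m32-39
  TRUE ~ (`m0-7` + `m8-15` + `m16-23` + `m24-31`)/4)]
df_merge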
df_merge
Datetime Site m0-7 m8-15 m16-23 m24-31 m32-39 Ind Depth current_elsewhere
1: 2016-08-01 15:34:07 BD 4.00 4.00 4.00 3.00 4.00 16 5.3 3.7500
2: 2016-08-01 17:29:16 BD 4.00 4.00 4.00 3.00 4.00 19 36.4 3.7500
3: 2016-08-01 18:33:16 BD 6.75 4.75 5.75 6.50 4.75 16 42.0 5.9375
4: 2016-08-01 20:54:16 BD 6.75 4.75 5.75 6.50 4.75 17 NA NA
5: 2016-08-01 22:48:16 BD 2.25 3.00 2.25 2.75 3.00 16 22.1 2.7500
6: 2016-08-01 16:25:16 HG NA NA NA NA NA 17 24.0 NA
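In the output above, the 17:29:16 observation is paired with the 15:00:00 currents, which is what roll = Inf (last observation carried forward) does. Note 2 of the question describes windows centred on each df2 timestamp, which corresponds more closely to roll = "nearest". The following is a small sketch of the difference, assuming data.table is installed; the tables `cur` and `fish` are cut-down, illustrative versions of df2 and df1.

```r
library(data.table)

# Three df2 timestamps and a single current column (values from the question).
cur <- data.table(
  Datetime = as.POSIXct(c("2016-08-01 15:00:00", "2016-08-01 18:00:00",
                          "2016-08-01 21:00:00"), tz = "UTC"),
  `m0-7` = c(4.00, 6.75, 2.25)
)

# One fish observation that lies closer to 18:00 than to 15:00.
fish <- data.table(
  Datetime = as.POSIXct("2016-08-01 17:29:16", tz = "UTC")
)

# roll = Inf carries the last earlier record forward (the 15:00 row here).
locf <- cur[fish, on = "Datetime", roll = Inf]

# roll = "nearest" picks the closest record in either direction (18:00 here).
near <- cur[fish, on = "Datetime", roll = "nearest"]

print(locf[["m0-7"]])
print(near[["m0-7"]])
```

Which option is appropriate depends on how the current records are meant to be interpreted; swapping roll = Inf for roll = "nearest" in the join above would change the rows matched for observations in the second half of each three-hour window.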
Answer 2 (score: 1)
This question poses two interesting challenges.
First, the OP wants to aggregate the current data in df2 where Datetime and Site match but the depth layer does not.
Second, the current data in df2 are given as a lookup table in which each value is associated with a depth range (depth layer) and a time range of 3 hours. So the Depth and Datetime measured in df1 need to be mapped to their respective ranges.
I tried different approaches but ended up with the one below, which makes no assumptions about the aggregate function. So mean() can be called directly.
library(data.table)
library(magrittr)
# reshape df2 from wide to long format
currents <- melt(setDT(df2), id.vars = c("Datetime", "Site"),
                 variable.name = "layer", value.name = "current")
# create columns to join on
labels <- names(df2) %>% stringr::str_subset("^m")
breaks <- c(seq(0, 32, 8), Inf)
setDT(df1)[, layer := cut(Depth, breaks = breaks, labels = labels)]
df1[, current.dt := df2[df1, on = .(Site, Datetime),
                        roll = "nearest", x.Datetime]]
# "partial anti-join" to compute mean of other layers
currents_other_layers <-
  currents[df1, on = .(Site, Datetime = current.dt)][
    layer != i.layer, mean(current), by = .(i.Datetime, Site)]
# append result column
df1[currents_other_layers, on = .(Site, Datetime = i.Datetime),
    current.mean := i.V1]
df1
This reproduces the OP's expected result:
Datetime Site Ind Depth layer current.dt current.mean
1: 2016-08-01 15:34:07 BD 16 5.3 m0-7 2016-08-01 15:00:00 3.7500
2: 2016-08-01 16:25:16 HG 17 24.0 m16-23 <NA> NA
3: 2016-08-01 17:29:16 BD 19 36.4 m32-39 2016-08-01 18:00:00 5.9375
4: 2016-08-01 18:33:16 BD 16 42.0 m32-39 2016-08-01 18:00:00 5.9375
5: 2016-08-01 20:54:16 BD 17 NA <NA> 2016-08-01 21:00:00 NA
6: 2016-08-01 22:48:16 BD 16 22.1 m16-23 2016-08-02 00:00:00 2.8750
df2 is reshaped from wide to long format. This allows joining or anti-joining on the layer column.
currents
Datetime Site layer current
1: 2016-08-01 12:00:00 BD m0-7 2.75
2: 2016-08-01 15:00:00 BD m0-7 4.00
3: 2016-08-01 18:00:00 BD m0-7 6.75
4: 2016-08-01 21:00:00 BD m0-7 2.25
5: 2016-08-02 00:00:00 BD m0-7 4.30
6: 2016-08-01 12:00:00 BD m8-15 3.00
7: 2016-08-01 15:00:00 BD m8-15 4.00
8: 2016-08-01 18:00:00 BD m8-15 4.75
9: 2016-08-01 21:00:00 BD m8-15 3.00
10: 2016-08-02 00:00:00 BD m8-15 2.10
11: 2016-08-01 12:00:00 BD m16-23 2.75
12: 2016-08-01 15:00:00 BD m16-23 4.00
13: 2016-08-01 18:00:00 BD m16-23 5.75
14: 2016-08-01 21:00:00 BD m16-23 2.25
15: 2016-08-02 00:00:00 BD m16-23 1.40
16: 2016-08-01 12:00:00 BD m24-31 3.25
17: 2016-08-01 15:00:00 BD m24-31 3.00
18: 2016-08-01 18:00:00 BD m24-31 6.50
19: 2016-08-01 21:00:00 BD m24-31 2.75
20: 2016-08-02 00:00:00 BD m24-31 3.40
21: 2016-08-01 12:00:00 BD m32-39 3.00
22: 2016-08-01 15:00:00 BD m32-39 4.00
23: 2016-08-01 18:00:00 BD m32-39 4.75
24: 2016-08-01 21:00:00 BD m32-39 3.00
25: 2016-08-02 00:00:00 BD m32-39 1.70
Datetime Site layer current
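Now, df1 has to be amended with columns that correspond to layer and Datetime in currents.
For Depth, the cut() function is used. The last layer, m32-39, is extended to Inf so that all depths greater than 32 m fall into this layer, as requested by the OP.
For Datetime, a rolling join to the nearest Datetime in df2 is used. This is possible because df2$Datetime denotes the midpoint of each 3-hour time range.
After df1 has been prepared, we can do the "partial anti-join". Unfortunately, data.table's non-equi joins do not accept the != operator. So we cannot write
currents[df1, on = .(Datetime = current.dt, Site, layer != layer)]
directly, but have to use a workaround in which we first pick the rows where we expect matches and then do an anti-join:
currents[df1, on = .(Datetime = current.dt, Site)][
  !df1, on = .(Datetime = current.dt, Site, layer)]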
Datetime Site layer current i.Datetime Ind Depth i.layer
1: 2016-08-01 15:00:00 BD m8-15 4.00 2016-08-01 15:34:07 16 5.3 m0-7
2: 2016-08-01 15:00:00 BD m16-23 4.00 2016-08-01 15:34:07 16 5.3 m0-7
3: 2016-08-01 15:00:00 BD m24-31 3.00 2016-08-01 15:34:07 16 5.3 m0-7
4: 2016-08-01 15:00:00 BD m32-39 4.00 2016-08-01 15:34:07 16 5.3 m0-7
5: 2016-08-01 18:00:00 BD m0-7 6.75 2016-08-01 17:29:16 19 36.4 m32-39
6: 2016-08-01 18:00:00 BD m8-15 4.75 2016-08-01 17:29:16 19 36.4 m32-39
7: 2016-08-01 18:00:00 BD m16-23 5.75 2016-08-01 17:29:16 19 36.4 m32-39
8: 2016-08-01 18:00:00 BD m24-31 6.50 2016-08-01 17:29:16 19 36.4 m32-39
9: 2016-08-01 18:00:00 BD m0-7 6.75 2016-08-01 18:33:16 16 42.0 m32-39
10: 2016-08-01 18:00:00 BD m8-15 4.75 2016-08-01 18:33:16 16 42.0 m32-39
11: 2016-08-01 18:00:00 BD m16-23 5.75 2016-08-01 18:33:16 16 42.0 m32-39
12: 2016-08-01 18:00:00 BD m24-31 6.50 2016-08-01 18:33:16 16 42.0 m32-39
13: 2016-08-01 21:00:00 BD m0-7 2.25 2016-08-01 20:54:16 17 NA <NA>
14: 2016-08-01 21:00:00 BD m8-15 3.00 2016-08-01 20:54:16 17 NA <NA>
15: 2016-08-01 21:00:00 BD m16-23 2.25 2016-08-01 20:54:16 17 NA <NA>
16: 2016-08-01 21:00:00 BD m24-31 2.75 2016-08-01 20:54:16 17 NA <NA>
17: 2016-08-01 21:00:00 BD m32-39 3.00 2016-08-01 20:54:16 17 NA <NA>
18: 2016-08-02 00:00:00 BD m0-7 4.30 2016-08-01 22:48:16 16 22.1 m16-23
19: 2016-08-02 00:00:00 BD m8-15 2.10 2016-08-01 22:48:16 16 22.1 m16-23
20: 2016-08-02 00:00:00 BD m24-31 3.40 2016-08-01 22:48:16 16 22.1 m16-23
21: 2016-08-02 00:00:00 BD m32-39 1.70 2016-08-01 22:48:16 16 22.1 m16-23
22: <NA> HG <NA> NA 2016-08-01 16:25:16 17 24.0 m16-23
Datetime Site layer current i.Datetime Ind Depth i.layer
This can be aggregated with an arbitrary aggregate function, without having to add up individual columns selectively by hand:
currents_other_layers <-
  currents[df1, on = .(Datetime = current.dt, Site)][
    !df1, on = .(Datetime = current.dt, Site, layer)][
      !is.na(Depth), mean(current), by = .(i.Datetime, Site)]
currents_other_layers
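i.Datetime Site V1
1: 2016-08-01 15:34:07 BD 3.7500
2: 2016-08-01 17:29:16 BD 5.9375
3: 2016-08-01 18:33:16 BD 5.9375
4: 2016-08-01 22:48:16 BD 2.8750
5: 2016-08-01 16:25:16 HG NA
This result contains the mean current of all layers other than the observed one. Note that grouping is by i.Datetime, which refers to df1$Datetime, and by Site. Rows where Depth is missing in df1 are omitted to match the OP's expected result.
A final update join appends the result column to df1.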
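The depth-to-layer mapping done with cut() can be inspected in isolation. This is only a base-R sketch that reuses the Depth values from df1 and the breaks from the answer; the vector names are illustrative. It also shows one detail worth knowing: cut() uses right-closed intervals by default, so a depth of exactly 24 m lands in m16-23, and a depth of exactly 0 would get no layer at all. Passing right = FALSE (an alternative, not what the answer uses) gives left-closed intervals instead.

```r
depth  <- c(5.3, 24, 36.4, 42, NA, 22.1)   # Depth column from df1
labels <- c("m0-7", "m8-15", "m16-23", "m24-31", "m32-39")
breaks <- c(seq(0, 32, 8), Inf)            # 0, 8, 16, 24, 32, Inf

# Default behaviour: intervals (0,8], (8,16], ..., (32,Inf]
layer <- cut(depth, breaks = breaks, labels = labels)
print(data.frame(depth, layer))

# Alternative: intervals [0,8), [8,16), ..., [32,Inf)
layer_left <- cut(c(0, 24), breaks = breaks, labels = labels, right = FALSE)
print(as.character(layer_left))
```

For the sample data both variants agree except for the 24.0 m observation, which belongs to site HG and has no current data anyway.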