Look up and extract values that exceed a threshold in R

Time: 2018-08-24 20:45:03

Tags: r dplyr lookup threshold

I have two data frames:

#df1
df1 = data.frame(id = c("A","B","C","D","E"), 
                 dev = c(213.5, 225.1, 198.9, 201.0, 266.8))
df1
   id   dev
1  A 213.5
2  B 225.1
3  C 198.9
4  D 201.0
5  E 266.8   

#df2
df2 = data.frame(DateTime = seq(
  from = as.POSIXct("1986-1-1 0:00"),
  to = as.POSIXct("1986-1-2 23:00"), 
  by = "hour"), 
  cum_dd = seq(from = 185, to = 295, by = 2.3)) 
head(df2) 
             DateTime cum_dd
1 1986-01-01 00:00:00  185.0
2 1986-01-01 01:00:00  187.3
3 1986-01-01 02:00:00  189.6
4 1986-01-01 03:00:00  191.9
5 1986-01-01 04:00:00  194.2
6 1986-01-01 05:00:00  196.5

I want to add a new column to df1 that lists the earliest df2$DateTime at which df2$cum_dd exceeds df1$dev.

This is my desired result:

  id   dev             desired
1  A 213.5 1986-01-01 13:00:00
2  B 225.1 1986-01-01 18:00:00
3  C 198.9 1986-01-01 07:00:00
4  D 201.0 1986-01-01 07:00:00
5  E 266.8 1986-01-02 12:00:00

I am familiar with the min(which()) idiom, which, written as follows, returns the number of the first row in df2 where cum_dd is greater than 200:

library(dplyr)
min(which (df2$cum_dd > 200))

Essentially, I want to run this for every row in df1 (substituting df1$dev for "200") and look up/extract the corresponding df2$DateTime value instead of the row number.

I thought I was close, but it is not quite right, and I could not find a similar question on Stack Overflow:

desired <- apply(df1, 1, 
           function (x) {ddply(df2, .(DateTime), summarize, 
           min(which (df2$cum_dd > df1$dev)))}) 

Thank you very much for any solutions!

2 answers:

Answer 0: (score: 3)

# example datasets
df1 = data.frame(id = c("A","B","C","D","E"), 
                 dev = c(213.5, 225.1, 198.9, 201.0, 266.8))

df2 = data.frame(DateTime = seq(
  from = as.POSIXct("1986-1-1 0:00"),
  to = as.POSIXct("1986-1-2 23:00"), 
  by = "hour"), 
  cum_dd = seq(from = 185, to = 295, by = 2.3)) 

library(tidyverse)

df1 %>% 
  crossing(df2) %>%         # get all combinations of rows
  group_by(id, dev) %>%     # for each id and dev
  summarise(desired = min(DateTime[cum_dd > dev])) %>%  # get minimum date when cum_dd exceeds dev
  ungroup()                 # forget the grouping

# # A tibble: 5 x 3
#   id      dev desired            
#   <fct> <dbl> <dttm>             
# 1 A      214. 1986-01-01 13:00:00
# 2 B      225. 1986-01-01 18:00:00
# 3 C      199. 1986-01-01 07:00:00
# 4 D      201  1986-01-01 07:00:00
# 5 E      267. 1986-01-02 12:00:00
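
For comparison, the same lookup can be sketched in base R with the which() idiom from the question, applied once per threshold. This is only a minimal sketch under two assumptions that are not in the original post: df2 is already sorted by DateTime (so the first matching row is also the earliest), and the time zone is pinned to "UTC" purely to make the example reproducible. The name first_idx is made up for illustration.

```r
# Example data from the question (tz = "UTC" is an assumption for reproducibility)
df1 <- data.frame(id = c("A", "B", "C", "D", "E"),
                  dev = c(213.5, 225.1, 198.9, 201.0, 266.8))

df2 <- data.frame(
  DateTime = seq(from = as.POSIXct("1986-01-01 00:00", tz = "UTC"),
                 to   = as.POSIXct("1986-01-02 23:00", tz = "UTC"),
                 by   = "hour"),
  cum_dd   = seq(from = 185, to = 295, by = 2.3))

# For each threshold, find the first row of df2 whose cum_dd exceeds it.
# which(...)[1] gives NA if no row exceeds the threshold.
first_idx <- sapply(df1$dev, function(d) which(df2$cum_dd > d)[1])

# Use the row numbers to pull out the matching DateTime values
df1$desired <- df2$DateTime[first_idx]
df1
```

Taking `which(...)[1]` rather than `min(which(...))` is a deliberate choice here: a threshold that is never exceeded then produces NA in the new column instead of a warning.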

Answer 1: (score: 0)

library(tidyverse)
df1 = data.frame("id" = c("A","B","C","D","E"), "dev" = c(213.5, 225.1, 198.9, 201.0, 266.8))

df2 = data.frame("DateTime" = seq(
  from = as.POSIXct("1986-1-1 0:00"),
  to = as.POSIXct("1986-1-2 23:00"), 
  by = "hour"), 
  "cum_dd" = seq(from = 185, to = 295, by = 2.3)) 

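df2 %>% 
  crossing(df1) %>%                     # all combinations of df2 and df1 rows
  filter(cum_dd > dev) %>%              # keep rows where cum_dd exceeds dev
  arrange(DateTime, desc(cum_dd)) %>%   # earliest DateTime first
  group_by(id) %>% 
  distinct(id, .keep_all = TRUE)        # keep the first (earliest) row per id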
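
Both answers build every combination of df1 and df2 rows before filtering. If cum_dd is sorted in non-decreasing order, as it is in the example data, the cross join could possibly be avoided with base R's findInterval(). The sketch below is not taken from either answer and relies on that sortedness assumption; the "UTC" time zone and the name idx are likewise only illustrative.

```r
# Example data from the question (tz = "UTC" is an assumption for reproducibility)
df1 <- data.frame(id = c("A", "B", "C", "D", "E"),
                  dev = c(213.5, 225.1, 198.9, 201.0, 266.8))

df2 <- data.frame(
  DateTime = seq(from = as.POSIXct("1986-01-01 00:00", tz = "UTC"),
                 to   = as.POSIXct("1986-01-02 23:00", tz = "UTC"),
                 by   = "hour"),
  cum_dd   = seq(from = 185, to = 295, by = 2.3))

# findInterval() returns, for each threshold, how many cum_dd values are
# less than or equal to it (cum_dd must be sorted in non-decreasing order).
# Adding 1 therefore gives the first row whose cum_dd is strictly greater.
idx <- findInterval(df1$dev, df2$cum_dd) + 1

# An index past the last row of df2 yields NA, i.e. "never exceeded"
df1$desired <- df2$DateTime[idx]
df1
```

Note that a threshold that is never exceeded shows up as NA here, whereas the filter() step above drops that id from the result entirely.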