Create a new column in a data frame based on two conditions and a range between two dates/times in another data frame

Time: 2020-08-03 04:03:01

Tags: r datetime mutate posixct

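I have two data frames, df1 and df2, and I am trying to create a new column (number) in df1 based on df2$number, so that I can group by it and run some summary calculations on df1$rate. However, this new column (df1$number) also has to be created from a match between the values in df1$person and df2$person, and from a range of timestamps in df1$timestamp bounded by a start and an end value in df2 (i.e., df2$start and df2$end), because each person has multiple df2$number values and each df2$end value is unique.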

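This is similar to "Take dates from one dataframe and filter data in another dataframe". However, I have multiple start and end values per person, identified by df2$number, and rather than just filtering I want to create a new column that identifies every row meeting the conditions above.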

df1

df1 <- data.frame(person = c(rep('A', 13), rep('B', 20)), timestamp = as.POSIXct(c('2020-01-03 14:19:59','2020-01-03 14:20:00','2020-01-03 14:20:01','2020-01-03 14:20:02','2020-01-03 14:20:03','2020-01-03 14:20:04','2020-01-03 14:20:05','2020-01-03 14:20:06', '2020-01-03 15:58:00', '2020-01-03 15:58:01', '2020-01-03 15:58:02', '2020-01-03 15:58:03', '2020-01-03 15:58:04', '2020-01-03 14:19:58','2020-01-03 14:19:59', '2020-01-03 14:20:00', '2020-01-03 14:20:01', '2020-01-03 14:20:02', '2020-01-03 14:20:03', '2020-01-03 14:20:04', '2020-01-03 14:20:05', '2020-01-03 14:20:06', '2020-01-03 14:20:07', '2020-01-03 14:20:08', '2020-01-03 15:57:59', '2020-01-03 15:58:00', '2020-01-03 15:58:01', '2020-01-03 15:58:02', '2020-01-03 15:58:03', '2020-01-03 15:58:04', '2020-01-03 15:58:05', '2020-01-03 15:58:06', '2020-01-03 15:58:07')), rate = c(150, 151, 152, 152, 153, 153, 154, 154, 145, 146, 145, 145, 146, 160, 160, 161, 161, 161, 162, 162, 162, 162, 162, 162, 135, 135, 134, 134, 134, 133, 133, 133, 134) )

df2

df2 <- data.frame(person = c('A', 'B', 'A', 'B'), number = as.integer(c(1, 1, 2, 2)), start = as.POSIXct(c('2020-01-03 14:20:00', '2020-01-03 14:20:00', '2020-01-03 15:58:00', '2020-01-03 15:58:00')),  end = as.POSIXct(c('2020-01-03 14:20:04', '2020-01-03 14:20:07', '2020-01-03 15:58:03', '2020-01-03 15:58:05'))) 

Ideally, I would like the resulting data frame to look like this:

df3 <- data.frame(person = c(rep('A', 13), rep('B', 20)), timestamp = as.POSIXct(c('2020-01-03 14:19:59','2020-01-03 14:20:00','2020-01-03 14:20:01','2020-01-03 14:20:02','2020-01-03 14:20:03','2020-01-03 14:20:04','2020-01-03 14:20:05','2020-01-03 14:20:06', '2020-01-03 15:58:00', '2020-01-03 15:58:01', '2020-01-03 15:58:02', '2020-01-03 15:58:03', '2020-01-03 15:58:04', '2020-01-03 14:19:58','2020-01-03 14:19:59', '2020-01-03 14:20:00', '2020-01-03 14:20:01', '2020-01-03 14:20:02', '2020-01-03 14:20:03', '2020-01-03 14:20:04', '2020-01-03 14:20:05', '2020-01-03 14:20:06', '2020-01-03 14:20:07', '2020-01-03 14:20:08', '2020-01-03 15:57:59', '2020-01-03 15:58:00', '2020-01-03 15:58:01', '2020-01-03 15:58:02', '2020-01-03 15:58:03', '2020-01-03 15:58:04', '2020-01-03 15:58:05', '2020-01-03 15:58:06', '2020-01-03 15:58:07')), rate = c(150, 151, 152, 152, 153, 153, 154, 154, 145, 146, 145, 145, 146, 160, 160, 161, 161, 161, 162, 162, 162, 162, 162, 162, 135, 135, 134, 134, 134, 133, 133, 133, 134), number = c(NA, 1, 1, 1, 1, 1, NA, NA, 2, 2, 2, 2, NA, NA, NA, 1, 1, 1, 1, 1, 1, 1, 1, NA, NA, 2, 2, 2, 2, 2, 2, NA, NA ) )

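In reality df1 has more than 800,000 rows, and df2 holds between 1 and 11 intervals for each of 10 people (df2$person), so I am looking for an efficient way to do this.

Please let me know if you need clarification or more information.

Thanks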
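The matching rule described above (same person, timestamp between start and end inclusive) can be sketched in base R as follows. This is only an illustrative sketch, not an answer from the original thread: the helper name `label_intervals`, the helper `as_time`, and the small stand-in data frames `obs` and `intervals` are all hypothetical, and it assumes the intervals for a given person do not overlap (if they did, the later row of the interval table would win). It loops over the rows of the small interval table and does one vectorized comparison per row, which should stay manageable when that table has on the order of a hundred rows.

```r
# Hypothetical helper (not from the original post): label each row of `obs`
# with the `number` of the interval in `intervals` that contains its timestamp.
# Interval endpoints are treated as inclusive; unmatched rows stay NA.
label_intervals <- function(obs, intervals) {
  obs$number <- NA_integer_
  obs_person <- as.character(obs$person)  # guard against factor columns
  for (i in seq_len(nrow(intervals))) {
    hit <- obs_person == as.character(intervals$person[i]) &
      obs$timestamp >= intervals$start[i] &
      obs$timestamp <= intervals$end[i]
    obs$number[which(hit)] <- intervals$number[i]
  }
  obs
}

# Small stand-in data with the same columns as df1 and df2 (assumed layout).
as_time <- function(x) as.POSIXct(x, tz = "UTC")

obs <- data.frame(
  person = c("A", "A", "A", "A", "B", "B", "B"),
  timestamp = as_time(c("2020-01-03 14:19:59", "2020-01-03 14:20:00",
                        "2020-01-03 14:20:04", "2020-01-03 14:20:05",
                        "2020-01-03 14:20:05", "2020-01-03 15:58:00",
                        "2020-01-03 15:58:06")),
  rate = c(150, 151, 153, 154, 162, 135, 133),
  stringsAsFactors = FALSE
)

intervals <- data.frame(
  person = c("A", "B", "B"),
  number = c(1L, 1L, 2L),
  start = as_time(c("2020-01-03 14:20:00", "2020-01-03 14:20:00",
                    "2020-01-03 15:58:00")),
  end = as_time(c("2020-01-03 14:20:04", "2020-01-03 14:20:07",
                  "2020-01-03 15:58:05")),
  stringsAsFactors = FALSE
)

labelled <- label_intervals(obs, intervals)

# Summarise rate per person and interval; rows with number == NA are dropped
# by aggregate()'s default na.action.
rate_summary <- aggregate(rate ~ person + number, data = labelled, FUN = mean)
```

With the data from the question, the equivalent call would be `label_intervals(df1, df2)`. For much larger interval tables, a non-equi join (for example in the data.table package) may be worth considering instead of a loop.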

0 answers:

No answers