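I have a data frame called "diff2" that contains two different time-point columns ("original" and "time_point"), the difference between the two time points in the same row (in hours), and an ID corresponding to "original". Here is a sample of the data frame: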
   diff            original          time_point ID
32  130 2012-12-16 04:59:32 2012-12-21 14:57:04  5
41  106 2012-12-16 06:01:02 2012-12-20 15:57:14  6
42  107 2012-12-16 06:01:02 2012-12-20 16:56:59  6
43  108 2012-12-16 06:01:02 2012-12-20 17:56:49  6
44  129 2012-12-16 06:01:02 2012-12-21 14:57:04  6
45  130 2012-12-16 06:01:02 2012-12-21 15:56:54  6
49  104 2012-12-16 06:59:52 2012-12-20 14:59:29  7
50  105 2012-12-16 06:59:52 2012-12-20 15:57:14  7
51  106 2012-12-16 06:59:52 2012-12-20 16:56:59  7
52  107 2012-12-16 06:59:52 2012-12-20 17:56:49  7
53  108 2012-12-16 06:59:52 2012-12-20 18:57:24  7
54  109 2012-12-16 06:59:52 2012-12-20 19:56:59  7
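Many of the dates in "original" share the same dates in "time_point". For example, the "time_point" date 2012-12-20 15:57:14 occurs for both 2012-12-16 06:01:02 (ID #6) and 2012-12-16 06:59:52 (ID #7) in "original". First, I need to find the dates in "time_point" that are shared by more than one "original". Then, for each shared "time_point" date, I need to identify the earliest "original" date associated with it. Finally, that shared "time_point" date needs to be removed from every other "original" it is associated with. The result I expect is the following data frame: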
   diff            original          time_point ID
32  130 2012-12-16 04:59:32 2012-12-21 14:57:04  5
41  106 2012-12-16 06:01:02 2012-12-20 15:57:14  6
42  107 2012-12-16 06:01:02 2012-12-20 16:56:59  6
43  108 2012-12-16 06:01:02 2012-12-20 17:56:49  6
44  129 2012-12-16 06:01:02 2012-12-21 14:57:04  6
45  130 2012-12-16 06:01:02 2012-12-21 15:56:54  6
49  104 2012-12-16 06:59:52 2012-12-20 14:59:29  7
53  108 2012-12-16 06:59:52 2012-12-20 18:57:24  7
54  109 2012-12-16 06:59:52 2012-12-20 19:56:59  7
I don't know how to approach this other than with a loop that compares IDs and checks whether they share a "time_point" date.
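For reproducibility, the sample above can be rebuilt roughly as follows, together with the first step (listing the shared "time_point" values). This is only a sketch: it assumes both date columns are stored as character strings, and `counts` and `shared_time_points` are names made up for illustration.

```r
# Rebuild the sample shown above; row names mirror the printed ones.
# Assumption: both date columns are plain character strings.
diff2 <- data.frame(
  diff = c(130, 106, 107, 108, 129, 130, 104, 105, 106, 107, 108, 109),
  original = c("2012-12-16 04:59:32",
               rep("2012-12-16 06:01:02", 5),
               rep("2012-12-16 06:59:52", 6)),
  time_point = c("2012-12-21 14:57:04",
                 "2012-12-20 15:57:14", "2012-12-20 16:56:59",
                 "2012-12-20 17:56:49", "2012-12-21 14:57:04",
                 "2012-12-21 15:56:54",
                 "2012-12-20 14:59:29", "2012-12-20 15:57:14",
                 "2012-12-20 16:56:59", "2012-12-20 17:56:49",
                 "2012-12-20 18:57:24", "2012-12-20 19:56:59"),
  ID = c(5, rep(6, 5), rep(7, 6)),
  row.names = as.character(c(32, 41:45, 49:54)),
  stringsAsFactors = FALSE
)

# Count how many rows carry each time_point. Each (original, time_point)
# pair occurs once in this sample, so a count above 1 means the time_point
# is shared by several "original" values.
counts <- table(diff2$time_point)
shared_time_points <- names(counts[counts > 1])
```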
Answer 0 (score: 0)
library(dplyr)
diff2 %>%
  group_by(time_point) %>%
  # rows whose time_point occurs only once form groups of size one
  # and pass through the steps below unchanged
  arrange(time_point, original) %>% # put the earliest original in the first row of each time_point
  slice(1) %>% # keep only that first row of each time_point group
  ungroup() %>%
  arrange(original, time_point) # order the result by original again
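The rule from the question (for each "time_point", keep only the row with the earliest "original") can also be sketched in base R with `ave()`. The data frame `toy` below is a hypothetical four-row subset of the sample (rows 32, 41, 44 and 50), and the sketch assumes the dates are character strings that `as.POSIXct()` can parse.

```r
# Hypothetical toy subset of the sample (rows 32, 41, 44 and 50)
toy <- data.frame(
  original   = c("2012-12-16 04:59:32", "2012-12-16 06:01:02",
                 "2012-12-16 06:01:02", "2012-12-16 06:59:52"),
  time_point = c("2012-12-21 14:57:04", "2012-12-20 15:57:14",
                 "2012-12-21 14:57:04", "2012-12-20 15:57:14"),
  ID = c(5, 6, 6, 7),
  row.names = c("32", "41", "44", "50"),
  stringsAsFactors = FALSE
)

# Convert "original" to seconds so that min() works on plain numbers
original_secs <- as.numeric(as.POSIXct(toy$original, tz = "UTC"))

# Earliest "original" within each time_point, repeated for every row of the group
earliest_secs <- ave(original_secs, toy$time_point, FUN = min)

# Keep only the rows that hold the earliest "original" for their time_point
result <- toy[original_secs == earliest_secs, ]
```

On this toy subset, rows 44 and 50 are dropped because an earlier "original" already claims their "time_point". Rows tied on the earliest "original" would all be kept.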
Answer 1 (score: 0)
A function-based approach (assuming your data is of class data.frame and is stored in a variable named data):
## Find the row indices of the duplicated time points
duplicated_time_points <- which(duplicated(data$time_point))

## Find the earliest "original" among the rows that share a time point
find.earliest.original <- function(time.point.duplicate, data) {
  ## Extract the originals that share this time point
  originals <- data$original[data$time_point == data$time_point[time.point.duplicate]]
  ## Return the earliest one as a string (dates in this format sort chronologically)
  min(format(originals, format = "%Y-%m-%d %H:%M:%S"))
}

## Apply this function to each duplicated time point
early_originals <- sapply(duplicated_time_points, find.earliest.original, data)

## Find the rows that share a time point but do not hold the earliest original
remove.not.earliest.original <- function(time.point.duplicate, data) {
  ## Row indices of the rows sharing this duplicated time point
  sub_data <- which(data$time_point == data$time_point[time.point.duplicate])
  ## Format the originals the same way as the value they are compared with
  sub_originals <- format(data$original[sub_data], format = "%Y-%m-%d %H:%M:%S")
  ## Keep the indices whose original is not the earliest one
  sub_data[sub_originals != find.earliest.original(time.point.duplicate, data)]
}

## Apply this function to each duplicated time point; lapply() plus unlist()
## always yields a plain index vector, whereas sapply() may return a list
rows_to_remove <- unique(unlist(lapply(duplicated_time_points, remove.not.earliest.original, data)))

## Remove the rows; an empty negative index would drop every row, hence the guard
if (length(rows_to_remove) > 0) data <- data[-rows_to_remove, ]
Note that the early_originals variable is not used, but it can be handy for checking what is going on.
This should result in:
    X diff            original          time_point ID
1  32  130 2012-12-16 04:59:32 2012-12-21 14:57:04  5
2  41  106 2012-12-16 06:01:02 2012-12-20 15:57:14  6
3  42  107 2012-12-16 06:01:02 2012-12-20 16:56:59  6
4  43  108 2012-12-16 06:01:02 2012-12-20 17:56:49  6
6  45  130 2012-12-16 06:01:02 2012-12-21 15:56:54  6
7  49  104 2012-12-16 06:59:52 2012-12-20 14:59:29  7
11 53  108 2012-12-16 06:59:52 2012-12-20 18:57:24  7
12 54  109 2012-12-16 06:59:52 2012-12-20 19:56:59  7
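This assumes you do want to remove row 44 (its "time_point" is shared with the earlier "original" of row 32), which appears to have been left in the example above by mistake.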
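One way to sanity-check a result like this is to confirm that no "time_point" is left on more than one row. The helper `has_unique_time_points` and the data frames `before` and `after` below are hypothetical names used only for illustration, and the check assumes each (original, time_point) pair occurred only once in the input.

```r
# Hypothetical helper: TRUE when no time_point appears on more than one row
has_unique_time_points <- function(df) {
  anyDuplicated(df$time_point) == 0
}

# Two rows sharing one time_point, modelled on rows 32 and 44 of the sample
before <- data.frame(
  original   = c("2012-12-16 04:59:32", "2012-12-16 06:01:02"),
  time_point = c("2012-12-21 14:57:04", "2012-12-21 14:57:04"),
  stringsAsFactors = FALSE
)

# After the clean-up, only the row with the earliest "original" remains
after <- before[1, ]

has_unique_time_points(before)  # the shared time_point is still present
has_unique_time_points(after)   # every time_point now occurs once
```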