Identifying dates shared by multiple IDs in R

Time: 2018-02-01 01:49:01

Tags: r

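I have a data frame named "diff2" with two columns of time points ("original" and "time_point"), the difference in hours between the two time points in each row, and the ID that corresponds to "original". Here is a sample of the data frame: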

 diff            original          time_point ID
32  130 2012-12-16 04:59:32 2012-12-21 14:57:04  5
41  106 2012-12-16 06:01:02 2012-12-20 15:57:14  6
42  107 2012-12-16 06:01:02 2012-12-20 16:56:59  6
43  108 2012-12-16 06:01:02 2012-12-20 17:56:49  6
44  129 2012-12-16 06:01:02 2012-12-21 14:57:04  6
45  130 2012-12-16 06:01:02 2012-12-21 15:56:54  6
49  104 2012-12-16 06:59:52 2012-12-20 14:59:29  7
50  105 2012-12-16 06:59:52 2012-12-20 15:57:14  7
51  106 2012-12-16 06:59:52 2012-12-20 16:56:59  7
52  107 2012-12-16 06:59:52 2012-12-20 17:56:49  7
53  108 2012-12-16 06:59:52 2012-12-20 18:57:24  7
54  109 2012-12-16 06:59:52 2012-12-20 19:56:59  7

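Many of the dates in "original" share the same dates in "time_point". For example, the "time_point" date 2012-12-20 15:57:14 appears for both 2012-12-16 06:01:02 (ID 6) and 2012-12-16 06:59:52 (ID 7) in "original". I first need to find the dates in "time_point" that are shared by more than one "original". Then, for each shared "time_point" date, I need to determine the earliest "original" date associated with it. Finally, the shared "time_point" date needs to be removed from every other "original" it is associated with. My desired result is the following data frame: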

 diff            original          time_point ID
32  130 2012-12-16 04:59:32 2012-12-21 14:57:04  5
41  106 2012-12-16 06:01:02 2012-12-20 15:57:14  6
42  107 2012-12-16 06:01:02 2012-12-20 16:56:59  6
43  108 2012-12-16 06:01:02 2012-12-20 17:56:49  6
44  129 2012-12-16 06:01:02 2012-12-21 14:57:04  6
45  130 2012-12-16 06:01:02 2012-12-21 15:56:54  6
49  104 2012-12-16 06:59:52 2012-12-20 14:59:29  7
53  108 2012-12-16 06:59:52 2012-12-20 18:57:24  7
54  109 2012-12-16 06:59:52 2012-12-20 19:56:59  7

I don't know how to solve this other than with a loop that compares the IDs and checks whether they have "time_point" dates in common.

2 Answers:

Answer 0: (score: 0)

library(dplyr)

diff2 %>% group_by(time_point) %>%
  mutate(counts = n()) %>%  # count the occurrences of each time_point
  filter(counts > 1) %>% # remove rows for singular time_points
  arrange(time_point, original) %>% # put earliest original value in first position row for each time_point
  slice(1) %>% # take only the top row of each time_point group  
  ungroup()
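
Note that the pipeline above returns only one row per shared time_point and drops every row whose time_point occurs once, so it does not reproduce the full data frame the question asks for. A minimal sketch of a variant that keeps all rows and removes only the later "original" for each shared time_point might look like the following. The diff2 built here is a hypothetical reconstruction of the question's snippet, and parsing the columns as POSIXct in UTC is an assumption.

```r
library(dplyr)

# Hypothetical reconstruction of the sample data shown in the question
diff2 <- data.frame(
  diff = c(130, 106, 107, 108, 129, 130, 104, 105, 106, 107, 108, 109),
  original = as.POSIXct(c(
    "2012-12-16 04:59:32",
    rep("2012-12-16 06:01:02", 5),
    rep("2012-12-16 06:59:52", 6)), tz = "UTC"),
  time_point = as.POSIXct(c(
    "2012-12-21 14:57:04",
    "2012-12-20 15:57:14", "2012-12-20 16:56:59", "2012-12-20 17:56:49",
    "2012-12-21 14:57:04", "2012-12-21 15:56:54",
    "2012-12-20 14:59:29", "2012-12-20 15:57:14", "2012-12-20 16:56:59",
    "2012-12-20 17:56:49", "2012-12-20 18:57:24", "2012-12-20 19:56:59"),
    tz = "UTC"),
  ID = c(5, rep(6, 5), rep(7, 6))
)

result <- diff2 %>%
  group_by(time_point) %>%               # one group per time_point value
  filter(original == min(original)) %>%  # keep only the earliest original in each group
  ungroup()
```

Groups with a single row pass the filter unchanged, so no separate counting step is needed. On this sample the row with diff 129 for ID 6 is also removed, because its time_point is shared with ID 5.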

Answer 1: (score: 0)

A function-based approach (assuming your data is of class data.frame):

## Finding the duplicated time points
duplicated_time_points <- which(duplicated(data$time_point))

## Finding the earliest "original" for multiple "time_points"
find.earliest.original <- function(time.point.duplicate, data) {

    ## Extract the originals
    originals <- data$original[which(data$time_point == data$time_point[time.point.duplicate])]

    ## Finding the earliest original
    return(min(format(originals, format = "%Y-%m-%d %H:%M:%S")))
}

## Applying this function to each duplicated dates
early_originals <- sapply(duplicated_time_points, find.earliest.original, data)

## Removing the time points that do not correspond to the earliest original from the data
remove.not.earliest.original <- function(time.point.duplicate, data) {
    ## Selecting the subdata with the duplicated time_points
    sub_data <- which(data$time_point == data$time_point[time.point.duplicate])

    ## Selecting the rows in the subdata that are not the earliest original
    return(sub_data[which(data$original[sub_data] != find.earliest.original(time.point.duplicate, data))])
}

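## Applying this function to each duplicated time point; unlist() flattens the
## result, which sapply() would return as a list when the lengths differ
rows_to_remove <- unique(unlist(lapply(duplicated_time_points, remove.not.earliest.original, data)))

## Removing the rows (skipped when there is nothing to remove, because
## data[-integer(0), ] would return zero rows)
if (length(rows_to_remove) > 0) data <- data[-rows_to_remove, ]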

Note that the early_originals variable is not used, but it can be helpful for checking what is going on.

This should result in:

    X diff            original          time_point ID
1  32  130 2012-12-16 04:59:32 2012-12-21 14:57:04  5
2  41  106 2012-12-16 06:01:02 2012-12-20 15:57:14  6
3  42  107 2012-12-16 06:01:02 2012-12-20 16:56:59  6
4  43  108 2012-12-16 06:01:02 2012-12-20 17:56:49  6
6  45  130 2012-12-16 06:01:02 2012-12-21 15:56:54  6
7  49  104 2012-12-16 06:59:52 2012-12-20 14:59:29  7
11 53  108 2012-12-16 06:59:52 2012-12-20 18:57:24  7
12 54  109 2012-12-16 06:59:52 2012-12-20 19:56:59  7

This assumes that you do want to remove row 44, and that its presence in the expected output above is an oversight.
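
For comparison, the same rule (keep, for each time_point, only the rows with the earliest "original") can also be sketched in base R with ave(). The helper name keep_earliest and the small example data frame are hypothetical, and parsing the timestamps in UTC is an assumption.

```r
# Hypothetical helper: keep, for each time_point, only the rows whose
# "original" is the earliest one associated with that time_point
keep_earliest <- function(df) {
  # Convert to numeric seconds so that ave() and min() work on plain numbers
  orig_num <- as.numeric(as.POSIXct(df$original, tz = "UTC"))
  # Earliest original per time_point, repeated for every row of the group
  earliest <- ave(orig_num, as.character(df$time_point), FUN = min)
  df[orig_num == earliest, , drop = FALSE]
}

# Small illustrative data frame (not the full data from the question)
example <- data.frame(
  original = c("2012-12-16 04:59:32", "2012-12-16 06:01:02",
               "2012-12-16 06:01:02", "2012-12-16 06:59:52"),
  time_point = c("2012-12-21 14:57:04", "2012-12-21 14:57:04",
                 "2012-12-20 15:57:14", "2012-12-20 15:57:14"),
  ID = c(5, 6, 6, 7),
  stringsAsFactors = FALSE
)

kept <- keep_earliest(example)
```

This avoids the explicit index bookkeeping, since rows with a unique time_point are trivially the earliest in their own group and are kept.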