R - Select the first entry (date and time) of each day for each subject ID

Time: 2017-08-15 02:49:44

Tags: r date time

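I am trying to reduce multiple entries per day to a single one by selecting, for each subject ID, the first registered entry of each day.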

I am working with a very large data set, so here is just a snapshot of my data structure:

 df <- data.frame(Contact.ID, Date.Time, Age, Gender, Attendance)

Contact.ID       Date.Time       Age   Gender   Attendance   
1   A       2012-07-06 18:54:48   37    Male         30    
2   A       2012-07-06 20:50:18   37    Male         30    
3   A       2012-08-14 20:18:44   37    Male         30   
4   B       2012-03-15 16:58:15   27  Female         40    
5   B       2012-04-18 10:57:02   27  Female         40    
6   B       2012-04-18 17:31:22   27  Female         40    
7   B       2012-04-18 18:37:00   27  Female         40    
8   C       2013-10-22 17:46:07   40    Male         5    
9   C       2013-10-27 11:21:00   40    Male         5    
10  D       2012-07-28 14:48:33   20  Female         12 

I have tried a few different things, for example:

t.first <- df[match(unique(df$Date.Time), df$Date.Time),]

setDT(df)[,.SD[which.max(df$Date.Time)],keyby=df$Contact.ID]

library(dplyr)
t.first <- ddply(df, "Date.Time", function(z) tail(z,1))

But none of them gives me the first entry of each day for each subject ID.

So what I need to end up with is a data set like this:

Contact.ID       Date.Time       Age   Gender   Attendance   
1   A       2012-07-06 18:54:48   37    Male         29    
2   A       2012-08-14 20:18:44   37    Male         29   
3   B       2012-03-15 16:58:15   27  Female         38    
4   B       2012-04-18 10:57:02   27  Female         38    
5   C       2013-10-22 17:46:07   40    Male         5    
6   C       2013-10-27 11:21:00   40    Male         5    
7   D       2012-07-28 14:48:33   20  Female         12 

I have been stuck on this problem for far too long, so I would appreciate any help.

2 answers:

Answer 0: (score: 2)

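A solution using dplyr and lubridate. We can convert Date.Time to a date-time class, create a new variable called Date, group by Contact.ID and Date, and then keep the record with the minimum Date.Time in each group. dt2 is the final output.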

library(dplyr)
library(lubridate)

dt2 <- dt %>%
  mutate(Date.Time = ymd_hms(Date.Time)) %>%
  mutate(Date = as.Date(Date.Time)) %>%
  group_by(Contact.ID, Date) %>%
  filter(Date.Time == min(Date.Time)) %>%
  ungroup() %>%
  select(-Date)

dt2
# A tibble: 7 x 5
  Contact.ID           Date.Time   Age Gender Attendance
       <chr>              <dttm> <int>  <chr>      <int>
1          A 2012-07-06 18:54:48    37   Male         30
2          A 2012-08-14 20:18:44    37   Male         30
3          B 2012-03-15 16:58:15    27 Female         40
4          B 2012-04-18 10:57:02    27 Female         40
5          C 2013-10-22 17:46:07    40   Male          5
6          C 2013-10-27 11:21:00    40   Male          5
7          D 2012-07-28 14:48:33    20 Female         12

Data preparation

dt <- read.table(text = "'Contact.ID' 'Date.Time' Age Gender Attendance
1 A '2012-07-06 18:54:48' 37 Male 30
                 2 A '2012-07-06 20:50:18' 37 Male 30
                 3 A '2012-08-14 20:18:44' 37 Male 30
                 4 B '2012-03-15 16:58:15' 27 Female 40
                 5 B '2012-04-18 10:57:02' 27 Female 40
                 6 B '2012-04-18 17:31:22' 27 Female 40
                 7 B '2012-04-18 18:37:00' 27 Female 40
                 8 C '2013-10-22 17:46:07' 40 Male 5
                 9 C '2013-10-27 11:21:00' 40 Male 5
                 10 D '2012-07-28 14:48:33' 20 Female 12",
                 header = TRUE, stringsAsFactors = FALSE)
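
For readers who would rather avoid extra packages, the same idea (group by Contact.ID and calendar day, keep the earliest record) can be sketched in base R. This is not the method from the answer above but a swapped-in technique: sort by ID and timestamp, then drop every row whose (ID, day) pair has already been seen. The names day and first_rows are illustrative only, and the UTC time zone is an assumption made so that the day boundaries are unambiguous.

```r
# Sample data from the question, rebuilt so this block runs on its own
dt <- data.frame(
  Contact.ID = c("A", "A", "A", "B", "B", "B", "B", "C", "C", "D"),
  Date.Time  = c("2012-07-06 18:54:48", "2012-07-06 20:50:18",
                 "2012-08-14 20:18:44", "2012-03-15 16:58:15",
                 "2012-04-18 10:57:02", "2012-04-18 17:31:22",
                 "2012-04-18 18:37:00", "2013-10-22 17:46:07",
                 "2013-10-27 11:21:00", "2012-07-28 14:48:33"),
  Age        = c(37, 37, 37, 27, 27, 27, 27, 40, 40, 20),
  Gender     = c("Male", "Male", "Male", "Female", "Female", "Female",
                 "Female", "Male", "Male", "Female"),
  Attendance = c(30, 30, 30, 40, 40, 40, 40, 5, 5, 12),
  stringsAsFactors = FALSE
)

# Parse the timestamps (assumption: all times are in UTC)
dt$Date.Time <- as.POSIXct(dt$Date.Time, format = "%Y-%m-%d %H:%M:%S", tz = "UTC")

# Sort so that the earliest entry of each ID/day comes first
dt <- dt[order(dt$Contact.ID, dt$Date.Time), ]

# Calendar day of each entry
day <- as.Date(dt$Date.Time, tz = "UTC")

# Keep only the first row seen for each (Contact.ID, day) pair
first_rows <- dt[!duplicated(data.frame(dt$Contact.ID, day)), ]
first_rows
```

Because duplicated() marks every repeat after the first occurrence, this keeps exactly one row per subject and day, even when two entries share a timestamp.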

Answer 1: (score: 2)

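Another option uses dplyr::slice(). It returns exactly one row per group, so it prevents duplicates when several entries in a group share the same earliest Date.Time.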

library(dplyr)
library(lubridate)

dt2 <- dt %>%
  mutate(Date.Time = ymd_hms(Date.Time)) %>%
  mutate(Date = as.Date(Date.Time)) %>%
  group_by(Contact.ID, Date) %>%
  arrange(Date.Time) %>%
  slice(1) %>%
  ungroup() %>%
  select(-Date)
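
To see where the two approaches differ, here is a minimal sketch on made-up data in which two entries share the same earliest timestamp. The data frame ties and the counters n_filter and n_slice are hypothetical and do not come from the question. The timestamps are built directly as POSIXct in UTC (an assumption), so lubridate is not needed here.

```r
library(dplyr)

# Hypothetical data: the first two rows have an identical, earliest timestamp
ties <- data.frame(
  Contact.ID = c("A", "A", "A"),
  Date.Time  = as.POSIXct(c("2012-07-06 18:54:48",
                            "2012-07-06 18:54:48",
                            "2012-07-06 20:50:18"), tz = "UTC"),
  stringsAsFactors = FALSE
)

# Shared grouping step: one group per subject and calendar day
by_day <- ties %>%
  mutate(Date = as.Date(Date.Time)) %>%
  group_by(Contact.ID, Date)

# filter() keeps every row equal to the group minimum, so ties survive
n_filter <- by_day %>%
  filter(Date.Time == min(Date.Time)) %>%
  nrow()

# arrange() + slice(1) keeps a single row per group
n_slice <- by_day %>%
  arrange(Date.Time) %>%
  slice(1) %>%
  nrow()

c(filter = n_filter, slice = n_slice)
```

On the sample data from the question no two entries share a timestamp, so both answers return the same seven rows. The difference only matters when exact duplicates can occur.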