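I'm trying to reduce multiple entries per day to one by selecting, for each subject ID, the first registered entry of each day.
I'm working with a very large data set, so here is just a snapshot of my data structure: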
df <- c(Contact.ID, Date.Time, Age, Gender, Attendance)
Contact.ID Date.Time Age Gender Attendance
1 A 2012-07-06 18:54:48 37 Male 30
2 A 2012-07-06 20:50:18 37 Male 30
3 A 2012-08-14 20:18:44 37 Male 30
4 B 2012-03-15 16:58:15 27 Female 40
5 B 2012-04-18 10:57:02 27 Female 40
6 B 2012-04-18 17:31:22 27 Female 40
7 B 2012-04-18 18:37:00 27 Female 40
8 C 2013-10-22 17:46:07 40 Male 5
9 C 2013-10-27 11:21:00 40 Male 5
10 D 2012-07-28 14:48:33 20 Female 12
I've tried a few different things, for example:
t.first <- df[match(unique(df$Date.Time), df$Date.Time),]
setDT(df)[,.SD[which.max(df$Date.Time)],keyby=df$Contact.ID]
library(dplyr)
t.first <- ddply(df, "Date.Time", function(z) tail(z,1))
But none of them gave me the first entry of each day for each subject ID.
So what I need to end up with is a data set like this:
Contact.ID Date.Time Age Gender Attendance
1 A 2012-07-06 18:54:48 37 Male 30
2 A 2012-08-14 20:18:44 37 Male 30
3 B 2012-03-15 16:58:15 27 Female 40
4 B 2012-04-18 10:57:02 27 Female 40
5 C 2013-10-22 17:46:07 40 Male 5
6 C 2013-10-27 11:21:00 40 Male 5
7 D 2012-07-28 14:48:33 20 Female 12
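If anyone can help I'd appreciate it; I've been stuck on this problem for far too long.
Answer 0 (score: 2)
A solution using dplyr and lubridate. We can convert Date.Time to a date-time class, create a new variable called Date, group by Contact.ID and Date, and then keep the record with the minimum Date.Time in each group. dt2 is the final output; dt is the input, created in the data block at the end of this answer.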
library(dplyr)
library(lubridate)
dt2 <- dt %>%
mutate(Date.Time = ymd_hms(Date.Time)) %>%
mutate(Date = as.Date(Date.Time)) %>%
group_by(Contact.ID, Date) %>%
filter(Date.Time == min(Date.Time)) %>%
ungroup() %>%
select(-Date)
dt2
# A tibble: 7 x 5
Contact.ID Date.Time Age Gender Attendance
<chr> <dttm> <int> <chr> <int>
1 A 2012-07-06 18:54:48 37 Male 30
2 A 2012-08-14 20:18:44 37 Male 30
3 B 2012-03-15 16:58:15 27 Female 40
4 B 2012-04-18 10:57:02 27 Female 40
5 C 2013-10-22 17:46:07 40 Male 5
6 C 2013-10-27 11:21:00 40 Male 5
7 D 2012-07-28 14:48:33 20 Female 12
# Data: run this first to create dt, the input used above
dt <- read.table(text = "'Contact.ID' 'Date.Time' Age Gender Attendance
1 A '2012-07-06 18:54:48' 37 Male 30
2 A '2012-07-06 20:50:18' 37 Male 30
3 A '2012-08-14 20:18:44' 37 Male 30
4 B '2012-03-15 16:58:15' 27 Female 40
5 B '2012-04-18 10:57:02' 27 Female 40
6 B '2012-04-18 17:31:22' 27 Female 40
7 B '2012-04-18 18:37:00' 27 Female 40
8 C '2013-10-22 17:46:07' 40 Male 5
9 C '2013-10-27 11:21:00' 40 Male 5
10 D '2012-07-28 14:48:33' 20 Female 12",
header = TRUE, stringsAsFactors = FALSE)
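For readers who cannot install packages, the same idea (sort by ID and time, then keep the first row of each ID/day pair) can be sketched in base R. This is only an illustration under stated assumptions: dt_small, ord, keep and first_per_day are made-up names, the data is a cut-down copy of the question's sample, and the timestamps are assumed to be in UTC.

```r
# Hypothetical miniature of the question's data (dt_small is an illustrative name)
dt_small <- data.frame(
  Contact.ID = c("A", "A", "A", "B", "B"),
  Date.Time  = c("2012-07-06 20:50:18", "2012-07-06 18:54:48",
                 "2012-08-14 20:18:44", "2012-04-18 17:31:22",
                 "2012-04-18 10:57:02"),
  stringsAsFactors = FALSE
)

# Parse the timestamps (UTC is an assumption) and derive the calendar day
dt_small$Date.Time <- as.POSIXct(dt_small$Date.Time, tz = "UTC")
dt_small$Date <- as.Date(dt_small$Date.Time)

# Sort by ID, then by time, so the earliest entry of each day comes first
ord <- dt_small[order(dt_small$Contact.ID, dt_small$Date.Time), ]

# duplicated() flags every row after the first one with the same ID/day pair
keep <- !duplicated(ord[, c("Contact.ID", "Date")])
first_per_day <- ord[keep, c("Contact.ID", "Date.Time")]

print(first_per_day)
```

Because duplicated() keeps only the first occurrence, this returns exactly one row per Contact.ID and day even when two entries share the same timestamp.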
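Answer 1 (score: 2)
Another option using dplyr::slice(). This avoids duplicates: if two entries on the same day share the same earliest Date.Time, filter(Date.Time == min(Date.Time)) keeps both, whereas slice(1) keeps exactly one row per group.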
library(dplyr)
library(lubridate)
dt2 <- dt %>%
mutate(Date.Time = ymd_hms(Date.Time)) %>%
mutate(Date = as.Date(Date.Time)) %>%
group_by(Contact.ID, Date) %>%
arrange(Date.Time) %>%
slice(1) %>%
ungroup() %>%
select(-Date)
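The difference between the two answers only shows up when timestamps tie, which the sample data does not contain. The sketch below uses a made-up data frame (ties, by_filter and by_slice are illustrative names, and the tied rows are invented for the demonstration, not taken from the question) to compare the two approaches.

```r
library(dplyr)

# Hypothetical data: the first two rows share the same earliest timestamp
ties <- data.frame(
  Contact.ID = c("A", "A", "A"),
  Date.Time  = as.POSIXct(c("2012-07-06 18:54:48",
                            "2012-07-06 18:54:48",
                            "2012-07-06 20:50:18"), tz = "UTC"),
  Attendance = c(30, 31, 30)
)
ties$Date <- as.Date(ties$Date.Time)

# filter() keeps every row whose timestamp equals the group minimum
by_filter <- ties %>%
  group_by(Contact.ID, Date) %>%
  filter(Date.Time == min(Date.Time)) %>%
  ungroup()

# slice(1) keeps a single row per group after sorting by time
by_slice <- ties %>%
  group_by(Contact.ID, Date) %>%
  arrange(Date.Time) %>%
  slice(1) %>%
  ungroup()

nrow(by_filter)  # both tied rows survive
nrow(by_slice)   # one row per Contact.ID and day
```

If tied rows can differ in their other columns, as in this invented example, it is worth deciding deliberately which one to keep, for instance by adding a tie-breaking column to arrange().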