I have two data frames named User and Master:
User = read.csv(text = "
Ticket,Vehicle,Created
A,7164,1/1/2017
B,7163,1/2/2017
C,7162,26/1/2017", header = TRUE)
Master = read.csv(text = "
Ticket,Vehicle,Created
E,7164,29/12/2016
F,7163,26/12/2017
G,7164,31/1/2017
R,7164,02/02/2017
H,7162,28/1/2017", header = TRUE)
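In the User data frame, I want to add one column for each Ticket value from Master that has the same Vehicle number and a Created date later than the Created date of the User row.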
Example:
**Output**
Ticket  Vehicle  Created    Ticket.1  Ticket.2
A       7164     1/1/2017   G         R
B       7163     1/2/2017   NA
C       7162     26/1/2017  H
So, for Vehicle 7164, Master has three entries, but only two of them are dated after 1/1/2017: G and R.
I tried the following code:
dfagg <- aggregate(Ticket ~ Vehicle + Created, Master, function(i) tail(i))
dfwide <- reshape(dfagg, timevar='Ticket', idvar=c('Vehicle'), direction="wide")
names(dfwide) <- gsub("Vehicle", "Ticket", names(dfwide))
However, this does not give me the Master tickets created after the User ticket's Created date for the same vehicle.
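Answer 0 (score: 2)
Note: I am assuming F's date is 26/12/2016 (not 26/12/2017); otherwise the expected output would be wrong, since F would then match B.
One way to achieve this is with the sqldf package.
First, convert your dates from character to Date: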
User$Created = as.Date(User$Created, format = "%d/%m/%Y")
Master$Created = as.Date(Master$Created, format = "%d/%m/%Y")
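The explicit format string matters because these dates are day-first. A small base R check of that point, using a made-up vector `x` rather than the data frames above:

```r
# Hypothetical day-first date strings, like those in the Created column
x <- c("1/2/2017", "26/1/2017")

# With an explicit day/month/year format, "1/2/2017" is read as 1 February
d <- as.Date(x, format = "%d/%m/%Y")
format(d, "%Y-%m-%d")

# Once converted, the values compare chronologically:
# 26 January comes before 1 February
d[2] < d[1]
```

Comparing the raw strings instead would order them character by character, which is why the conversion has to happen before the join.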
Then join:
library(sqldf)
Output <- sqldf("select u.Ticket, u.Vehicle, u.Created,
m.Ticket as Master_Ticket
from User u left join Master m
on (u.Vehicle = m.Vehicle and u.Created < m.Created)")
Output
#   Ticket Vehicle    Created Master_Ticket
# 1      A    7164 2017-01-01             G
# 2      A    7164 2017-01-01             R
# 3      B    7163 2017-02-01          <NA>
# 4      C    7162 2017-01-26             H
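For comparison, the same left join with a date condition can be sketched in base R without sqldf. This is only an illustration and not part of the original answer: the data is retyped here so the block is self-contained, F is assumed to be dated 26/12/2016 as noted above, and the names `cand`, `keep`, and `res` are made up.

```r
User <- data.frame(
  Ticket  = c("A", "B", "C"),
  Vehicle = c(7164, 7163, 7162),
  Created = as.Date(c("1/1/2017", "1/2/2017", "26/1/2017"), format = "%d/%m/%Y"),
  stringsAsFactors = FALSE
)
Master <- data.frame(
  Ticket  = c("E", "F", "G", "R", "H"),
  Vehicle = c(7164, 7163, 7164, 7164, 7162),
  Created = as.Date(c("29/12/2016", "26/12/2016", "31/1/2017",
                      "02/02/2017", "28/1/2017"), format = "%d/%m/%Y"),
  stringsAsFactors = FALSE
)

# Pair every User row with every Master row for the same vehicle;
# Master's clashing columns get the suffix ".m"
cand <- merge(User, Master, by = "Vehicle", suffixes = c("", ".m"))

# Keep only Master tickets created after the User ticket
keep <- cand[cand$Created < cand$Created.m, c("Ticket", "Ticket.m")]

# Left-join back so User tickets without a match get NA
res <- merge(User, keep, by = "Ticket", all.x = TRUE)
names(res)[names(res) == "Ticket.m"] <- "Master_Ticket"
res <- res[order(res$Ticket, res$Master_Ticket), ]
res
```

The two-step shape (equi-join, then filter, then join back) is needed because `merge()` only supports equality conditions, whereas the SQL `on` clause can express the date inequality directly.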
**Edit:** If you want only one row per User ticket, one way is to aggregate:
Output2 <- sqldf("select u.Ticket, u.Vehicle, u.Created,
group_concat(m.Ticket, ' ') as Master_Tickets
from User u left join Master m
on (u.Vehicle = m.Vehicle and u.Created < m.Created)
group by u.Ticket, u.Vehicle, u.Created")
Output2
#   Ticket Vehicle    Created Master_Tickets
# 1      A    7164 2017-01-01            G R
# 2      B    7163 2017-02-01           <NA>
# 3      C    7162 2017-01-26              H
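The aggregation step can also be sketched in base R. This is an illustrative equivalent, not the answer's method: the long join result is typed in by hand as `long`, and `collapse_matches` and `tickets` are made-up names.

```r
# Hand-typed copy of the long join result (one row per match)
long <- data.frame(
  Ticket        = c("A", "A", "B", "C"),
  Master_Ticket = c("G", "R", NA, "H"),
  stringsAsFactors = FALSE
)

# Collapse the non-missing matches of each User ticket into one string;
# a ticket with no matches stays NA
collapse_matches <- function(x) {
  x <- x[!is.na(x)]
  if (length(x) == 0) NA_character_ else paste(x, collapse = " ")
}

# One value per User ticket, named by ticket
tickets <- tapply(long$Master_Ticket, long$Ticket, collapse_matches)
tickets
```

Dropping the NA values before pasting avoids ending up with the literal string "NA" for tickets that have no match.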
If for some reason you absolutely need one column per match:
library(dplyr)
library(reshape2)
Output3 = Output %>%
  group_by(Ticket) %>%
  mutate(column_name = paste0('Ticket.', row_number())) %>%
  dcast(Ticket + Vehicle + Created ~ column_name, value.var = "Master_Ticket")
Output3
#   Ticket Vehicle    Created Ticket.1 Ticket.2
# 1      A    7164 2017-01-01        G        R
# 2      B    7163 2017-02-01     <NA>     <NA>
# 3      C    7162 2017-01-26        H     <NA>
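The key step in this reshaping is numbering the matches within each User ticket, which `row_number()` does after `group_by()`. As a rough base R sketch of the same idea, `ave()` can produce the per-group counter and `reshape()` can spread it into columns. The table `long` is typed in by hand to mirror `Output`, and `idx` and `wide` are illustrative names.

```r
# Hand-typed copy of the long join result (one row per match)
long <- data.frame(
  Ticket        = c("A", "A", "B", "C"),
  Master_Ticket = c("G", "R", NA, "H"),
  stringsAsFactors = FALSE
)

# Number the matches within each User ticket: 1, 2, ... per group
long$idx <- ave(seq_along(long$Ticket), long$Ticket, FUN = seq_along)

# Spread the numbered matches into one column per match
wide <- reshape(long, idvar = "Ticket", timevar = "idx",
                v.names = "Master_Ticket", direction = "wide", sep = ".")

# Rename Master_Ticket.1, Master_Ticket.2 to Ticket.1, Ticket.2
names(wide) <- sub("^Master_Ticket", "Ticket", names(wide))
wide
```

The number of `Ticket.n` columns follows the largest number of matches for any single User ticket, so it changes with the data.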
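Answer 1 (score: 1)
A dplyr solution, including the data correction for F pointed out by @Scarabee. lubridate is used for the date conversion. dplyr::rename() can be added to get more meaningful column names.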
library(lubridate)
User = read.csv(text = "
Ticket,Vehicle,Created
A,7164,1/1/2017
B,7163,1/2/2017
C,7162,26/1/2017", header = TRUE, stringsAsFactors=FALSE)
User$Created <- dmy(User$Created)
Master = read.csv(text = "
Ticket,Vehicle,Created
E,7164,29/12/2016
F,7163,26/12/2016
G,7164,31/1/2017
H,7162,28/1/2017", header = TRUE, stringsAsFactors=FALSE)
Master$Created <- dmy(Master$Created)
library(dplyr)
User %>%
  left_join(Master, by="Vehicle") %>%               # left join keeps every row from User
  mutate(Ticket_y = ifelse(Created.x < Created.y,   # apply date restriction
                           Ticket.y, NA)) %>%
  group_by(Ticket.x) %>%                            # group by User ticket
  arrange(desc(Ticket_y)) %>%                       # push NA values to end
  filter(row_number() == 1 ) %>%                    # keep only first row within group
  ungroup() %>%                                     # remove grouping
  select(Ticket.x, Created.x, Ticket_y) %>%         # keep columns of interest
  arrange(Ticket.x)                                 # sort
#   Ticket.x Created.x  Ticket_y
#   <chr>    <date>     <chr>
# 1 A        2017-01-01 G
# 2 B        2017-02-01 <NA>
# 3 C        2017-01-26 H