Subsetting data on multiple conditions

Time: 2017-02-04 21:58:22

Tags: r subset

I have two data frames named User and Master:

User = read.csv(text = "
Ticket,Vehicle,Created
A,7164,1/1/2017
B,7163,1/2/2017
C,7162,26/1/2017", header = TRUE) 

Master = read.csv(text = "
Ticket,Vehicle,Created
E,7164,29/12/2016
F,7163,26/12/2017
G,7164,31/1/2017
R,7164,02/02/2017
H,7162,28/1/2017", header = TRUE)  

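In the User data frame, I want to add columns holding the Ticket values from Master that have the same Vehicle number and were created after the Created date of the User row.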

Example:
**Output**

Ticket  Vehicle  Created    Ticket.1  Ticket.2
A       7164     1/1/2017   G         R
B       7163     1/2/2017   NA
C       7162     26/1/2017  H

So, for Vehicle 7164 there are three entries in Master, but only two of them were created after 1/1/2017, namely G and R.

I tried the following code:

dfagg <- aggregate(Ticket ~ Vehicle + Created, Master, function(i) tail(i))
dfwide <- reshape(dfagg, timevar='Ticket', idvar=c('Vehicle'), direction="wide")
names(dfwide) <- gsub("Vehicle", "Ticket", names(dfwide))

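However, this does not return the tickets created after each vehicle's Created date, so the result does not match what I expect.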

2 Answers:

Answer 0 (score: 2)

Note: I assume the date for F is 26/12/2016 (not 26/12/2017); otherwise your expected output would be wrong.
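
To see why the assumption matters, here is a minimal base R check using the dates from the question. The variable names are illustrative only.

```r
# User ticket B was created on 1 February 2017 (day/month/year format)
b_created <- as.Date("1/2/2017", format = "%d/%m/%Y")

# F's date as posted in the question vs. the corrected date assumed here
f_as_posted <- as.Date("26/12/2017", format = "%d/%m/%Y")
f_corrected <- as.Date("26/12/2016", format = "%d/%m/%Y")

f_as_posted > b_created  # TRUE: F would be matched to B
f_corrected > b_created  # FALSE: no match, so B gets NA as in the expected output
```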

One way to achieve this is with the sqldf package.

First, convert your dates from character to Date:

User$Created = as.Date(User$Created, format = "%d/%m/%Y")
Master$Created = as.Date(Master$Created, format = "%d/%m/%Y")
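
The conversion is needed because day/month/year strings do not sort chronologically. A small sketch with two Created values from the User data (variable names are illustrative):

```r
# The Created values as they arrive from read.csv(): day/month/year text
b_text <- "1/2/2017"   # User ticket B, 1 February 2017
c_text <- "26/1/2017"  # User ticket C, 26 January 2017

# Compared as text, "1..." sorts before "2...", so B looks earlier than C
b_text < c_text

# Compared as Date objects, B (1 February) is correctly later than C (26 January)
b_date <- as.Date(b_text, format = "%d/%m/%Y")
c_date <- as.Date(c_text, format = "%d/%m/%Y")
b_date < c_date  # FALSE
```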

Then join:

library(sqldf)

Output <- sqldf("select u.Ticket, u.Vehicle, u.Created, 
                        m.Ticket as Master_Ticket
                from User u left join Master m 
                  on (u.Vehicle = m.Vehicle and u.Created < m.Created)")

Output
#   Ticket Vehicle    Created Master_Ticket
# 1      A    7164 2017-01-01             G
# 2      A    7164 2017-01-01             R
# 3      B    7163 2017-02-01          <NA>
# 4      C    7162 2017-01-26             H
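
If sqldf is not available, roughly the same result can be sketched in base R with merge(). This is an alternative to the SQL above, not part of it. It assumes the corrected date for F, and the names pairs and later are hypothetical helpers.

```r
User <- read.csv(text = "Ticket,Vehicle,Created
A,7164,1/1/2017
B,7163,1/2/2017
C,7162,26/1/2017", stringsAsFactors = FALSE)

Master <- read.csv(text = "Ticket,Vehicle,Created
E,7164,29/12/2016
F,7163,26/12/2016
G,7164,31/1/2017
R,7164,02/02/2017
H,7162,28/1/2017", stringsAsFactors = FALSE)

User$Created   <- as.Date(User$Created,   format = "%d/%m/%Y")
Master$Created <- as.Date(Master$Created, format = "%d/%m/%Y")

# Pair every User row with every Master row for the same Vehicle
pairs <- merge(User, Master, by = "Vehicle", suffixes = c(".u", ".m"))

# Keep only Master tickets created after the User ticket
later <- pairs[pairs$Created.u < pairs$Created.m, c("Ticket.u", "Ticket.m")]
names(later) <- c("Ticket", "Master_Ticket")

# Left join back onto User so tickets without a later match keep an NA
Output <- merge(User, later, by = "Ticket", all.x = TRUE)
Output <- Output[order(Output$Ticket, Output$Master_Ticket), ]
Output
```

The two-step approach (inner join, filter, then left join back) is used because merge() only supports equality conditions, unlike the SQL join with its `<` comparison.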

Edit: If you want only one row per User ticket, one way is to aggregate:

Output2 <- sqldf("select u.Ticket, u.Vehicle, u.Created,
                         group_concat(m.Ticket, ' ') as Master_Tickets
                 from User u left join Master m 
                   on (u.Vehicle = m.Vehicle and u.Created < m.Created)
                 group by u.Ticket, u.Vehicle, u.Created")

Output2
#   Ticket Vehicle    Created Master_Tickets
# 1      A    7164 2017-01-01            G R
# 2      B    7163 2017-02-01           <NA>
# 3      C    7162 2017-01-26              H

If for some reason you absolutely need one column per match:

library(dplyr)
library(reshape2)

Output3 = Output %>%
  group_by(Ticket) %>%
  mutate(column_name = paste0('Ticket.', row_number())) %>%
  dcast(Ticket + Vehicle + Created ~ column_name, value.var = "Master_Ticket")

Output3
#   Ticket Vehicle    Created Ticket.1 Ticket.2
# 1      A    7164 2017-01-01        G        R
# 2      B    7163 2017-02-01     <NA>     <NA>
# 3      C    7162 2017-01-26        H     <NA>
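
Since the question attempted this with reshape(), here is a hedged base R sketch of the same long-to-wide step. It rebuilds the joined Output by hand so that it runs on its own; match_no and wide are illustrative names, not part of the answer above.

```r
# The long result of the join, typed in by hand so the example is self-contained
Output <- data.frame(
  Ticket        = c("A", "A", "B", "C"),
  Vehicle       = c(7164, 7164, 7163, 7162),
  Created       = as.Date(c("2017-01-01", "2017-01-01", "2017-02-01", "2017-01-26")),
  Master_Ticket = c("G", "R", NA, "H"),
  stringsAsFactors = FALSE
)

# Number the matches within each User ticket: 1, 2, ...
Output$match_no <- ave(seq_along(Output$Ticket), Output$Ticket, FUN = seq_along)

# One row per User ticket, one Master_Ticket column per match number
wide <- reshape(Output, idvar = "Ticket", timevar = "match_no",
                v.names = "Master_Ticket", direction = "wide", sep = ".")

# Rename Master_Ticket.1, Master_Ticket.2 to Ticket.1, Ticket.2
names(wide) <- sub("^Master_Ticket", "Ticket", names(wide))
wide
```

The key difference from the attempt in the question is that timevar is a per-ticket match counter rather than the Ticket value itself.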

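Answer 1 (score: 1)

A dplyr solution, including the data correction for F pointed out by @Scarabee. lubridate is used for the date conversion. dplyr::rename() could be added to get more meaningful column names.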

library(lubridate)
User = read.csv(text = "
Ticket,Vehicle,Created
A,7164,1/1/2017
B,7163,1/2/2017
C,7162,26/1/2017", header = TRUE, stringsAsFactors=FALSE) 
User$Created <- dmy(User$Created)

Master = read.csv(text = "
Ticket,Vehicle,Created
E,7164,29/12/2016
F,7163,26/12/2016
G,7164,31/1/2017
H,7162,28/1/2017", header = TRUE, stringsAsFactors=FALSE) 
Master$Created <- dmy(Master$Created)

library(dplyr)
User %>% 
  left_join(Master, by = "Vehicle") %>% # left join keeps every row from User
  mutate(Ticket_y = ifelse(Created.x < Created.y, # apply date restriction
                           Ticket.y, NA)) %>%
  group_by(Ticket.x) %>%          # group by User ticket
  arrange(desc(Ticket_y)) %>%     # push NA values to end
  filter(row_number() == 1) %>%   # keep only the first row within each group
  ungroup() %>%                   # remove grouping
  select(Ticket.x, Created.x, Ticket_y) %>% # keep columns of interest
  arrange(Ticket.x)  # sort

#   Ticket.x  Created.x Ticket_y
#      <chr>     <date>    <chr>
# 1        A 2017-01-01        G
# 2        B 2017-02-01     <NA>
# 3        C 2017-01-26        H