How to update a data.frame based on information from another data.frame

Date: 2019-10-28 12:23:27

Tags: r

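I have two tables: Display and Review. The Review table contains information about reviews of products in an online store. Each row gives the date of a review, the cumulative number of reviews, and the product's average rating up to that date.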

page_id<-c("1072659", "1072659" , "1072659","1072650","1072660","1072660")  
review_id<-c("1761023","1761028","1762361","1918387","1761427","1863914")
date<-as.Date(c("2013-07-11","2013-08-12","2014-07-15","2014-09-10","2013-07-27","2014-08-12"),format = "%Y-%m-%d")
cumulative_No_reviews<-c(1,2,3,1,1,2)
average_rating<-c(5,3.5,4,3,5,5)
Review<-data.frame(page_id,review_id,date,cumulative_No_reviews,average_rating)
page_id        review_id          date    cumulative_No_reviews   average_rating
1072659          1761023        2013-07-11      1                       5
1072659          1761028        2013-08-12      2                       3.5
1072659          1762361        2014-07-15      3                       4
1072650          1918387        2014-09-10      1                       3
1072660          1761427        2013-07-27      1                       5
1072660          1863914        2014-08-12      2                       5

The Display table records the dates on which customers visited a product page.

page_id<-c("1072659","1072659","1072659","1072650","1072650","1072660","1072660","1072660")
date<-as.Date(c("2013-07-10","2013-08-03","2015-02-11","2014-08-10","2014-09-09","2013-08-12","2014-09-12","2015-08-12"),format = "%Y-%m-%d")
Display<-data.frame(page_id,date)
page_id         date        
1072659     2013-07-10      
1072659     2013-08-03      
1072659     2015-02-11      
1072650     2014-08-10  
1072650     2014-09-09      
1072660     2013-08-12      
1072660     2014-09-12      
1072660     2015-08-12      

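I want to add two columns to the Display table (calling the result Display2) so that each visit reflects the most recent review information for that product at the time of the visit, as shown below: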

page_id<-c("1072659","1072659","1072659","1072650","1072650","1072660","1072660","1072660")
date<-as.Date(c("2013-07-10","2013-08-03","2015-02-11","2014-08-10","2014-09-09","2013-08-12","2014-09-12","2015-08-12"),format = "%Y-%m-%d")
cumulative_No_reviews<-c(0,1,3,0,0,1,2,2)
average_rating<-c(NA,5,4,NA,NA,5,5,5)
Display2<-data.frame(page_id,date,cumulative_No_reviews,average_rating)
 page_id            date        cumulative_No_reviews   average_rating
 1072659        2013-07-10                 0                NA
 1072659        2013-08-03                 1                5
 1072659        2015-02-11                 3                4
 1072650        2014-08-10                 0                NA
 1072650        2014-09-09                 0                NA
 1072660        2013-08-12                 1                5
 1072660        2014-09-12                 2                5
 1072660        2015-08-12                 2                5

Thanks for your help.

1 Answer:

Answer 0: (score: 3)

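You can do this with a data.table join. Join the Review table to the Display table where page_id matches and the Review date is earlier than the Display date. For some rows of Display, several rows of Review satisfy these conditions, so mult = 'last' keeps only the last matching row. Because Review is sorted by date within each page_id, the last match is the most recent review.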

library(data.table) # 1.12.6 for nafill (used below)
setDT(Display)
setDT(Review)

Display2 <- Review[Display, on = .(page_id, date < date), mult = 'last']
Display2
#    page_id review_id       date cumulative_No_reviews average_rating
# 1: 1072659      <NA> 2013-07-10                    NA             NA
# 2: 1072659   1761023 2013-08-03                     1              5
# 3: 1072659   1762361 2015-02-11                     3              4
# 4: 1072650      <NA> 2014-08-10                    NA             NA
# 5: 1072650      <NA> 2014-09-09                    NA             NA
# 6: 1072660   1761427 2013-08-12                     1              5
# 7: 1072660   1863914 2014-09-12                     2              5
# 8: 1072660   1863914 2015-08-12                     2              5
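
To make the matching rule concrete, here is a minimal base R sketch of the same logic, written as an explicit loop over the Display rows. It is only an illustration, not how data.table performs the join internally. The variable name `hit` is made up for this example, and the data is a trimmed copy of the question's tables built with stringsAsFactors = FALSE.

```r
# Trimmed copies of the question's tables (review_id left out for brevity).
Review <- data.frame(
  page_id = c("1072659", "1072659", "1072659", "1072650", "1072660", "1072660"),
  date = as.Date(c("2013-07-11", "2013-08-12", "2014-07-15",
                   "2014-09-10", "2013-07-27", "2014-08-12")),
  cumulative_No_reviews = c(1, 2, 3, 1, 1, 2),
  average_rating = c(5, 3.5, 4, 3, 5, 5),
  stringsAsFactors = FALSE
)
Display <- data.frame(
  page_id = c("1072659", "1072659", "1072659", "1072650",
              "1072650", "1072660", "1072660", "1072660"),
  date = as.Date(c("2013-07-10", "2013-08-03", "2015-02-11", "2014-08-10",
                   "2014-09-09", "2013-08-12", "2014-09-12", "2015-08-12")),
  stringsAsFactors = FALSE
)

# For each Display row, find the Review row with the same page_id and the
# latest date strictly before the visit; NA when no such review exists.
# `hit` is a hypothetical name used only in this sketch.
hit <- vapply(seq_len(nrow(Display)), function(i) {
  idx <- which(Review$page_id == Display$page_id[i] &
               Review$date < Display$date[i])
  if (length(idx) == 0L) {
    NA_integer_
  } else {
    idx[which.max(as.numeric(Review$date[idx]))]
  }
}, integer(1))

# Pull the review columns through the index; visits with no earlier review
# get 0 reviews and keep NA as the average rating.
Display2 <- Display
Display2$cumulative_No_reviews <- Review$cumulative_No_reviews[hit]
Display2$cumulative_No_reviews[is.na(hit)] <- 0
Display2$average_rating <- Review$average_rating[hit]
Display2
```

The loop scans Review once per Display row, so it is fine for small tables but will likely be much slower than the data.table join on large ones.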

This output now almost matches what you showed in the question; we only need to drop the review_id column and replace the NAs in the cumulative_No_reviews column with 0s.

Display2[, review_id := NULL]
Display2[, cumulative_No_reviews := nafill(cumulative_No_reviews, fill = 0)][]
#    page_id       date cumulative_No_reviews average_rating
# 1: 1072659 2013-07-10                     0             NA
# 2: 1072659 2013-08-03                     1              5
# 3: 1072659 2015-02-11                     3              4
# 4: 1072650 2014-08-10                     0             NA
# 5: 1072650 2014-09-09                     0             NA
# 6: 1072660 2013-08-12                     1              5
# 7: 1072660 2014-09-12                     2              5
# 8: 1072660 2015-08-12                     2              5
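
As a possible alternative, a rolling join should give the same result on this data. This is a sketch rather than part of the original answer. One difference to be aware of: roll = TRUE matches the latest Review date that is less than or equal to the Display date, so a review posted on the same day as a visit would be included, whereas the non-equi join above uses a strict <. The sample data has no same-day pairs for the same page_id, so the two approaches agree here. The NA replacement uses a plain subassignment instead of nafill, which avoids depending on a recent data.table version.

```r
library(data.table)

# Same data as in the question, built directly as data.tables.
Review <- data.table(
  page_id = c("1072659", "1072659", "1072659", "1072650", "1072660", "1072660"),
  review_id = c("1761023", "1761028", "1762361", "1918387", "1761427", "1863914"),
  date = as.Date(c("2013-07-11", "2013-08-12", "2014-07-15",
                   "2014-09-10", "2013-07-27", "2014-08-12")),
  cumulative_No_reviews = c(1, 2, 3, 1, 1, 2),
  average_rating = c(5, 3.5, 4, 3, 5, 5)
)
Display <- data.table(
  page_id = c("1072659", "1072659", "1072659", "1072650",
              "1072650", "1072660", "1072660", "1072660"),
  date = as.Date(c("2013-07-10", "2013-08-03", "2015-02-11", "2014-08-10",
                   "2014-09-09", "2013-08-12", "2014-09-12", "2015-08-12"))
)

# Exact match on page_id; the last column in `on` (date) is the rolling one.
# roll = TRUE carries the most recent earlier-or-equal Review row forward.
Display2 <- Review[Display, on = .(page_id, date), roll = TRUE]

# Drop the id column and turn "no review yet" into a count of 0.
Display2[, review_id := NULL]
Display2[is.na(cumulative_No_reviews), cumulative_No_reviews := 0]
Display2[]
```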