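I have two tables: Display and Review. The Review table contains information about product reviews in an online store. Each row records the date of a review, the cumulative number of reviews, and the product's average rating as of that date.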
page_id<-c("1072659", "1072659" , "1072659","1072650","1072660","1072660")
review_id<-c("1761023","1761028","1762361","1918387","1761427","1863914")
date<-as.Date(c("2013-07-11","2013-08-12","2014-07-15","2014-09-10","2013-07-27","2014-08-12"),format = "%Y-%m-%d")
cumulative_No_reviews<-c(1,2,3,1,1,2)
average_rating<-c(5,3.5,4,3,5,5)
Review<-data.frame(page_id,review_id,date,cumulative_No_reviews,average_rating)
page_id review_id date cumulative_No_reviews average_rating
1072659 1761023 2013-07-11 1 5
1072659 1761028 2013-08-12 2 3.5
1072659 1762361 2014-07-15 3 4
1072650 1918387 2014-09-10 1 3
1072660 1761427 2013-07-27 1 5
1072660 1863914 2014-08-12 2 5
The Display table records the dates on which customers visited a product page.
page_id<-c("1072659","1072659","1072659","1072650","1072650","1072660","1072660","1072660")
date<-as.Date(c("2013-07-10","2013-08-03","2015-02-11","2014-08-10","2014-09-09","2013-08-12","2014-09-12","2015-08-12"),format = "%Y-%m-%d")
Display<-data.frame(page_id,date)
page_id date
1072659 2013-07-10
1072659 2013-08-03
1072659 2015-02-11
1072650 2014-08-10
1072650 2014-09-09
1072660 2013-08-12
1072660 2014-09-12
1072660 2015-08-12
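I would like to add two columns to the Display table (calling the result Display2) so that each visit reflects the most recent review information available at the time of the visit, as shown below: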
page_id<-c("1072659","1072659","1072659","1072650","1072650","1072660","1072660","1072660")
date<-as.Date(c("2013-07-10","2013-08-03","2015-02-11","2014-08-10","2014-09-09","2013-08-12","2014-09-12","2015-08-12"),format = "%Y-%m-%d")
cumulative_No_reviews<-c(0,1,3,0,0,1,2,2)
average_rating<-c(NA,5,4,NA,NA,5,5,5)
Display2<-data.frame(page_id,date,cumulative_No_reviews,average_rating)
page_id date cumulative_No_reviews average_rating
1072659 2013-07-10 0 NA
1072659 2013-08-03 1 5
1072659 2015-02-11 3 4
1072650 2014-08-10 0 NA
1072650 2014-09-09 0 NA
1072660 2013-08-12 1 5
1072660 2014-09-12 2 5
1072660 2015-08-12 2 5
Thanks for your help.
Answer 0 (score: 3)
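You can do this with a data.table join. Join the Review table to the Display table where page_id matches and the Review date is earlier than the Display date. For some rows of Display, several rows of Review will satisfy these conditions, so with mult = 'last' we keep only the last matching row. Since Review is sorted by date within each page_id, the last match is the most recent review.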
library(data.table) # 1.12.6 for nafill (used below)
setDT(Display)
setDT(Review)
Display2 <- Review[Display, on = .(page_id, date < date), mult = 'last']
Display2
# page_id review_id date cumulative_No_reviews average_rating
# 1: 1072659 <NA> 2013-07-10 NA NA
# 2: 1072659 1761023 2013-08-03 1 5
# 3: 1072659 1762361 2015-02-11 3 4
# 4: 1072650 <NA> 2014-08-10 NA NA
# 5: 1072650 <NA> 2014-09-09 NA NA
# 6: 1072660 1761427 2013-08-12 1 5
# 7: 1072660 1863914 2014-09-12 2 5
# 8: 1072660 1863914 2015-08-12 2 5
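To see why mult = 'last' matters, it can help to count how many Review rows each Display row matches under the same join conditions. The following is only a sketch: it rebuilds the question's data, and the name match_counts is made up for illustration. It relies on by = .EACHI, which evaluates .N once per row of Display and gives 0 where nothing matches.

```r
library(data.table)

# Rebuild the question's data (only the columns needed for the join).
Review <- data.table(
  page_id = c("1072659", "1072659", "1072659", "1072650", "1072660", "1072660"),
  date = as.Date(c("2013-07-11", "2013-08-12", "2014-07-15",
                   "2014-09-10", "2013-07-27", "2014-08-12"))
)
Display <- data.table(
  page_id = c("1072659", "1072659", "1072659", "1072650",
              "1072650", "1072660", "1072660", "1072660"),
  date = as.Date(c("2013-07-10", "2013-08-03", "2015-02-11", "2014-08-10",
                   "2014-09-09", "2013-08-12", "2014-09-12", "2015-08-12"))
)

# Same conditions as the join above: equal page_id, Review date before Display date.
# by = .EACHI computes .N separately for each row of Display.
# match_counts is a hypothetical name used only in this sketch.
match_counts <- Review[Display, on = .(page_id, date < date), .N, by = .EACHI]
match_counts
```

Any Display row with N greater than 1 would come back as several rows without mult = 'last'. For example, the 2015-02-11 visit to page 1072659 has three earlier reviews.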
The join result now almost matches what you showed in the question. We only need to remove the review_id column and replace the NAs in the cumulative_No_reviews column with 0s.
Display2[, review_id := NULL]
Display2[, cumulative_No_reviews := nafill(cumulative_No_reviews, fill = 0)][]
# page_id date cumulative_No_reviews average_rating
# 1: 1072659 2013-07-10 0 NA
# 2: 1072659 2013-08-03 1 5
# 3: 1072659 2015-02-11 3 4
# 4: 1072650 2014-08-10 0 NA
# 5: 1072650 2014-09-09 0 NA
# 6: 1072660 2013-08-12 1 5
# 7: 1072660 2014-09-12 2 5
# 8: 1072660 2015-08-12 2 5
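As a cross-check on the result, the same lookup can be sketched in base R without data.table. This is not part of the approach above, just an illustration: the names idx and Display2_base are made up here, and the row-by-row loop will be slow on large tables.

```r
# Rebuild the question's data as plain data frames.
Review <- data.frame(
  page_id = c("1072659", "1072659", "1072659", "1072650", "1072660", "1072660"),
  date = as.Date(c("2013-07-11", "2013-08-12", "2014-07-15",
                   "2014-09-10", "2013-07-27", "2014-08-12")),
  cumulative_No_reviews = c(1, 2, 3, 1, 1, 2),
  average_rating = c(5, 3.5, 4, 3, 5, 5)
)
Display <- data.frame(
  page_id = c("1072659", "1072659", "1072659", "1072650",
              "1072650", "1072660", "1072660", "1072660"),
  date = as.Date(c("2013-07-10", "2013-08-03", "2015-02-11", "2014-08-10",
                   "2014-09-09", "2013-08-12", "2014-09-12", "2015-08-12"))
)

# For each visit, find the row index of the latest review of the same page
# dated strictly before the visit; NA when no such review exists.
idx <- vapply(seq_len(nrow(Display)), function(i) {
  hits <- which(Review$page_id == Display$page_id[i] &
                Review$date < Display$date[i])
  if (length(hits) == 0L) return(NA_integer_)
  hits[which.max(as.numeric(Review$date[hits]))]
}, integer(1))

# Indexing with NA yields NA, which is then replaced by 0 for the review count only.
Display2_base <- data.frame(
  Display,
  cumulative_No_reviews = ifelse(is.na(idx), 0, Review$cumulative_No_reviews[idx]),
  average_rating = Review$average_rating[idx]
)
Display2_base
```

Unlike mult = 'last', this picks the latest review by date explicitly, so it does not depend on Review being sorted.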