How can I reshape an observation table in R into a wide table for cohort analysis?

Time: 2018-03-26 15:01:03

Tags: r dataframe

I have this data frame:

        Date Visitor-ID
1 2018-01-01          1
2 2018-01-01          2
3 2018-01-01          3
4 2018-01-02          2
5 2018-01-02          3
6 2018-01-02          2
7 2018-01-03          2
8 2018-01-03          3

The data frame is generated by the following code:

myDF=data.frame(c("2018-01-01","2018-01-01","2018-01-01","2018-01-02","2018-01-02","2018-01-02","2018-01-03","2018-01-03"),c(1,2,3,2,3,2,2,3))
names(myDF)=c("Date","Visitor-ID")

I want to transform the original data frame into this new data frame:

        Date   day 0    day 1   day 2
1 2018-01-01       3        2       2   
2 2018-01-02       2        2  
3 2018-01-03       2

In the new data frame, each cell is the number of unique visitors on day x who had already visited on the date given in that row.
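To make that definition concrete, here is a minimal sketch of how a single cell could be computed by hand: the "day 1" cell of the 2018-01-01 row. The variable names `cohort`, `day_one` and `returned` are made up for illustration and are not part of the question.

```r
# Rebuild the example data from the question.
myDF <- data.frame(
  Date = c("2018-01-01", "2018-01-01", "2018-01-01",
           "2018-01-02", "2018-01-02", "2018-01-02",
           "2018-01-03", "2018-01-03"),
  `Visitor-ID` = c(1, 2, 3, 2, 3, 2, 2, 3),
  check.names = FALSE  # keep the hyphen in "Visitor-ID"
)

# Unique visitors seen on the row date (the cohort) and one day later.
cohort  <- unique(myDF$`Visitor-ID`[myDF$Date == "2018-01-01"])
day_one <- unique(myDF$`Visitor-ID`[myDF$Date == "2018-01-02"])

# The "day 1" cell of the 2018-01-01 row: cohort members who came back.
returned <- length(intersect(cohort, day_one))
print(returned)
```

Visitors 2 and 3 are in both sets, which is where the 2 in that cell of the desired table comes from.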

Question: Which lines of code can I use to do this?

2 answers:

Answer 0: (score: 1)

Is this what you need?

library(tidyr)
library(dplyr)
# collect each date's visitor IDs into a list column so they can be intersected after the merge
df=myDF%>%group_by(Date)%>%summarise(s=list(`Visitor-ID`))
df['key']=1# helper key: merging on it pairs every date with every other date
s=merge(df,df,by='key')
s['New']=apply(s,1,function(x) length(intersect(x$s.x, x$s.y)))# number of visitors seen on both dates
s['day']=as.Date(s$Date.y)-as.Date(s$Date.x)# difference in days between the two dates
s=s[s$day>=0,]# keep only the same or later dates, i.e. look forward, never backward
s[,c('Date.x','New','day')]%>%tidyr::spread(day,New)# reshape the three columns into the wide table you need

      Date.x 0  1  2
1 2018-01-01 3  2  2
2 2018-01-02 2  2 NA
3 2018-01-03 2 NA NA
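The intersection step above compares every pair of dates. As a rough alternative sketch (my own assumption, not part of the answer), the same pairwise counts can also be read off a date-by-visitor presence matrix multiplied by its transpose. The names `present` and `shared` are made up for illustration.

```r
# Rebuild the example data from the question.
myDF <- data.frame(
  Date = c("2018-01-01", "2018-01-01", "2018-01-01",
           "2018-01-02", "2018-01-02", "2018-01-02",
           "2018-01-03", "2018-01-03"),
  `Visitor-ID` = c(1, 2, 3, 2, 3, 2, 2, 3),
  check.names = FALSE
)

# 1 if the visitor was seen on that date, 0 otherwise (repeat visits collapse to 1).
present <- 1 * (unclass(table(myDF$Date, myDF$`Visitor-ID`)) > 0)

# Entry [a, b] counts the visitors seen on both date a and date b.
shared <- present %*% t(present)
print(shared)
```

The diagonal holds the "day 0" counts, and the entries above the diagonal hold the later-day counts for each row date, so the wide table is a shifted copy of the upper triangle.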

Answer 1: (score: 0)

The code is a bit rough, but this should work for you:

myDF=data.frame(c("2018-01-01","2018-01-01","2018-01-01","2018-01-02","2018-01-02","2018-01-02","2018-01-03","2018-01-03"),c(1,2,3,2,3,2,2,3))
names(myDF)=c("Date","Visitor-ID")

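# convert to Date so that day offsets can be computed by subtraction
myDF$Date <- as.Date(myDF$Date)
num.days <- as.numeric(max(myDF$Date) - min(myDF$Date))
new.cols.names <- paste("day", 0:num.days)

unique.dates <- unique(myDF$Date)
# one row per date; column 1 is a placeholder for Date, the rest hold the counts
final.df <- matrix(0, ncol = length(new.cols.names)+1, nrow = length(unique.dates))
for (i in seq_along(unique.dates)){
  # visitors seen on the i-th date (the cohort)
  ids <- unique(myDF[myDF$Date == unique.dates[i], ]$`Visitor-ID`)
  # offsets past the last observed date are never filled and stay 0
  for (j in 0:(as.numeric(max(myDF$Date) - unique.dates[i]))){
    # cohort members who also appear j days later
    final.df[i, j+2] <- sum(ids %in% myDF[myDF$Date == unique.dates[i] + j, ]$`Visitor-ID`)
  }
}
final.df <- data.frame(final.df)
names(final.df) <- c("Date", new.cols.names)
final.df$Date <- unique.dates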

This works, but it may be slow on large data sets. You could use some form of sapply to make it more efficient. I hope this helps!
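For what the sapply suggestion might look like, here is a minimal sketch. It assumes the same example data, and the names `ids_by_date`, `cohort_row` and `wide` are hypothetical helpers introduced here, not part of the answer. Unlike the loop version, it leaves offsets with no matching date as NA instead of 0, and it is not necessarily faster.

```r
# Rebuild the example data from the question.
myDF <- data.frame(
  Date = as.Date(c("2018-01-01", "2018-01-01", "2018-01-01",
                   "2018-01-02", "2018-01-02", "2018-01-02",
                   "2018-01-03", "2018-01-03")),
  `Visitor-ID` = c(1, 2, 3, 2, 3, 2, 2, 3),
  check.names = FALSE
)

dates   <- sort(unique(myDF$Date))
day_num <- as.integer(dates)  # days since 1970-01-01, used for offset lookups

# Unique visitor IDs per date, computed once, in the same order as `dates`.
ids_by_date <- lapply(seq_along(dates), function(i) {
  unique(myDF$`Visitor-ID`[myDF$Date == dates[i]])
})

max_offset <- as.integer(max(dates) - min(dates))

# Hypothetical helper: for the i-th date, count returning visitors at each offset.
cohort_row <- function(i) {
  sapply(0:max_offset, function(k) {
    j <- match(day_num[i] + k, day_num)  # index of the date k days later, NA if absent
    if (is.na(j)) NA_integer_ else length(intersect(ids_by_date[[i]], ids_by_date[[j]]))
  })
}

# sapply returns one column per date, so transpose to get one row per date.
counts <- t(sapply(seq_along(dates), cohort_row))
colnames(counts) <- paste("day", 0:max_offset)
wide <- data.frame(Date = dates, counts, check.names = FALSE)
print(wide)
```

Computing the per-date ID sets once avoids re-filtering the whole data frame inside the inner loop, which is where the original version repeats most of its work.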