My goal is to obtain the average number of days between purchases of a given product. If Product_A is purchased three times over a given period ('2012-12-01', '2012-12-05', '2012-12-10'), then the average order interval is the average of 4 and 5, i.e. 4.5 days.
I wrote a for loop to calculate the interval between consecutive orders (I can then use the aggregate function to calculate the mean or median by product), but I keep getting a length error. This needs to be a scalable solution.
Here is a sample dataframe:
product_info <- data.frame(productId = c("A", "A", "A", "B","B","B"),
order_date = c("2014-05-01", "2014-05-05", "2014-05-10", "2014-06-01","2014-06-07", "2014-06-18"), stringsAsFactors=FALSE)
Here is my code:
for (i in 2:length(unique(product_info$productId))){
if(product_info$productId[i]==product_info$productId[i-1]){
product_info$interval[i] <- as.integer(difftime(product_info$order_date[i],product_info$order_date[i-1]))
}
}
My desired output should be:
product_info <- data.frame(productId = c("A", "A", "A", "B","B","B"),
order_date = c("2014-05-01", "2014-05-05", "2014-05-10", "2014-06-01","2014-06-07", "2014-06-18"),
interval= c("0", "4", "5", "0","6","11"), stringsAsFactors=FALSE)
Any help would be very much appreciated.
Thank you
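A note on the length error itself: the loop above appears to fail for two reasons. It iterates over 2:length(unique(product_info$productId)), which is the number of distinct products (2) rather than the number of rows (6), and it assigns into product_info$interval before that column exists, so R tries to attach a 2-element vector to a 6-row data frame. A minimal corrected sketch, assuming the rows are already sorted by product and date (the `dates` helper is only illustrative):

```r
product_info <- data.frame(
  productId  = c("A", "A", "A", "B", "B", "B"),
  order_date = c("2014-05-01", "2014-05-05", "2014-05-10",
                 "2014-06-01", "2014-06-07", "2014-06-18"),
  stringsAsFactors = FALSE
)

# Pre-allocate the column so every assignment hits an existing full-length vector
product_info$interval <- 0L

# Illustrative helper: parse the dates once instead of on every iteration
dates <- as.Date(product_info$order_date)

# Iterate over rows 2..n, not over the number of distinct products
for (i in seq_len(nrow(product_info))[-1]) {
  if (product_info$productId[i] == product_info$productId[i - 1]) {
    product_info$interval[i] <- as.integer(dates[i] - dates[i - 1])
  }
}
```

This keeps the original row-by-row logic; the vectorised approaches in the answers should scale better on large data.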
3 answers:
Answer 0 (score: 3)
You can try
product_info$order_date <- as.Date(product_info$order_date)
product_info$interval <- with(product_info, ave(as.numeric(order_date),
productId, FUN=function(x) c(0, diff(x))))
product_info
productId order_date interval
1 A 2014-05-01 0
2 A 2014-05-05 4
3 A 2014-05-10 5
4 B 2014-06-01 0
5 B 2014-06-07 6
6 B 2014-06-18 11
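To get from these intervals to the per-product average described in the question, one option is to follow up with aggregate. The sketch below is an assumption about how that step might look: it uses NA instead of 0 for each product's first order, because a 0 placeholder would pull the mean for product A down to 3 rather than the 4.5 the question expects. The name `avg_interval` is illustrative.

```r
product_info <- data.frame(
  productId  = c("A", "A", "A", "B", "B", "B"),
  order_date = c("2014-05-01", "2014-05-05", "2014-05-10",
                 "2014-06-01", "2014-06-07", "2014-06-18"),
  stringsAsFactors = FALSE
)
product_info$order_date <- as.Date(product_info$order_date)

# Same grouped diff as above, but NA marks each product's first order
product_info$interval <- with(product_info,
  ave(as.numeric(order_date), productId,
      FUN = function(x) c(NA, diff(x))))

# Mean gap per product; the formula method drops NA rows by default (na.action = na.omit)
avg_interval <- aggregate(interval ~ productId, data = product_info, FUN = mean)
```

Swapping FUN = mean for FUN = median gives the median interval instead.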
Or using data.table
library(data.table)#v1.9.5+
setDT(product_info)[,interval := c(0, diff(as.Date(order_date))) , productId]
If 'order_date' is not already sorted, we have to order it before taking the diff:
setDT(product_info)[, order_date:= as.Date(order_date)
][order(order_date), interval :=as.numeric(order_date -
shift(order_date, fill=order_date[1L])) , by = productId]
# productId order_date interval
#1: A 2014-05-01 0
#2: A 2014-05-05 4
#3: A 2014-05-10 5
#4: B 2014-06-01 0
#5: B 2014-06-07 6
#6: B 2014-06-18 11
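The same ordering concern applies to the base R version with ave. A minimal sketch, assuming a shuffled copy of the sample data (the `shuffled` data frame is made up for illustration), sorts by product and date first so that diff sees consecutive orders:

```r
# Hypothetical shuffled version of the sample data
shuffled <- data.frame(
  productId  = c("A", "B", "A", "B", "A", "B"),
  order_date = as.Date(c("2014-05-10", "2014-06-07", "2014-05-01",
                         "2014-06-18", "2014-05-05", "2014-06-01")),
  stringsAsFactors = FALSE
)

# Sort by product, then by date within each product
shuffled <- shuffled[order(shuffled$productId, shuffled$order_date), ]

# Grouped diff, with 0 for the first order of each product
shuffled$interval <- with(shuffled,
  ave(as.numeric(order_date), productId,
      FUN = function(x) c(0, diff(x))))
```

Without the sort, diff would subtract dates in row order and could return negative intervals.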
Answer 1 (score: 2)
Convert to date format -
product_info$order_date <- as.Date(product_info$order_date)
Using dplyr:
library(dplyr)
product_info %>% group_by(productId) %>%
mutate(interval = c(0, diff(order_date)))
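Answer 2 (score: 2)
Here is a dplyr solution. You first convert to date format, then sort by date, group by product, and finally add a column holding the difference between each order date and the previous one for that product. Note that the 0-day entries are replaced by NA, which IMHO is more appropriate than 0.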
library(dplyr)
product_info <- product_info %>%
mutate(order_date=as.Date(order_date)) %>%
arrange(order_date) %>%
group_by(productId) %>%
mutate(interval=order_date-lag(order_date))
product_info
productId order_date interval
1 A 2014-05-01 NA days
2 A 2014-05-05 4 days
3 A 2014-05-10 5 days
4 B 2014-06-01 NA days
5 B 2014-06-07 6 days
6 B 2014-06-18 11 days
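One practical upside of NA here is that the first order of each product can be skipped when averaging. As a sketch of the final step the question asks for, assuming the same sample data, the pipeline can be extended with summarise (the names `avg_by_product` and `avg_interval` are illustrative):

```r
library(dplyr)

product_info <- data.frame(
  productId  = c("A", "A", "A", "B", "B", "B"),
  order_date = c("2014-05-01", "2014-05-05", "2014-05-10",
                 "2014-06-01", "2014-06-07", "2014-06-18"),
  stringsAsFactors = FALSE
)

avg_by_product <- product_info %>%
  mutate(order_date = as.Date(order_date)) %>%
  arrange(order_date) %>%
  group_by(productId) %>%
  # Convert the difftime to a plain number of days
  mutate(interval = as.numeric(order_date - lag(order_date), units = "days")) %>%
  # na.rm = TRUE skips the NA on each product's first order
  summarise(avg_interval = mean(interval, na.rm = TRUE))
```

For the sample data this should give 4.5 days for product A, matching the example in the question.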