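I have a data frame with the following columns:
1) Store
2) DayOfWeek
3) Date
4) Sales
5) Customers
6) Open
7) Promo
8) StateHoliday
9) SchoolHoliday
10) StoreType
11) Assortment
12) CompetitionDistance
13) CompetitionOpenSinceMonth
14) CompetitionOpenSinceYear
15) Promo2
16) Promo2SinceWeek
17) Promo2SinceYear
18) PromoInterval
19) CompanyDistanceBin
20) CompetitionOpenSinceDate
21) DaysSinceCompetionOpen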
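I am trying to calculate the average sales for the previous quarter relative to each row's date (basically, the window from the date minus 3 months up to the date). However, I also need to subset on DayOfWeek and Promo. I wrote a function and I am applying it with mapply.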
library(lubridate)  # months(), %m-%
library(plyr)       # ddply()

quarter.store.sales.func <- function(storeId, storeDate, dayofweekvar, promotion)
{
  storeDate <- as.Date(storeDate, "%Y-%m-%d")
  # The window ends the day before the row's date and starts three months earlier
  EndDate <- storeDate - 1
  StartDate <- storeDate %m-% months(3)
  # Keep only rows for the same store, weekday and promo flag inside the window
  quarterStoresales <- subset(saleswithstore, Date >= StartDate & Date <= EndDate & Store == storeId & DayOfWeek == dayofweekvar & Promo == promotion)
  quarterSales <- 0
  salesDf <- ddply(quarterStoresales, .(Store), summarize, avgSales = mean(Sales))
  if (nrow(salesDf) > 0)
    quarterSales <- as.numeric(round(salesDf$avgSales, digits = 0))
  return(quarterSales)
}
saleswithstore$QuarterSales <- mapply(quarter.store.sales.func, saleswithstore$Store, saleswithstore$Date, saleswithstore$DayOfWeek, saleswithstore$Promo)
head(exampleset)
    Store DayOfWeek       Date Sales Promo
186     1         3 2013-06-05  5012     1
296     1         3 2013-04-10  4903     1
337     1         3 2013-05-29  5784     1
425     1         3 2013-05-08  5230     0
449     1         3 2013-04-03  4625     0
477     1         3 2013-03-27  6660     1
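saleswithstore is a data frame with 1,000,000 rows, so this solution is not workable: it performs terribly and takes forever to run. Is there a better, more efficient way to take a specific subset of a data frame like this for each row and then compute the average the way I want?
I am open to any suggestions. I admit that I am new to R.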
Answer 0 (score: 0)
@maubin0316, your intuition in the comments was correct: you can group by the remaining variables. I used data.table:
library(data.table)
set.seed(343)
# Create sample data (seq() keeps Date as a Date; `from:to` would turn it into integers)
dt <- data.table('Store' = sample(1:10, 100, replace=TRUE),
                 'DayOfWeek' = sample(1:7, 100, replace=TRUE),
                 'Date' = sample(seq(as.Date('2013-01-01'), as.Date('2013-06-30'), by='day'), 100, replace=TRUE),
                 'Sales' = sample(1000:10000, 100),
                 'Promo' = sample(c(0,1), 100, replace=TRUE))
QuarterStartDate <- as.Date('2013-01-01')
QuarterEndDate <- as.Date('2013-03-31')
# Function to calculate total sales per Store/DayOfWeek/Promo within a date range
QuarterlySales <- function(startDate, endDate, data){
  # Limit between your dates, group by your variables of interest
  data <- data[between(Date, startDate, endDate), list(TotalSales=sum(Sales)), by=list(Store, DayOfWeek, Promo)]
  # Sort in an order that makes sense
  data <- data[order(Store, DayOfWeek, Promo)]
  return(data)
}
salesSummary <- QuarterlySales(QuarterStartDate, QuarterEndDate, dt)
salesSummary
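
The function above summarizes one fixed quarter. The question asks for a window that moves with each row's date, so here is a minimal sketch of one way that might be done with a data.table non-equi self-join. It is not part of the original answer. The table name `sales` and the helper columns `WindowStart` and `WindowEnd` are hypothetical, the rows are the `exampleset` rows from the question, and "3 months" is approximated as 90 days, which is an assumption (lubridate's `%m-%` could be used for calendar months instead).

```r
library(data.table)

# Hypothetical toy table built from the exampleset rows in the question
sales <- data.table(
  Store     = 1L,
  DayOfWeek = 3L,
  Date      = as.Date(c("2013-03-27", "2013-04-03", "2013-04-10",
                        "2013-05-08", "2013-05-29", "2013-06-05")),
  Sales     = c(6660, 4625, 4903, 5230, 5784, 5012),
  Promo     = c(1L, 0L, 1L, 0L, 1L, 1L)
)

# Assumption: the "previous quarter" is the 90 days ending the day before Date
sales[, `:=`(WindowStart = Date - 90L, WindowEnd = Date - 1L)]

# Non-equi self-join: for every row (taken as i), match rows of the same
# Store/DayOfWeek/Promo whose Date lies in [WindowStart, WindowEnd],
# and average their Sales. by = .EACHI yields one result row per row of i.
avg <- sales[sales,
             on = .(Store, DayOfWeek, Promo,
                    Date >= WindowStart, Date <= WindowEnd),
             .(AvgSales = mean(x.Sales)),
             by = .EACHI]

# Rows with no earlier match get NA; use 0, as the question's function does
q <- avg$AvgSales
q[is.na(q)] <- 0
sales[, QuarterSales := round(q)]
sales[, c("WindowStart", "WindowEnd") := NULL]
sales
```

Because the join is evaluated once over the whole table rather than once per row through `subset()`, it should scale much better than the `mapply` approach, although the actual speed on 1,000,000 rows would need to be measured.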