Alternative to mapply in R

Time: 2016-07-01 15:22:56

Tags: r

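I have a data frame with the following columns:

1) Store 2) DayOfWeek
3) Date 4) Sales
5) Customers
6) Open 7) Promo
8) StateHoliday 9) SchoolHoliday
10) StoreType
11) Assortment
12) CompetitionDistance 13) CompetitionOpenSinceMonth 14) CompetitionOpenSinceYear 15) Promo2
16) Promo2SinceWeek 17) Promo2SinceYear 18) PromoInterval
19) CompanyDistanceBin
20) CompetitionOpenSinceDate 21) DaysSinceCompetionOpen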

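I am trying to calculate the average sales for the previous quarter relative to each row's date (basically date minus 3 months). However, I also need to subset by DayOfWeek and Promo. I wrote a function and I am applying it with mapply.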

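library(lubridate)  # ymd(), ddays(), months()
library(plyr)       # ddply(), summarize

# Average sales for one store over the 3 months before storeDate,
# restricted to rows with the same DayOfWeek and Promo value.
quarter.store.sales.func <- function(storeId, storeDate, dayofweekvar, promotion)
{   
    storeDate = as.Date(storeDate,"%Y-%m-%d")
    EndDate = ymd(as.Date(storeDate)) + ddays(-1)
    # Note: the next line overwrites EndDate, so the window ends on storeDate itself
    EndDate = as.Date(storeDate,"%Y-%m-%d")
    StartDate = ymd(storeDate + months(-3))
    StartDate = as.Date(StartDate)

    # Rows inside the window for this store, weekday and promo flag
    quarterStoresales <- subset(saleswithstore, Date >= StartDate & Date <= EndDate & Store == storeId & DayOfWeek == dayofweekvar & Promo == promotion)
    quarterSales = 0
    salesDf <- ddply(quarterStoresales,.(Store),summarize,avgSales=mean(Sales))  

    # Falls back to 0 when no rows match
    if (nrow(salesDf)>0)
      quarterSales = as.numeric(round(salesDf$avgSales,digits=0))     

    return(quarterSales)
}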

saleswithstore$QuarterSales <- mapply(quarter.store.sales.func, saleswithstore$Store, saleswithstore$Date, saleswithstore$DayOfWeek, saleswithstore$Promo)

head(exampleset)
    Store DayOfWeek       Date Sales Promo
186     1         3 2013-06-05  5012     1
296     1         3 2013-04-10  4903     1
337     1         3 2013-05-29  5784     1
425     1         3 2013-05-08  5230     0
449     1         3 2013-04-03  4625     0
477     1         3 2013-03-27  6660     1

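saleswithstore is a data frame with 1,000,000 rows, so this solution is not workable: it performs terribly and takes forever. Is there a better, more efficient way to take a specific subset of a data frame like this and then compute the average the way I want?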

I am open to any suggestions. I admit I am new to R.

1 answer:

Answer 0: (score: 0)

@maubin0316, your intuition in the comments is right: you can group by the remaining variables. I put this example together using data.table:

library(data.table)
set.seed(343)

# Create sample data (dates are drawn from a Date sequence so the column keeps class Date)
dt <- data.table('Store' = sample(1:10, 100, replace=TRUE),
                 'DayOfWeek' = sample(1:7, 100, replace=TRUE),
                 'Date' = sample(seq(as.Date('2013-01-01'), as.Date('2013-06-30'), by='day'), 100, replace=TRUE),
                 'Sales' = sample(1000:10000, 100),
                 'Promo' = sample(c(0,1), 100, replace=TRUE))

QuarterStartDate <- as.Date('2013-01-01')
QuarterEndDate <- as.Date('2013-03-31')

# Function to calculate your quarterly sales
QuarterlySales <- function(startDate, endDate, data){
  # Limit between your dates, group by your variables of interest
  data <- data[between(Date,startDate,endDate),list(TotalSales=sum(Sales)), by=list(Store,DayOfWeek,Promo)]
  # Sort in an order that makes sense
  data <- data[order(Store, DayOfWeek, Promo)]
  return(data)
}

salesSummary <- QuarterlySales(QuarterStartDate, QuarterEndDate, dt)
salesSummary
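
The summary above covers one fixed quarter and reports totals, whereas the question asks for an average over a window that trails each row's own date. One possible way to get that without calling a function per row is a data.table non-equi self-join. The following is only a sketch under stated assumptions: the sample table `sales`, the helper columns `WindowStart` and `WindowEnd`, and the result object `res` are hypothetical names; the window is approximated as the 90 days before each date (rather than exactly 3 calendar months) and it ends the day before the row's date.

```r
library(data.table)
set.seed(343)

# Hypothetical sample data; column names follow the question
n <- 200
sales <- data.table(
  Store = sample(1:3, n, replace = TRUE),
  Date  = sample(seq(as.Date("2013-01-01"), as.Date("2013-06-30"), by = "day"),
                 n, replace = TRUE),
  Sales = sample(1000:10000, n, replace = TRUE),
  Promo = sample(c(0L, 1L), n, replace = TRUE)
)
sales[, DayOfWeek := as.integer(format(Date, "%u"))]  # ISO weekday, 1 = Monday

# Assumption: "previous quarter" is the 90 days ending the day before Date
sales[, WindowStart := Date - 90]
sales[, WindowEnd   := Date - 1]

# Non-equi self-join: for each row (i), find all rows (x) with the same
# Store, DayOfWeek and Promo whose Date lies in that row's window,
# then average their Sales. by = .EACHI evaluates j once per row of i.
res <- sales[sales,
             on = .(Store, DayOfWeek, Promo,
                    Date >= WindowStart, Date <= WindowEnd),
             .(QuarterSales = round(mean(x.Sales))),
             by = .EACHI]

# Rows with no earlier match come back as NA; use 0 as the question's function does
res[is.na(QuarterSales), QuarterSales := 0]

# res has one row per row of sales, in the same order
sales[, QuarterSales := res$QuarterSales]
head(sales[order(Store, DayOfWeek, Promo, Date)])
```

Because the matching is done by the join rather than by a full `subset()` scan for every row, this should scale much better than the `mapply` version, though that is worth confirming with a timing on the real 1,000,000-row table. If exact calendar months matter, the window start could be computed with a month-aware date function instead of subtracting 90 days.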