对从ddply创建的列执行计算

时间:2014-11-09 21:30:17

标签: r dataframe plyr split-apply-combine

我在ddply中使用subset来计算某些指标并根据需要汇总表格。我想要计算的一些指标需要使用由ddply操作创建的汇总列。

以下是具有简单计算列的函数:

subset_by_market <- function (q, marketname, dp) {
  subset(ddply(df, .(quarter, R.DMA.NAMES, daypart, station), summarise, 
               spot.count = length(spot.id), 
               station.investment = sum(rate),
               nullspots.male = sum(nullspot.male),
               nullspots.allpersons = sum(nullspot.allpersons),
               total.male.imp = sum(male.imp),
               total.allpersons.imp = sum(allpersons.imp),
               spotvalue.male = sum(spotvalue.male),
               spotvalue.allpersons = sum(spotvalue.allpersons)),
         quarter == q & R.DMA.NAMES == marketname & daypart == dp)
}

我使用subset_by_market ("Q32013" , "Columbus.OH", "primetime")来总结创建子集。我得到的表格如下:

  quarter R.DMA.NAMES   daypart           station spot.count station.investment nullspots.male     nullspots.allpersons
10186  Q32013 Columbus.OH primetime ADSM COLUMBUS, OH        103               5150             67                   61
10187  Q32013 Columbus.OH primetime              ESYX         49                  0             49                   49
10188  Q32013 Columbus.OH primetime  MTV COLUMBUS, OH         61               4500              7                    1
10189  Q32013 Columbus.OH primetime     WCMH-Retro TV         94                564             93                   93
10190  Q32013 Columbus.OH primetime              WTTE          1                  0              0                    0
10191  Q32013 Columbus.OH primetime              WWHO          9                  0              2                    2
  total.male.imp total.allpersons.imp spotvalue.male spotvalue.allpersons
10186           47.2                127.7       4830.409            4775.1068
10187            0.0                  0.0            NaN                  NaN
10188          157.9                371.1       4649.746            4505.2608
10189            0.3                  0.3       3162.000            3162.0000
10190            3.5                 10.3        570.166             591.0231
10191            3.9                 15.8       7155.000            4356.4162

问题1:我想添加相同的数据框,例如:spot.count的百分比值。 = spot.count / sum(spot.count)(ii)percent.nullspots.male = nullspots.male / sum(nullspots.male)

但是,当我将其添加到ddply参数时,我在结果列中得到1(100%)。该值除以自身而不是除以列的总和。

问题2:这是缓慢且谦卑的我接受这可能不是最佳编码。我正在使用带有64位操作系统的16Gb ddr3 RAM的i5-2.6GHz PC。数据集是1M行。

system.time(subset_by_market ("Q32013" , "Albuquerque.Santa.Fe", "late fringe"))
   user  system elapsed 
 228.13  176.84  416.12

目的是在在线仪表板上显示所有计算的指标,并允许用户使用下拉菜单选择subset_by_market (q , marketname, dp)。我怎样才能让它更快?

添加样本数据:

`> structure(list(market = c("Local", "Local", "Local", "Local", 
"Local", "Local", "Local", "NATIONAL CABLE", "Local", "Local"
), spot.id = c(11248955L, 11262196L, 11946349L, 11625265L, 12929889L, 
11259758L, 11517638L, 11599834L, 12527365L, 12930259L), date = structure(c(1375675200, 
1376625600, 1390280400, 1383627600, 1401249600, 1375848000, 1380772800, 
1383019200, 1397102400, 1401163200), class = c("POSIXct", "POSIXt"
), tzone = ""), hour = c(15, 17, 11, 18, 19, 1, 13, 14, 16, 22
), time = structure(c(0.642361111111111, 0.749305555555556, 0.481944444444444, 
0.770138888888889, 0.830555555555556, 0.0597222222222222, 0.582638888888889, 
0.597222222222222, 0.675694444444444, 0.930555555555556), format = "h:m:s", class = "times"), 
    local.date = structure(c(1375675200, 1376625600, 1390280400, 
    1383627600, 1401249600, 1375848000, 1380772800, 1383019200, 
    1397102400, 1401163200), class = c("POSIXct", "POSIXt"), tzone = ""), 
    local.hour = c(15, 17, 11, 18, 18, 0, 13, 14, 15, 22), local.time = structure(c(0.642361111111111, 
    0.749305555555556, 0.481944444444444, 0.770138888888889, 
    0.788888888888889, 0.0180555555555556, 0.582638888888889, 
    0.597222222222222, 0.634027777777778, 0.930555555555556), format = "h:m:s", class = "times"), 
    vendor = c("Time Warner - Myrtle Beach", "WMYD", "WSBK", 
    "WDCA", "Comcast - Memphis", "Charter Media - Birmingham", 
    "WBNA", "G4", "Comcast - Houston", "Comcast - Youngstown"
    ), station = c("VH-1 MYRTLE BEACH", "WMYD", "WSBK", "WDCA", 
    "COM MEMPHIS", "FX BIRMINGHAM", "WBNA", "G4", "SPK HOUSTON", 
    "COM YOUNGSTOWN CC"), male.imp = c(0, 2, 0, 0, 0.6, 0.4, 
    0, 0, 3.9, 0), women.imp = c(0, 2.5, 0, 2.5, 0.2, 0.6, 0, 
    0, 4.6, 0.6), allpersons.imp = c(0, 3.5, 0, 2.5, 0.8, 0.8, 
    0, 0, 7.8, 0.6), hh.imp = c(0, 8.5, 8, 64.5, 1.3, 2.9, 1.3, 
    15, 13.7, 1), isci = c("IT6140MB", "ITCD78DT", "IT6192BS", 
    "IT6170WD", "IT6173ME", "IT6162BI", "IT6155LO", "ITES13410", 
    "IT3917", "IT3921"), creative = c("Eugene Elbert (Bach. Tcom Eng. Tech) :60", 
    "The Problem Solvers (revised) - IET :60", "Murtech/Kinetic/Integra :60", 
    "Kevin Bumper/NTSG/Lifetime :60", "NCR/Schlumberger/Sprint (revised) (Bach) :60", 
    "Skills Gap (revised) /Kevin :60", "Rising Costs60 (Opportunity Scholar - No Nursing)", 
    "Irina Lund (Bach. ISS) :60", "Augustine Lopez (A. CEET) :30 (no loc)", 
    "John Ryan Ellis (B. PM/A. CDD) :30 (no loc)"), program = c(NA, 
    "TYLER PERRY'S MEET THE BROWNS", "THE PEOPLE'S COURT", "Judge Judy", 
    NA, NA, "Meet the Browns/Are We There Yet/News/Wendy Willia", 
    "HEROES", "Spike EF Rotator", NA), rate = c(5, 230, 100, 
    625, 40, 0, 15, 40, 110, 7), R.DMA.NAMES = c("Myrtle.Beach.Florence", 
    "Detroit", "Boston.Manchester.", "Washington.DC.Hagrstwn.", 
    "Memphis", "Birmingham.Ann.and.Tusc.", "Louisville", "NATIONAL CABLE", 
    "Houston", "Youngstown"), date.time = c("2013-08-05 15:25:00", 
    "2013-08-16 17:59:00", "2014-01-21 11:34:00", "2013-11-05 18:29:00", 
    "2014-05-28 19:56:00", "2013-08-07 01:26:00", "2013-10-03 13:59:00", 
    "2013-10-29 14:20:00", "2014-04-10 16:13:00", "2014-05-27 22:20:00"
    ), daypart = c("afternoon", "evening", "morning", "evening", 
    "evening", "late fringe", "afternoon", "afternoon", "afternoon", 
    "primetime"), quarter = structure(c(4L, 4L, 1L, 6L, 3L, 4L, 
    6L, 6L, 3L, 3L), .Label = c("Q12014", "Q22013", "Q22014", 
    "Q32013", "Q32014", "Q42013"), class = "factor"), cpi.allpersons = c(96.2179487179487, 
    79.0114068441065, 35.1219512195122, 82.3322348711803, 30, 
    0, 138.721804511278, 28.3135215453195, 28.2384088854449, 
    86.6666666666667), cpi.male = c(750.5, 188.882673751923, 
    115.959004392387, 144.492639327024, 38.9847715736041, 0, 
    595.161290322581, 34.7402005469462, 62.010777084515, 156.712328767123
    ), spotvalue.allpersons = c(0, 276.539923954373, 0, 205.830587177951, 
    24, 0, 0, 0, 220.25958930647, 52), spotvalue.male = c(0, 
    377.765347503846, 0, 0, 23.3908629441624, 0, 0, 0, 241.842030629609, 
    0), nullspot.allpersons = c(1, 0, 1, 0, 0, 0, 1, 1, 0, 0), 
    nullspot.male = c(1, 0, 1, 1, 0, 0, 1, 1, 0, 1)), .Names = c("market", 
"spot.id", "date", "hour", "time", "local.date", "local.hour", 
"local.time", "vendor", "station", "male.imp", "women.imp", "allpersons.imp", 
"hh.imp", "isci", "creative", "program", "rate", "R.DMA.NAMES", 
"date.time", "daypart", "quarter", "cpi.allpersons", "cpi.male", 
"spotvalue.allpersons", "spotvalue.male", "nullspot.allpersons", 
"nullspot.male"), row.names = c(561147L, 261262L, 89888L, 941010L, 
500366L, 65954L, 484053L, 598996L, 380976L, 968615L), class = "data.frame")`

为丑陋的dput道歉。

1 个答案:

答案 0 :(得分:0)

这只回答了我关于使功能更快的第二个问题。根据@beginneR提示,我将函数转换为dplyr

subset_by_market <- function (q, marketname, dp) {

  subset(df %>% group_by(quarter, R.DMA.NAMES, daypart, station) %>%
  summarize (spot.count = length(spot.id), station.investment = sum(rate),
             nullspots.male = sum(nullspot.male),
             nullspots.allpersons = sum(nullspot.allpersons),
             total.male.imp = sum(male.imp),
             total.allpersons.imp = sum(allpersons.imp),
             spotvalue.male = sum(spotvalue.male),
             spotvalue.allpersons = sum(spotvalue.allpersons),
             male.imp.per.spot = total.male.imp / spot.count,
             allpersons.imp.per.spot = total.allpersons.imp / spot.count,
             cost.per.spot = station.investment / spot.count,
             male.value.per.spot = spotvalue.male / spot.count,
             allpersons.value.per.spot = spotvalue.allpersons / spot.count), 
  quarter == q & R.DMA.NAMES == marketname & daypart == dp) }

这大大缩短了时间:

> system.time(subset_by_market ("Q32013" , "Albuquerque.Santa.Fe", "late fringe"))
   user  system elapsed 
   1.06    0.00    1.09 

我在使用dplyr时遇到的故障是我的数据中名为“time”的列,该列来自包times的类chron。我一直收到错误Error: column 'local.time' has unsupported type。我无法确定这方面的确切工作,所以我只是使用POSIXct将其转换为df$time <- as.POSIXct(as.character(df$time, format = "%H:%M:%S"))类。这不是最佳的,因为我使用times将其转换为chron的原因是为了保持时间年表而不需要日期或时区。更多相关内容:work around to specifying date range in ifelse statement in R。但是,它解决了眼前的问题。