我在ddply
中使用subset
来计算某些指标并根据需要汇总表格。我想要计算的一些指标需要使用由ddply
操作创建的汇总列。
以下是具有简单计算列的函数:
subset_by_market <- function (q, marketname, dp) {
subset(ddply(df, .(quarter, R.DMA.NAMES, daypart, station), summarise,
spot.count = length(spot.id),
station.investment = sum(rate),
nullspots.male = sum(nullspot.male),
nullspots.allpersons = sum(nullspot.allpersons),
total.male.imp = sum(male.imp),
total.allpersons.imp = sum(allpersons.imp),
spotvalue.male = sum(spotvalue.male),
spotvalue.allpersons = sum(spotvalue.allpersons)),
quarter == q & R.DMA.NAMES == marketname & daypart == dp)
}
我使用subset_by_market ("Q32013" , "Columbus.OH", "primetime")
来总结创建子集。我得到的表格如下:
quarter R.DMA.NAMES daypart station spot.count station.investment nullspots.male nullspots.allpersons
10186 Q32013 Columbus.OH primetime ADSM COLUMBUS, OH 103 5150 67 61
10187 Q32013 Columbus.OH primetime ESYX 49 0 49 49
10188 Q32013 Columbus.OH primetime MTV COLUMBUS, OH 61 4500 7 1
10189 Q32013 Columbus.OH primetime WCMH-Retro TV 94 564 93 93
10190 Q32013 Columbus.OH primetime WTTE 1 0 0 0
10191 Q32013 Columbus.OH primetime WWHO 9 0 2 2
total.male.imp total.allpersons.imp spotvalue.male spotvalue.allpersons
10186 47.2 127.7 4830.409 4775.1068
10187 0.0 0.0 NaN NaN
10188 157.9 371.1 4649.746 4505.2608
10189 0.3 0.3 3162.000 3162.0000
10190 3.5 10.3 570.166 591.0231
10191 3.9 15.8 7155.000 4356.4162
问题1:我想添加相同的数据框,例如:spot.count的百分比值。 = spot.count / sum(spot.count)(ii)percent.nullspots.male = nullspots.male / sum(nullspots.male)
但是,当我将其添加到ddply
参数时,我在结果列中得到1(100%)。该值除以自身而不是除以列的总和。
问题2:这是缓慢且谦卑的我接受这可能不是最佳编码。我正在使用带有64位操作系统的16Gb ddr3 RAM的i5-2.6GHz PC。数据集是1M行。
system.time(subset_by_market ("Q32013" , "Albuquerque.Santa.Fe", "late fringe"))
user system elapsed
228.13 176.84 416.12
目的是在在线仪表板上显示所有计算的指标,并允许用户使用下拉菜单选择subset_by_market (q , marketname, dp)
。我怎样才能让它更快?
添加样本数据:
`> structure(list(market = c("Local", "Local", "Local", "Local",
"Local", "Local", "Local", "NATIONAL CABLE", "Local", "Local"
), spot.id = c(11248955L, 11262196L, 11946349L, 11625265L, 12929889L,
11259758L, 11517638L, 11599834L, 12527365L, 12930259L), date = structure(c(1375675200,
1376625600, 1390280400, 1383627600, 1401249600, 1375848000, 1380772800,
1383019200, 1397102400, 1401163200), class = c("POSIXct", "POSIXt"
), tzone = ""), hour = c(15, 17, 11, 18, 19, 1, 13, 14, 16, 22
), time = structure(c(0.642361111111111, 0.749305555555556, 0.481944444444444,
0.770138888888889, 0.830555555555556, 0.0597222222222222, 0.582638888888889,
0.597222222222222, 0.675694444444444, 0.930555555555556), format = "h:m:s", class = "times"),
local.date = structure(c(1375675200, 1376625600, 1390280400,
1383627600, 1401249600, 1375848000, 1380772800, 1383019200,
1397102400, 1401163200), class = c("POSIXct", "POSIXt"), tzone = ""),
local.hour = c(15, 17, 11, 18, 18, 0, 13, 14, 15, 22), local.time = structure(c(0.642361111111111,
0.749305555555556, 0.481944444444444, 0.770138888888889,
0.788888888888889, 0.0180555555555556, 0.582638888888889,
0.597222222222222, 0.634027777777778, 0.930555555555556), format = "h:m:s", class = "times"),
vendor = c("Time Warner - Myrtle Beach", "WMYD", "WSBK",
"WDCA", "Comcast - Memphis", "Charter Media - Birmingham",
"WBNA", "G4", "Comcast - Houston", "Comcast - Youngstown"
), station = c("VH-1 MYRTLE BEACH", "WMYD", "WSBK", "WDCA",
"COM MEMPHIS", "FX BIRMINGHAM", "WBNA", "G4", "SPK HOUSTON",
"COM YOUNGSTOWN CC"), male.imp = c(0, 2, 0, 0, 0.6, 0.4,
0, 0, 3.9, 0), women.imp = c(0, 2.5, 0, 2.5, 0.2, 0.6, 0,
0, 4.6, 0.6), allpersons.imp = c(0, 3.5, 0, 2.5, 0.8, 0.8,
0, 0, 7.8, 0.6), hh.imp = c(0, 8.5, 8, 64.5, 1.3, 2.9, 1.3,
15, 13.7, 1), isci = c("IT6140MB", "ITCD78DT", "IT6192BS",
"IT6170WD", "IT6173ME", "IT6162BI", "IT6155LO", "ITES13410",
"IT3917", "IT3921"), creative = c("Eugene Elbert (Bach. Tcom Eng. Tech) :60",
"The Problem Solvers (revised) - IET :60", "Murtech/Kinetic/Integra :60",
"Kevin Bumper/NTSG/Lifetime :60", "NCR/Schlumberger/Sprint (revised) (Bach) :60",
"Skills Gap (revised) /Kevin :60", "Rising Costs60 (Opportunity Scholar - No Nursing)",
"Irina Lund (Bach. ISS) :60", "Augustine Lopez (A. CEET) :30 (no loc)",
"John Ryan Ellis (B. PM/A. CDD) :30 (no loc)"), program = c(NA,
"TYLER PERRY'S MEET THE BROWNS", "THE PEOPLE'S COURT", "Judge Judy",
NA, NA, "Meet the Browns/Are We There Yet/News/Wendy Willia",
"HEROES", "Spike EF Rotator", NA), rate = c(5, 230, 100,
625, 40, 0, 15, 40, 110, 7), R.DMA.NAMES = c("Myrtle.Beach.Florence",
"Detroit", "Boston.Manchester.", "Washington.DC.Hagrstwn.",
"Memphis", "Birmingham.Ann.and.Tusc.", "Louisville", "NATIONAL CABLE",
"Houston", "Youngstown"), date.time = c("2013-08-05 15:25:00",
"2013-08-16 17:59:00", "2014-01-21 11:34:00", "2013-11-05 18:29:00",
"2014-05-28 19:56:00", "2013-08-07 01:26:00", "2013-10-03 13:59:00",
"2013-10-29 14:20:00", "2014-04-10 16:13:00", "2014-05-27 22:20:00"
), daypart = c("afternoon", "evening", "morning", "evening",
"evening", "late fringe", "afternoon", "afternoon", "afternoon",
"primetime"), quarter = structure(c(4L, 4L, 1L, 6L, 3L, 4L,
6L, 6L, 3L, 3L), .Label = c("Q12014", "Q22013", "Q22014",
"Q32013", "Q32014", "Q42013"), class = "factor"), cpi.allpersons = c(96.2179487179487,
79.0114068441065, 35.1219512195122, 82.3322348711803, 30,
0, 138.721804511278, 28.3135215453195, 28.2384088854449,
86.6666666666667), cpi.male = c(750.5, 188.882673751923,
115.959004392387, 144.492639327024, 38.9847715736041, 0,
595.161290322581, 34.7402005469462, 62.010777084515, 156.712328767123
), spotvalue.allpersons = c(0, 276.539923954373, 0, 205.830587177951,
24, 0, 0, 0, 220.25958930647, 52), spotvalue.male = c(0,
377.765347503846, 0, 0, 23.3908629441624, 0, 0, 0, 241.842030629609,
0), nullspot.allpersons = c(1, 0, 1, 0, 0, 0, 1, 1, 0, 0),
nullspot.male = c(1, 0, 1, 1, 0, 0, 1, 1, 0, 1)), .Names = c("market",
"spot.id", "date", "hour", "time", "local.date", "local.hour",
"local.time", "vendor", "station", "male.imp", "women.imp", "allpersons.imp",
"hh.imp", "isci", "creative", "program", "rate", "R.DMA.NAMES",
"date.time", "daypart", "quarter", "cpi.allpersons", "cpi.male",
"spotvalue.allpersons", "spotvalue.male", "nullspot.allpersons",
"nullspot.male"), row.names = c(561147L, 261262L, 89888L, 941010L,
500366L, 65954L, 484053L, 598996L, 380976L, 968615L), class = "data.frame")`
为丑陋的dput
道歉。
答案 0 :(得分:0)
这只回答了我关于使功能更快的第二个问题。根据@beginneR提示,我将函数转换为dplyr
。
subset_by_market <- function (q, marketname, dp) {
subset(df %>% group_by(quarter, R.DMA.NAMES, daypart, station) %>%
summarize (spot.count = length(spot.id), station.investment = sum(rate),
nullspots.male = sum(nullspot.male),
nullspots.allpersons = sum(nullspot.allpersons),
total.male.imp = sum(male.imp),
total.allpersons.imp = sum(allpersons.imp),
spotvalue.male = sum(spotvalue.male),
spotvalue.allpersons = sum(spotvalue.allpersons),
male.imp.per.spot = total.male.imp / spot.count,
allpersons.imp.per.spot = total.allpersons.imp / spot.count,
cost.per.spot = station.investment / spot.count,
male.value.per.spot = spotvalue.male / spot.count,
allpersons.value.per.spot = spotvalue.allpersons / spot.count),
quarter == q & R.DMA.NAMES == marketname & daypart == dp) }
这大大缩短了时间:
> system.time(subset_by_market ("Q32013" , "Albuquerque.Santa.Fe", "late fringe"))
user system elapsed
1.06 0.00 1.09
我在使用dplyr
时遇到的故障是我的数据中名为“time”的列,该列来自包times
的类chron
。我一直收到错误Error: column 'local.time' has unsupported type
。我无法确定这方面的确切工作,所以我只是使用POSIXct
将其转换为df$time <- as.POSIXct(as.character(df$time, format = "%H:%M:%S"))
类。这不是最佳的,因为我使用times
将其转换为chron
的原因是为了保持时间年表而不需要日期或时区。更多相关内容:work around to specifying date range in ifelse statement in R。但是,它解决了眼前的问题。