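I am trying to convert a subsetting operation on a data.frame to data.table in order to improve the performance of my code, but I am completely unfamiliar with data.table. What is the data.table equivalent of this subsetting statement?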
for(ii in 1:nplayer)
{
subgame<-subset(game, game$playerA == player[ii] | game$playerB == player[ii])
players[ii,4]<-nrow(subgame)
}
I have defined the new data.table gameDT this way:
gameDT<-data.table(game)
setkey(gameDT,playerA,playerB)
Output of dput for the first two rows:
>dput(game[1:2,])
structure(list(country = c("New Zealand", "Australia"), tournament = c("WTA Auckland 2012",
"WTA Brisbane 2012"), date = c("2011-12-31 00:00:00", "2011-12-30 00:15:00"
), playerA = c("Schoofs B.", "Lucic M."), playerB = c("Puig M.",
"Tsurenko L."), resultA = c(1L, 1L), resultB = c(2L, 2L), oddA = c("1.8",
"2.17"), oddB = c("1.9", "1.57"), N = c(4L, 3L), Weight = c(1,
0.973608997871031)), .Names = c("country", "tournament", "date",
"playerA", "playerB", "resultA", "resultB", "oddA", "oddB", "N",
"Weight"), row.names = 1:2, class = "data.frame")
Answer 0 (score: 1)
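If this is not just about learning data.table, I think the example below is similar to what you are trying to do, and you can see a fairly good speedup using lapply: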
set.seed(123)
library(microbenchmark)
# Toy data: 50 games between players drawn from five letters
game = data.frame(runif(50), playerA = sample(letters[1:5], 50, replace = TRUE), playerB = sample(letters[1:5], 50, replace = TRUE))
player <- union(game$playerA, game$playerB)
nplayer <- length(player)
# Column 1 holds the player name, column 2 is overwritten with the game count
players <- matrix(player, nrow = nplayer, ncol = 2)
op <- microbenchmark(
  # Count matching rows directly instead of building a subset first
  LAPPLY = {counts <- lapply(1:nplayer,
    function(i) sum(game$playerA == player[i] | game$playerB == player[i]))
    names(counts) <- player },
  # Original approach from the question: subset, then nrow
  ORIG = {
    for(ii in 1:nplayer)
    {
      subgame<-subset(game, game$playerA == player[ii] | game$playerB == player[ii])
      players[ii,2]<-nrow(subgame)
    }},
  times = 1000)
op
#Unit: microseconds
#   expr     min       lq   median        uq       max neval
# LAPPLY 236.493 251.9985  259.095  269.3205  8323.701  1000
#   ORIG 938.194 981.9060 1002.880 1036.6705 61095.935  1000
unlist(counts)
# a  c  d  b  e
#19 17 20 20 15
players
#     [,1] [,2]
#[1,] "a"  "19"
#[2,] "c"  "17"
#[3,] "d"  "20"
#[4,] "b"  "20"
#[5,] "e"  "15"
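For the data.table form the question asks about, a minimal sketch might look like the following. It is not part of the benchmark above: the toy rows are made up for illustration, and only the column names playerA and playerB come from the question. As far as I know, the key set with setkey(gameDT,playerA,playerB) is not needed for this, because an OR condition across two columns is evaluated as an ordinary vector scan.

```r
library(data.table)

# Hypothetical toy data; only the column names match the question
game <- data.frame(
  playerA = c("a", "b", "a", "c"),
  playerB = c("b", "c", "c", "d"),
  stringsAsFactors = FALSE
)
gameDT <- as.data.table(game)

# Direct equivalent of subset(...) followed by nrow(...) for one player:
# filter in i, count the matching rows with .N in j
p <- "a"
n_a <- gameDT[playerA == p | playerB == p, .N]

# All players at once, without a loop: stack both columns and count per player.
# Assumption: a player never appears as both playerA and playerB in the same row,
# otherwise that row would be counted twice.
counts <- data.table(player = c(gameDT$playerA, gameDT$playerB))[
  , .(games = .N), by = player]
```

The first form mirrors the loop body one player at a time; the second avoids the loop entirely, which is probably where most of the gain would come from on a large table.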