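As a relatively inexperienced user of the data.table package in R, I have been trying to process one text column into a large number of indicator columns (dummy variables), with a 1 in each column indicating that a particular substring was found within the string column. For example, I want to process this: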
ID String
1 a$b
2 b$c
3 c
Into this:
ID String a b c
1 a$b 1 1 0
2 b$c 0 1 1
3 c 0 0 1
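I have figured out how to do the processing, but it takes longer to run than I would like, and I suspect that my code is inefficient. A reproducible version of my code with dummy data is below. Note that in the real data there are over 2000 substrings to search for, each substring is roughly 30 characters long, and there may be up to a few million rows. If need be, I can parallelize and throw lots of resources at the problem, but I want to optimize the code as much as possible first. I have tried running Rprof, which suggested no obvious (to me) improvements.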
library(data.table)
set.seed(10)
elements_list <- c(outer(letters, letters, FUN = paste, sep = ""))
random_string <- function(min_length, max_length, separator) {
selection <- paste(sample(elements_list, ceiling(runif(1, min_length, max_length))), collapse = separator)
return(selection)
}
dt <- data.table(id = c(1:1000), messy_string = "")
dt[ , messy_string := random_string(2, 5, "$"), by = id]
create_indicators <- function(search_list, searched_string) {
y <- rep(0, length(search_list))
for(j in 1:length(search_list)) {
x <- regexpr(search_list[j], searched_string)
x <- x[1]
y[j] <- ifelse(x > 0, 1, 0)
}
return(y)
}
timer <- proc.time()
indicators <- matrix(0, nrow = nrow(dt), ncol = length(elements_list))
for(n in 1:nrow(dt)) {
indicators[n, ] <- dt[n, create_indicators(elements_list, messy_string)]
}
indicators <- data.table(indicators)
setnames(indicators, elements_list)
dt <- cbind(dt, indicators)
proc.time() - timer
user system elapsed
13.17 0.08 13.29
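For orientation, here is a minimal sketch of one way to remove the per-row loop while keeping the same substring-search semantics: loop over the patterns only and let `grepl()` work on the whole column at once. It uses the three-row toy data from the top of the question; the names `elements`, `strings` and `indicators` are illustrative and not part of the original code. Note that literal substring matching reports a false hit whenever one element is a substring of another, which the split-based answers below avoid.

```r
# Toy inputs from the question; `elements` and `strings` are illustrative names
elements <- c("a", "b", "c")
strings  <- c("a$b", "b$c", "c")

# One vectorised grepl() call per pattern, covering all rows at once.
# fixed = TRUE treats the pattern as a literal string rather than a regex.
indicators <- vapply(
  elements,
  function(p) as.integer(grepl(p, strings, fixed = TRUE)),
  integer(length(strings))
)

# Integer matrix with one row per string and one column per element
indicators
```

This still does one pass over the data per pattern, so with thousands of patterns the split-based approaches in the answers are likely to scale better.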
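EDIT
Thanks for the responses - all of them are vastly superior to my approach. Below are the results of some speed tests, with each function slightly modified to use 0L and 1L as in my own code, to store the results in separate tables by method, and to standardize the ordering. These are elapsed times from single speed tests (rather than medians from many tests), because the larger tests each take a long time to run.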
Number of rows in dt     2K    10K    50K    250K     1M
OP                     28.6  149.2  717.0
eddi                    5.1   24.6  144.8  1950.3
RS                      1.8    6.7   29.7   171.9  702.5
Original GT             1.4    7.4   57.5   809.4
Modified GT             0.7    3.9   18.1   115.2  473.9
GT4                     0.1    0.4   2.26    16.9   86.9
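Clearly, the modified version of GeekTrader's approach is the best. I am still a bit vague on what each step is doing, but I can go over that at my leisure. Although somewhat outside the bounds of the original question, if anyone wants to explain what GeekTrader's and Ricardo Saporta's methods are doing more efficiently, it would be appreciated by me and by anyone else who visits this page. I am particularly interested in understanding why some methods scale better than others.
***** EDIT #2 *****
I tried to edit GeekTrader's answer with this comment, but that does not seem to work. I made two very minor modifications to the GT3 function: a) ordering the columns, which adds a small amount of time, and b) replacing 0 and 1 with 0L and 1L, which speeds things up. Call the resulting function GT4. The table above has been edited to add times for GT4 at the different table sizes. It is the clear winner by a mile, and it has the added advantage of being intuitive.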
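Answer 0 (score: 5)
UPDATE: Version 3
Found a faster way. This function is also highly memory efficient.
The main reason the previous functions were slow was the copying/assignment happening inside the lapply loop and the rbinding of the results.
In the following version, we preallocate a matrix of the appropriate size and then change the values at the appropriate coordinates, which makes it very fast compared to the other looping versions.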
funcGT3 <- function() {
#Get list of column names in result
resCol <- unique(dt[, unlist(strsplit(messy_string, split="\\$"))])
#Get dimension of result
nresCol <- length(resCol)
nresRow <- nrow(dt)
#Create empty matrix with dimensions same as desired result
mat <- matrix(rep(0, nresRow * nresCol), nrow = nresRow, dimnames = list(as.character(1:nresRow), resCol))
#split each messy_string by $
ll <- strsplit(dt[,messy_string], split="\\$")
#Get coordinates of mat which we need to set to 1
coords <- do.call(rbind, lapply(1:length(ll), function(i) cbind(rep(i, length(ll[[i]])), ll[[i]] )))
#Set mat to 1 at appropriate coordinates
mat[coords] <- 1
#Bind the mat to original data.table
return(cbind(dt, mat))
}
result <- funcGT3() #result for 1000 rows in dt
result
ID messy_string zn tc sv db yx st ze qs wq oe cv ut is kh kk im le qg rq po wd kc un ft ye if zl zt wy et rg iu
1: 1 zn$tc$sv$db$yx 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
2: 2 st$ze$qs$wq 0 0 0 0 0 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
3: 3 oe$cv$ut$is 0 0 0 0 0 0 0 0 0 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
4: 4 kh$kk$im$le$qg 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
5: 5 rq$po$wd$kc 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 0 0 0 0 0 0 0 0 0 0
---
996: 996 rp$cr$tb$sa 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
997: 997 cz$wy$rj$he 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0
998: 998 cl$rr$bm 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
999: 999 sx$hq$zy$zd 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
1000: 1000 bw$cw$pw$rq 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0
Benchmark against version 2, as suggested by Ricardo (this is for 250K rows in the data):
Unit: seconds
expr min lq median uq max neval
GT2 104.68672 104.68672 104.68672 104.68672 104.68672 1
GT3 15.15321 15.15321 15.15321 15.15321 15.15321 1
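The preallocate-and-index idea can be sketched on the three-row example from the question. This is only a guess at what the asker's GT4 variant looks like (sorted columns, integer 0L/1L), since GT4 is described in words but not shown; it also swaps the character coordinates of funcGT3 for integer row/column pairs built with `rep()` and `match()`, which is an assumption of this sketch and not the original code.

```r
library(data.table)

# Toy data from the question
dt <- data.table(ID = 1:3, String = c("a$b", "b$c", "c"))

# Split every string once; fixed = TRUE treats "$" as a literal separator
ll <- strsplit(dt$String, split = "$", fixed = TRUE)

# Sorted column names and a preallocated integer matrix of zeros
resCol <- sort(unique(unlist(ll)))
mat <- matrix(0L, nrow = nrow(dt), ncol = length(resCol),
              dimnames = list(NULL, resCol))

# One (row, column) index pair per substring occurrence
rows <- rep(seq_along(ll), lengths(ll))
cols <- match(unlist(ll), resCol)

# Set all matching cells to 1L in a single vectorised assignment
mat[cbind(rows, cols)] <- 1L

result <- cbind(dt, as.data.table(mat))
result
```

The single matrix assignment replaces any per-row loop, and the whole result is allocated exactly once, which is the point the answer makes about copying inside loops.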
Version 1
The following is version 1 of the suggested answer.
set.seed(10)
elements_list <- c(outer(letters, letters, FUN = paste, sep = ""))
random_string <- function(min_length, max_length, separator) {
selection <- paste(sample(elements_list, ceiling(runif(1, min_length, max_length))), collapse = separator)
return(selection)
}
dt <- data.table(ID = c(1:1000), messy_string = "")
dt[ , messy_string := random_string(2, 5, "$"), by = ID]
myFunc <- function() {
ll <- strsplit(dt[,messy_string], split="\\$")
COLS <- do.call(rbind,
lapply(1:length(ll),
function(i) {
data.frame(
ID= rep(i, length(ll[[i]])),
COL = ll[[i]],
VAL= rep(1, length(ll[[i]]))
)
}
)
)
res <- as.data.table(tapply(COLS$VAL, list(COLS$ID, COLS$COL), FUN = length ))
dt <- cbind(dt, res)
for (j in names(dt))
set(dt,which(is.na(dt[[j]])),j,0)
return(dt)
}
create_indicators <- function(search_list, searched_string) {
y <- rep(0, length(search_list))
for(j in 1:length(search_list)) {
x <- regexpr(search_list[j], searched_string)
x <- x[1]
y[j] <- ifelse(x > 0, 1, 0)
}
return(y)
}
OPFunc <- function() {
indicators <- matrix(0, nrow = nrow(dt), ncol = length(elements_list))
for(n in 1:nrow(dt)) {
indicators[n, ] <- dt[n, create_indicators(elements_list, messy_string)]
}
indicators <- data.table(indicators)
setnames(indicators, elements_list)
dt <- cbind(dt, indicators)
return(dt)
}
library(plyr)
plyrFunc <- function() {
indicators = do.call(rbind.fill, sapply(1:dim(dt)[1], function(i)
dt[i,
data.frame(t(as.matrix(table(strsplit(messy_string,
split = "\\$")))))
]))
dt = cbind(dt, indicators)
#dt[is.na(dt)] = 0 #THIS DOESN'T WORK. USING FOLLOWING INSTEAD
for (j in names(dt))
set(dt,which(is.na(dt[[j]])),j,0)
return(dt)
}
BENCHMARK
system.time(res <- myFunc())
## user system elapsed
## 1.01 0.00 1.01
system.time(res2 <- OPFunc())
## user system elapsed
## 21.58 0.00 21.61
system.time(res3 <- plyrFunc())
## user system elapsed
## 1.81 0.00 1.81
Version 2: as suggested by Ricardo
I am posting this here instead of in my own answer, since the framework is really @GeekTrader's. -Rick_
myFunc.modified <- function() {
ll <- strsplit(dt[,messy_string], split="\\$")
## MODIFICATIONS:
# using `rbindlist` instead of `do.call(rbind.. )`
COLS <- rbindlist( lapply(1:length(ll),
function(i) {
data.frame(
ID= rep(i, length(ll[[i]])),
COL = ll[[i]],
VAL= rep(1, length(ll[[i]])),
# MODIFICATION: Not coercing to factors
stringsAsFactors = FALSE
)
}
)
)
# MODIFICATION: Preserve as matrix, the output of tapply
res2 <- tapply(COLS$VAL, list(COLS$ID, COLS$COL), FUN = length )
# FLATTEN into a data.table
resdt <- data.table(r=c(res2))
# FIND & REPLACE NA's of single column
resdt[is.na(r), r:=0L]
# cbind with dt, a matrix, with the same attributes as `res2`
cbind(dt,
matrix(resdt[[1]], ncol=ncol(res2), byrow=FALSE, dimnames=dimnames(res2)))
}
### Benchmarks:
orig = quote({dt <- copy(masterDT); myFunc()})
modified = quote({dt <- copy(masterDT); myFunc.modified()})
microbenchmark(Modified = eval(modified), Orig = eval(orig), times=20L)
# Unit: milliseconds
# expr min lq median uq max
# 1 Modified 895.025 971.0117 1011.216 1189.599 2476.972
# 2 Orig 1953.638 2009.1838 2106.412 2230.326 2356.802
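To make the difference between the two versions concrete, the sketch below builds the same long (ID, COL) table three ways on toy data: with `do.call(rbind, ...)` as in myFunc, with `rbindlist()` as in myFunc.modified, and directly from `rep()` and `unlist()` with no per-row data.frames at all. The third variant is an illustration added here, not something taken from either function.

```r
library(data.table)

# Toy strings, split on the literal "$" separator
ll <- strsplit(c("a$b", "b$c", "c"), split = "$", fixed = TRUE)

# One small data.frame per input row, as in myFunc / myFunc.modified
pieces <- lapply(seq_along(ll), function(i)
  data.frame(ID = rep(i, length(ll[[i]])), COL = ll[[i]],
             stringsAsFactors = FALSE))

long_rbind     <- do.call(rbind, pieces)  # base R row-binding
long_rbindlist <- rbindlist(pieces)       # data.table row-binding

# The same long table without creating any intermediate data.frames
long_direct <- data.table(ID  = rep(seq_along(ll), lengths(ll)),
                          COL = unlist(ll))
```

All three hold the same rows; they differ only in how many intermediate objects are created along the way, which is where the benchmarks above show the time going.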
Answer 1 (score: 4)
# split the `messy_string` and create a long table, keeping track of the id
DT2 <- setkey(DT[, list(val=unlist(strsplit(messy_string, "\\$"))), by=list(ID, messy_string)], "val")
# add the columns, initialize to 0
DT2[, c(elements_list) := 0L]
# warning expected, re: adding a large amount of columns
# iterate over each value in elements_list, assigning 1's as appropriate
for (el in elements_list)
DT2[el, c(el) := 1L]
# sum by ID
DT2[, lapply(.SD, sum), by=list(ID, messy_string), .SDcols=elements_list]
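Note that we are carrying along the messy_string column, since that is cheaper than leaving it behind and then using a join on ID to get it back.
If you do not need it in the final output, simply drop it.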
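As a point of comparison, the same long table can also be reshaped to wide form with data.table's `dcast()`. This is a different technique from the keyed-join loop above, shown only as a sketch on toy data; the names `DT`, `long` and `wide` are illustrative.

```r
library(data.table)

# Toy data using the column names from this answer
DT <- data.table(ID = 1:3, messy_string = c("a$b", "b$c", "c"))

# Long table: one row per (ID, substring), carrying messy_string along
long <- DT[, list(val = unlist(strsplit(messy_string, "$", fixed = TRUE))),
           by = list(ID, messy_string)]

# Wide table: count occurrences of each substring per row.
# Cells with no matching rows are filled by calling length() on an
# empty vector, which gives 0.
wide <- dcast(long, ID + messy_string ~ val,
              value.var = "val", fun.aggregate = length)
wide
```

Whether this is faster than the loop over elements_list at millions of rows is not something the benchmarks in this thread cover, so it would need to be timed on the real data.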
Creating the sample data:
# sample data, using OP's example
set.seed(10)
N <- 1e6 # number of rows
elements_list <- c(outer(letters, letters, FUN = paste, sep = ""))
messy_string_vec <- random_string_fast(N, 2, 5, "$") # Create the messy strings in a single shot.
masterDT <- data.table(ID = c(1:N), messy_string = messy_string_vec, key="ID") # create the data.table
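Side note: it is much faster to create the random strings all at once and assign the result as a single column than to call the function N times and assign each result one by one.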
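# Faster way to create the `messy_string`s
# (define this function before running the sample-data code above)
random_string_fast <- function(N, min_length, max_length, separator) {
ints <- seq(from=min_length, to=max_length)
# draw a single length per string; `size` must be a single number
replicate(N, paste(sample(elements_list, sample(ints, 1)), collapse=separator))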
}
Comparing the four methods:
Here is the setup:
library(data.table); library(plyr); library(microbenchmark)
# data.table method - RS
usingDT.RS <- quote({DT <- copy(masterDT);
DT2 <- setkey(DT[, list(val=unlist(strsplit(messy_string, "\\$"))), by=list(ID, messy_string)], "val"); DT2[, c(elements_list) := 0L]
for (el in elements_list) DT2[el, c(el) := 1L]; DT2[, lapply(.SD, sum), by=list(ID, messy_string), .SDcols=elements_list]})
# data.table method - GeekTrader
usingDT.GT <- quote({dt <- copy(masterDT); myFunc()})
# data.table method - GeekTrader, modified by RS
usingDT.GT_Mod <- quote({dt <- copy(masterDT); myFunc.modified()})
# plyr method from below
usingPlyr.eddi <- quote({dt <- copy(masterDT); indicators = do.call(rbind.fill, sapply(1:dim(dt)[1], function(i) dt[i, data.frame(t(as.matrix(table(strsplit(messy_string, split = "\\$"))))) ]));
dt = cbind(dt, indicators); dt[is.na(dt)] = 0; dt })
Here are the benchmark results:
microbenchmark( usingDT.RS=eval(usingDT.RS), usingDT.GT=eval(usingDT.GT), usingDT.GT_Mod=eval(usingDT.GT_Mod), usingPlyr.eddi=eval(usingPlyr.eddi), times=5L)
On smaller data:
N = 600
Unit: milliseconds
expr min lq median uq max
1 usingDT.GT 1189.7549 1198.1481 1200.6731 1202.0972 1203.3683
2 usingDT.GT_Mod 581.7003 591.5219 625.7251 630.8144 650.6701
3 usingDT.RS 2586.0074 2602.7917 2637.5281 2819.9589 3517.4654
4 usingPlyr.eddi 2072.4093 2127.4891 2225.5588 2242.8481 2349.6086
N = 1,000
Unit: seconds
expr min lq median uq max
1 usingDT.GT 1.941012 2.053190 2.196100 2.472543 3.096096
2 usingDT.RS 3.107938 3.344764 3.903529 4.010292 4.724700
3 usingPlyr 3.297803 3.435105 3.625319 3.812862 4.118307
N = 2,500
Unit: seconds
expr min lq median uq max
1 usingDT.GT 4.711010 5.210061 5.291999 5.307689 7.118794
2 usingDT.GT_Mod 2.037558 2.092953 2.608662 2.638984 3.616596
3 usingDT.RS 5.253509 5.334890 6.474915 6.740323 7.275444
4 usingPlyr.eddi 7.842623 8.612201 9.142636 9.420615 11.102888
N = 5,000
expr min lq median uq max
1 usingDT.GT 8.900226 9.058337 9.233387 9.622531 10.839409
2 usingDT.GT_Mod 4.112934 4.293426 4.460745 4.584133 6.128176
3 usingDT.RS 8.076821 8.097081 8.404799 8.800878 9.580892
4 usingPlyr.eddi 13.260828 14.297614 14.523016 14.657193 16.698229
# dropping the slower two from the tests:
microbenchmark( usingDT.RS=eval(usingDT.RS), usingDT.GT=eval(usingDT.GT), usingDT.GT_Mod=eval(usingDT.GT_Mod), times=6L)
N = 10,000
Unit: seconds
expr min lq median uq max
1 usingDT.GT_Mod 8.426744 8.739659 8.750604 9.118382 9.848153
2 usingDT.RS 15.260702 15.564495 15.742855 16.024293 16.249556
N = 25,000
... (still running)
Functions used in the benchmarks:
# original random string function
random_string <- function(min_length, max_length, separator) {
selection <- paste(sample(elements_list, ceiling(runif(1, min_length, max_length))), collapse = separator)
return(selection)
}
# GeekTrader's function
myFunc <- function() {
ll <- strsplit(dt[,messy_string], split="\\$")
COLS <- do.call(rbind,
lapply(1:length(ll),
function(i) {
data.frame(
ID= rep(i, length(ll[[i]])),
COL = ll[[i]],
VAL= rep(1, length(ll[[i]]))
)
}
)
)
res <- as.data.table(tapply(COLS$VAL, list(COLS$ID, COLS$COL), FUN = length ))
dt <- cbind(dt, res)
for (j in names(dt))
set(dt,which(is.na(dt[[j]])),j,0)
return(dt)
}
# Improvements to @GeekTrader's `myFunc` -RS '
myFunc.modified <- function() {
ll <- strsplit(dt[,messy_string], split="\\$")
## MODIFICATIONS:
# using `rbindlist` instead of `do.call(rbind.. )`
COLS <- rbindlist( lapply(1:length(ll),
function(i) {
data.frame(
ID= rep(i, length(ll[[i]])),
COL = ll[[i]],
VAL= rep(1, length(ll[[i]])),
# MODIFICATION: Not coercing to factors
stringsAsFactors = FALSE
)
}
)
)
# MODIFICATION: Preserve as matrix, the output of tapply
res2 <- tapply(COLS$VAL, list(COLS$ID, COLS$COL), FUN = length )
# FLATTEN into a data.table
resdt <- data.table(r=c(res2))
# FIND & REPLACE NA's of single column
resdt[is.na(r), r:=0L]
# cbind with dt, a matrix, with the same attributes as `res2`
cbind(dt,
matrix(resdt[[1]], ncol=ncol(res2), byrow=FALSE, dimnames=dimnames(res2)))
}
### Benchmarks comparing the two versions of GeekTrader's function:
orig = quote({dt <- copy(masterDT); myFunc()})
modified = quote({dt <- copy(masterDT); myFunc.modified()})
microbenchmark(Modified = eval(modified), Orig = eval(orig), times=20L)
# Unit: milliseconds
# expr min lq median uq max
# 1 Modified 895.025 971.0117 1011.216 1189.599 2476.972
# 2 Orig 1953.638 2009.1838 2106.412 2230.326 2356.802
Answer 2 (score: 3)
Here is a more recent approach, using cSplit_e() from the splitstackshape package.
library(splitstackshape)
cSplit_e(dt, split.col = "String", sep = "$", type = "character",
mode = "binary", fixed = TRUE, fill = 0)
# ID String String_a String_b String_c
#1 1 a$b 1 1 0
#2 2 b$c 0 1 1
#3 3 c 0 0 1
Answer 3 (score: 2)
Here is a version that is about 10x faster, using rbind.fill.
library(plyr)
indicators = do.call(rbind.fill, sapply(1:dim(dt)[1], function(i)
dt[i,
data.frame(t(as.matrix(table(strsplit(messy_string,
split = "\\$")))))
]))
dt = cbind(dt, indicators)
# dt[is.na(dt)] = 0
# faster NA replace (thanks geektrader)
for (j in names(dt))
set(dt, which(is.na(dt[[j]])), j, 0L)
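Answer 4 (score: 2)
Here is an approach using rapply and table.
I am sure there is something slightly faster than using table here, but it is still slightly faster than @ricardo's myFunc.modified answer.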
# a copy with enough column pointers available
dtr <- alloc.col(copy(dt) ,1000L)
rapplyFun <- function(){
ll <- strsplit(dtr[, messy_string], '\\$')
Vals <- rapply(ll, classes = 'character', f= table, how = 'replace')
Names <- unique(rapply(Vals, names))
dtr[, (Names) := 0L]
for(ii in seq_along(Vals)){
for(jj in names(Vals[[ii]])){
set(dtr, i = ii, j = jj, value =Vals[[ii]][jj])
}
}
}
microbenchmark(myFunc.modified(), rapplyFun(),times=5)
Unit: milliseconds
# expr min lq median uq max neval
# myFunc.modified() 395.1719 396.8706 399.3218 400.6353 401.1700 5
# rapplyFun() 308.9103 309.5763 309.9368 310.2971 310.3463 5
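Answer 5 (score: 1)
Here is another solution, which constructs a sparse matrix object instead of what you have. This saves a lot of time and memory.
It produces ordered results, and even with the conversion to data.table it is faster than GT3 using 0L and 1L and with no reordering (this could be because I use a different method for arriving at the required coordinates - I did not go through the GT3 algorithm). However, if you do not convert and keep it as a sparse matrix, it is about 10-20x faster than GT3 (and has a much smaller memory footprint).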
library(Matrix)
strings = strsplit(dt$messy_string, split = "$", fixed = TRUE)
element.map = data.table(el = elements_list, n = seq_along(elements_list), key = "el")
tmp = data.table(n = seq_along(strings), each = unlist(lapply(strings, length)))
rows = tmp[, rep(n, each = each), by = n][, V1]
cols = element.map[J(unlist(strings))][,n]
dt.sparse = sparseMatrix(rows, cols, x = 1,
dims = c(max(rows), length(elements_list)))
# optional, should be avoided until absolutely necessary
dt = cbind(dt, as.data.table(as.matrix(dt.sparse)))
setnames(dt, c('id', 'messy_string', elements_list))
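The idea is to split the strings, then use a data.table as a map object to map each substring to its correct column position. From there on it is just a matter of figuring out the rows correctly and filling in the matrix.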
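The row and column bookkeeping can be seen on the three-row example from the question. This sketch uses base R's `match()` as the substring-to-column map in place of the keyed data.table join, which is a simplification made here for brevity and not the approach timed in this answer; `elements`, `rows`, `cols` and `m` are illustrative names.

```r
library(Matrix)

# Toy inputs from the question
strings  <- strsplit(c("a$b", "b$c", "c"), split = "$", fixed = TRUE)
elements <- c("a", "b", "c")

# Row index: repeat each row number once per substring found in that row
rows <- rep(seq_along(strings), lengths(strings))

# Column index: position of each substring in the element list
cols <- match(unlist(strings), elements)

# Only the non-zero cells are stored
m <- sparseMatrix(i = rows, j = cols, x = 1,
                  dims = c(length(strings), length(elements)),
                  dimnames = list(NULL, elements))

# Dense view, for inspection on small data only
as.matrix(m)
```

On the real data the dense conversion in the last line is the step to avoid, for the memory reasons given above.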