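Given a data frame with two columns:
I need to look up the length for every index listed in the second column and put the results in a third column. In the example, looking up the length at index 1637 returns 1835: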
> df$length[1637]
[1] 1835
head(df)
  length findLengthOf
1   6434 1637,386....
2   4272 4322,414....
3   7338 2052,639....
4   4932 190,1567....
5   2397 8963,844....
6   4405 103,4346....
head(df)
  length findLengthOf           result
1   6434 1637,386.... 1835, 2404, 4689
2   4272 4322,414.... 1184, 2721, 7215
3   7338 2052,639.... 5253, 2998, 6153
4   4932 190,1567.... 2931, 6496, 7784
5   2397 8963,844.... 3796, 3488, 6555
6   4405 103,4346.... 1662, 5481, 1244
set.seed(123)
df <- data.frame(length = sample(1e4),
findLengthOf = I(replicate(1e4, paste(sample(1:10000,1),sample(1:10000,1),sample(1:10000,1),sep=","), simplify = FALSE)))
df$result=lapply(lapply(df$findLengthOf,strsplit,split=","), function(x){df[x[[1]],"length"]})
The code works, but it takes a long time. How can I speed it up? Also, why does
head(lapply(df$findLengthOf,strsplit,split=","))
always return this strange list of lists:
[[1]]
[[1]][[1]]
[1] "7744" "1346" "4626"
Is there a way to avoid these double brackets? Any reply is much appreciated!
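The nesting can be reproduced on a small input. strsplit is already vectorized and returns a list with one element per input string, so calling it once per element through lapply wraps each result in a further list. The sketch below uses a made-up two-element list `x` (not the question's data) to show both shapes.

```r
# Small illustrative input (hypothetical values, not the question's data)
x <- list("7744,1346,4626", "12,34,56")

# strsplit() on a single string still returns a list of length 1,
# so lapply() over the elements yields a list of lists
nested <- lapply(x, strsplit, split = ",", fixed = TRUE)
str(nested[[1]])

# Calling strsplit() once on a character vector gives a flat list
# with one character vector per input string
flat <- strsplit(unlist(x), ",", fixed = TRUE)
str(flat[[1]])
```

With the flat form, each element can be used directly, without the extra `[[1]]`.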
Following David's suggestion (set fixed = T):
> ptm <- proc.time()
> df$result=lapply(lapply(df$findLengthOf,strsplit,split=",",fixed=T), function(x){df[x[[1]],"length"]})
> proc.time() - ptm
user system elapsed
17.220 0.000 17.147
> ptm <- proc.time()
> df$result=lapply(lapply(df$findLengthOf,strsplit,split=","), function(x){df[x[[1]],"length"]})
> proc.time() - ptm
user system elapsed
17.260 0.000 17.142
Answer 0 (score: 1)
This is a fully vectorized solution, though it may be memory-expensive. I have not benchmarked it.
library(data.table)
res <- matrix(df$length[unlist(setDT(df)[,
tstrsplit(findLengthOf, ",", fixed = TRUE, type.convert = TRUE)])],
nrow = nrow(df))
df$result <- as.list(as.data.frame(t(res)))
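For comparison, here is a minimal base R sketch that needs no extra package. It assumes the slow part of the original code is likely the repeated subsetting of the data frame by character row names; the timings in the question, where fixed = T made no difference, suggest the splitting itself is not the bottleneck. The name `parts` is introduced only for illustration, and this version has not been benchmarked here either.

```r
# Same reproducible data as in the question
set.seed(123)
df <- data.frame(length = sample(1e4),
                 findLengthOf = I(replicate(1e4, paste(sample(1:10000,1),sample(1:10000,1),sample(1:10000,1),sep=","), simplify = FALSE)))

# Split every string in one call; the result is a flat list
# holding one character vector per row
parts <- strsplit(unlist(df$findLengthOf), ",", fixed = TRUE)

# Convert to integer so the lookup is positional on the plain
# length vector rather than a row-name match on the data frame
df$result <- lapply(parts, function(p) df$length[as.integer(p)])

head(df$result, 3)
```

The result is a list column with one integer vector per row, the same shape the original code produces.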