使用R时,我始终牢记:“尽可能避免使用循环”。但是,我现在陷入困境,我无法找到一种奇妙的方式来编码我需要的东西。
为了记录,经过几次评论,我上面的陈述不是正确的陈述,没有必要避免循环来提高效率。
我有两个字符串向量作为输入,让我们称之为a
和b
- 它们只能包含字母"M"
,"I"
和{{1} }。
"D"
我想要的输出是:
a = c("M","I","D","D","M","M","M","M","M","M")
b = c("M","M","M","M","M","M","D","M","M")
以下函数为我提供了这样的输出:
d = c("M","I","D","D","M","M","M","M","I","M","M")
函数逻辑如下,只要my.function <- function(a, b)
{
nrow.df = length(a) + length(which(b=="D"))
my.df = data.frame(a = rep(NA, nrow.df),
b = rep(NA, nrow.df),
d = rep(NA, nrow.df))
my.df$a[1:length(a)] = a
my.df$b[1:length(b)] = b
for (i in 1:nrow.df)
{
if(my.df$a[i] == "D") {
my.df$d[i] = "D"
my.df$b[(i+1):nrow.df] = my.df$b[i:(nrow.df-1)]
} else if (my.df$b[i] == "D") {
my.df$d[i] = "I"
my.df$a[(i+1):nrow.df] = my.df$a[i:(nrow.df-1)]
} else if (my.df$a[i] == "I") {
my.df$d[i] = "I"
} else if (my.df$b[i] == "I") {
my.df$d[i] = "D"
} else {
my.df$d[i] = my.df$a[i]
}
}
return(my.df$d)
}
> d = my.function(a,b)
> d
[1] "M" "I" "D" "D" "M" "M" "M" "M" "I" "M" "M"
中有"D"
,它就会a
放入"D"
并将向量d
移位1,反之亦然,只要b
中有"D"
,它就会b
放入"I"
并将d
换成1.
接下来,如果a
中有"I"
,而a
中只有"D"
,则b
放入"I"
,反之亦然,只要a
中有"I"
而b
中不是"D"
,a
就会"D"
。否则,d
。
这不是一个复杂的功能,但我正在努力使其高效率。我使用mclapply将这个函数应用了数百万次,因此快速实现这样的函数可以节省很多时间。
您推荐使用Rcpp吗?会更快吗?与Cpp数百万时间沟通R是否有任何减速,或者它只是与Rcpp一起自动?
答案 0 :(得分:4)
我没有看到任何简单的方法来避免循环。但是,仍然有一种更有效的方法。问题是,每次遇到角色a
时,您实际上正在移动b
和D
,并且移动这样的向量是O(n)
操作,因此正在运行这个循环的时间实际上是O(n^2)
。
您可以简化代码并获得更好的性能,如下所示:
f<-function(a,b){
aSkipped<-0
bSkipped<-0
d<-rep(0,length(a)+sum(b=="D"))
for(i in 1:length(d)){
if(a[i-aSkipped] == "D") {
d[i] = "D"
bSkipped<-bSkipped+1
} else if (b[i-bSkipped] == "D") {
d[i] = "I"
aSkipped<-aSkipped+1
} else if (a[i-aSkipped] == "I") {
d[i] = "I"
} else if (b[i-bSkipped] == "I") {
d[i] = "D"
} else {
d[i] = a[i-aSkipped]
}
}
d
}
编辑。当输入变大时,您将真正看到大的性能改进。对于小字符串,而不是太多的“D”这个和Ananda Mahto的解决方案大约在同一时间运行:
> set.seed(123)
> a<-c(sample(c("M","I"),500,T))
> b<-c(sample(c("M","I"),500,T))
> a[sample(500,50)]<-"D"
> b[sample(500,50)]<-"D"
> microbenchmark(f(a,b),my.function.v(a,b))
Unit: milliseconds
expr min lq median uq max neval
f(a, b) 4.259970 4.324046 4.368018 4.463925 9.694951 100
my.function.v(a, b) 4.442873 4.497172 4.533196 4.639543 9.901044 100
但对于长度为50000且长度为5000“D”的字符串,差异很大:
> set.seed(123)
> a<-c(sample(c("M","I"),50000,T))
> b<-c(sample(c("M","I"),50000,T))
> a[sample(50000,5000)]<-"D"
> b[sample(50000,5000)]<-"D"
> system.time(f(a,b))
user system elapsed
0.460 0.000 0.463
> system.time(my.function.v(a,b))
user system elapsed
7.056 0.008 7.077
答案 1 :(得分:4)
根据我的评论,如果速度是一个问题,第1步是不要不必要地使用data.frame
。这个答案没有解决循环问题(正如其他人已经说过的那样,如果正确完成,在R中使用循环没有任何问题。)
以下是您的函数的 非常轻微 修改版本,使用vector
而不是data.frame
来存储数据。
my.function.v <- function(a, b) {
nrow.df = length(a) + length(which(b=="D"))
A <- B <- D <- vector(length = nrow.df)
A[1:length(a)] = a
B[1:length(b)] = b
for (i in 1:nrow.df)
{
if(A[i] == "D") {
D[i] = "D"
B[(i+1):nrow.df] = B[i:(nrow.df-1)]
} else if (B[i] == "D") {
D[i] = "I"
A[(i+1):nrow.df] = A[i:(nrow.df-1)]
} else if (A[i] == "I") {
D[i] = "I"
} else if (B[i] == "I") {
D[i] = "D"
} else {
D[i] = A[i]
}
}
return(D)
}
请注意以下速度的相对差异:
library(microbenchmark)
microbenchmark(my.function(a, b), my.function.v(a, b), f(a, b))
# Unit: microseconds
# expr min lq median uq max neval
# my.function(a, b) 1448.416 1490.8780 1511.3435 1547.3880 6674.332 100
# my.function.v(a, b) 157.248 165.8725 171.6475 179.1865 324.722 100
# f(a, b) 168.874 177.5455 184.8775 193.3455 416.551 100
可以看出,@ mrip的功能也比原来的功能好得多。
答案 2 :(得分:2)
好的,这是Rcpp解决方案,正如预期的那样,它大大超过了R解决方案:
rcppFun<-"
CharacterVector fcpp(CharacterVector a,CharacterVector b,int size){
int aSkipped = 0;
int bSkipped = 0;
int asize = a.size();
Rcpp::CharacterVector d(size);
for(int i=0; i<size; i++){
if(i-aSkipped<asize && a[i-aSkipped][0] == 'D') {
d[i] = \"D\";
bSkipped++;
} else if (b[i-bSkipped][0] == 'D') {
d[i] = \"I\";
aSkipped++;
} else if (a[i-aSkipped][0] == 'I') {
d[i] = \"I\";
} else if (b[i-bSkipped][0] == 'I') {
d[i] = \"D\";
} else {
d[i] = a[i-aSkipped];
}
}
return d;
}"
require("Rcpp")
fcpp<-cppFunction(rcppFun)
f3<-function(a,b){
fcpp(a,b,as.integer(length(a)+sum(b=="D")))
}
警告:该功能根本不进行参数检查,因此如果您输入错误数据,则很容易出现seg错误。
如果你要打电话给我,Rcpp肯定是要走的路:
> with(ab(10),microbenchmark(f(a,b),f3(a,b),f2(a,b),my.function.v(a,b)))
Unit: microseconds
expr min lq median uq max neval
f(a, b) 103.993 107.5155 108.6815 109.7455 178.801 100
f3(a, b) 7.354 8.1305 8.5575 9.1220 18.014 100
f2(a, b) 87.081 90.4150 92.2730 94.2585 146.502 100
my.function.v(a, b) 84.389 86.5140 87.6090 88.8340 109.106 100
> with(ab(100),microbenchmark(f(a,b),f3(a,b),f2(a,b),my.function.v(a,b)))
Unit: microseconds
expr min lq median uq max neval
f(a, b) 992.082 1018.9850 1032.0180 1071.0690 2784.710 100
f3(a, b) 12.873 14.3605 14.7370 15.5095 35.582 100
f2(a, b) 119.396 125.4405 129.3015 134.9915 1909.930 100
my.function.v(a, b) 769.618 786.7865 802.2920 824.0820 905.737 100
> with(ab(1000),microbenchmark(f(a,b),f3(a,b),f2(a,b),my.function.v(a,b)))
Unit: microseconds
expr min lq median uq max neval
f(a, b) 9816.295 10065.065 10233.1350 10392.696 12383.373 100
f3(a, b) 66.057 67.869 83.9075 87.231 1167.086 100
f2(a, b) 1637.972 1760.258 2667.6985 3138.229 47610.317 100
my.function.v(a, b) 9692.885 10272.425 10997.2595 11402.602 54315.922 100
> with(ab(10000),microbenchmark(f(a,b),f3(a,b),f2(a,b)))
Unit: microseconds
expr min lq median uq max neval
f(a, b) 101644.922 103311.678 105185.5955 108342.4960 144620.777 100
f3(a, b) 607.702 610.039 669.8515 678.1845 785.415 100
f2(a, b) 221305.641 247952.345 254478.1580 341195.5510 656408.378 100
>
答案 3 :(得分:1)
只是为了展示它是如何完成的,它可以在R中没有循环的情况下完成;这是一种方式。当长度大约为1000或更小时更快但在更大时更慢。一个要点是,你肯定可以在Rcpp中加快速度。
f2 <- function(a,b) {
da <- which(a=="D")
db <- which(b=="D")
dif <- outer(da, db, `<`)
da <- da + rowSums(!dif)
db <- db + colSums(dif)
ia <- which(a=="I")
ia <- ia + colSums(outer(db, ia, `<`))
ib <- which(b=="I")
ib <- ib + colSums(outer(da, ib, `<`))
out <- rep("M", length(a) + length(db))
out[da] <- "D"
out[db] <- "I"
out[ia] <- "I"
out[ib] <- "D"
out
}
用于生成数据
ab <- function(N) {
set.seed(123)
a<-c(sample(c("M","I"),N,TRUE))
b<-c(sample(c("M","I"),N,TRUE))
a[sample(N,N/10)]<-"D"
b[sample(N,N/10)]<-"D"
list(a=a,b=b)
}
时序:
> library(microbenchmark)
> with(ab(10), microbenchmark(my.function.v(a, b), f(a, b), f2(a,b)))
Unit: microseconds
expr min lq median uq max neval
my.function.v(a, b) 79.102 86.9005 89.3680 93.2410 279.761 100
f(a, b) 84.334 91.1055 94.1790 98.2645 215.579 100
f2(a, b) 94.807 101.5405 105.1625 108.9745 226.149 100
> with(ab(100), microbenchmark(my.function.v(a, b), f(a, b), f2(a,b)))
Unit: microseconds
expr min lq median uq max neval
my.function.v(a, b) 732.849 750.4480 762.906 845.0835 1953.371 100
f(a, b) 789.380 805.8905 819.022 902.5865 1921.064 100
f2(a, b) 124.442 129.1450 134.543 137.5910 237.498 100
> with(ab(1000), microbenchmark(my.function.v(a, b), f(a, b), f2(a,b)))
Unit: milliseconds
expr min lq median uq max neval
my.function.v(a, b) 10.146865 10.387144 10.695895 11.123164 13.08263 100
f(a, b) 7.776286 7.973918 8.266882 8.633563 9.98204 100
f2(a, b) 1.322295 1.355601 1.385302 1.465469 1.85349 100
> with(ab(10000), microbenchmark(my.function.v(a, b), f(a, b), f2(a,b), times=10))
Unit: milliseconds
expr min lq median uq max neval
my.function.v(a, b) 429.4030 435.00373 439.06706 442.51650 465.00124 10
f(a, b) 80.7709 83.71715 85.14887 88.02067 89.00047 10
f2(a, b) 164.7807 170.37608 175.94281 247.78353 251.14653 10