我有以下代码,其目的是获取单个数字数据框列并创建一个列表,每两个向量元素引用数据框的起始和结束索引,其中均值超过0.032。 / p>
示例:
Input: [0.012,0.02,0.032,0.045,0.026,0.06,0.01]
Output [3,5,6,6]
为mean(input(3:5))>0.032
和mean(input(6:6))>0.032
稍微复杂的例子 输入[0,0.08,0.08,0.031,0.031,-0.1] 输出[2,5]
所以我不能只识别0.032以上的项目,据我所知,我需要遍历每个索引。 (因此是while循环)
它对于“小数据帧”运行得非常好,但我试图让它在2,000,000行的数据帧上运行,如果不是更多的话。
我的问题是,当我到达大量行时,它运行得非常慢。具体来说,它会通过值0-100000射击,但之后会显着减慢
activityduration<-function(input)
{
datum<-as.matrix(input)
len=length(datum)
times <-c()
i<-1
while (i <len)
{
if (i>=len)
{
break
}
i<-i+1
if (datum[i]<0.032)
{
next
}
else
{
vect = c(datum[i])
x<-i
while ((mean(vect)>=0.032)){
print(i)
if (i==len)
{
break
}
i<-i+1
boolean <- TRUE
vect <- c(datum[x:i])
}
if (i==len)
{
break
}
if (boolean)
{
times <- c(times, c(x,i-1))
boolean<-FALSE
}
}
}
return(times)
}
我认为是问题所在:
我在第二个while循环中不断增长向量vect
。 (在我的一些数据中vect
可以达到长度= 10000)。这意味着我正在重复更新vect's
大小,导致速度减慢。
我尝试过修正: 最初输入(数据帧)只是作为数据帧访问,我将其更改为矩阵以大幅提高速度。
我将其替换为:
{
newVal = c(datum[i])
x<-i
n<-0
meanValue<-0
while (((meanValue*n+newVal)>=(0.032*(n+1))){
print(i)
if (i==len)
{
break
}
meanValue<-(meanValue*n+newVal)/n+1
n<n+1
i<-i+1
}
在保持相同操作的同时消除了对矢量的需要,但是这会导致更大的减速。很可能是由于执行了大量的操作。
我也尝试过:用700000元素初始化向量vect
,这样就永远不需要增长,但为了做到这一点,我需要改变:
mean(vect)>=0.032
到sum(vect)/n >=0.032
或mean(vect[!vect==0])
这会导致更大的减速。
有人知道如何提高速度吗?
答案 0 :(得分:2)
试试这个:
VariableMean <- function(v, Lim) {options(scipen = 999)
s <- which(v >= Lim)
Len <- length(v)
stInd <- c(s[1L], s[which(diff(s) > 1L)+1L])
size <- length(stInd)
myIndex <- vector(mode="integer", length = 2*size)
bContinue <- FALSE
epsilon <- 2*.Machine$double.eps ## added to account for double precision
i <- r <- 1L; j <- stInd[i]
while (i < size) {
k <- stInd[i+1L]-1L
temp <- j:k
myMeans <- cumsum(v[temp])/(1:length(temp))
myEnd <- temp[max(which(myMeans >= (Lim-epsilon)))]
i <- i+1L
if (myEnd+1L < stInd[i]) {
myIndex[2L*(r-1L)+1L] <- j; myIndex[2L*r] <- myEnd
j <- stInd[i]
r <- r+1L
bContinue <- FALSE
} else {
bContinue <- TRUE
}
}
if (!bContinue) {j <- stInd[size]}
temp <- j:Len
mySums <- cumsum(v[temp])
myEnd <- temp[max(which(mySums >= Lim*(1:length(temp))))]
myIndex[2L*(r-1L)+1L] <- j; myIndex[2L*r] <- myEnd
myIndex[which(myIndex > 0L)]
}
VariableMean(c(0.012,0.02,0.032,0.045,0.026,0.06,0.01), 0.032)
[1] 3 7
VariableMean(c(0,0.08,0.08,0.031,0.031,-0.1), 0.032)
[1] 2 5
下面是一些基准测试和测试(没有比较与@Tensibai提供的算法相等,因为它们没有做同样的事情(即@Tensibai的算法中有重叠)):
测试数据
set.seed(1313)
options(scipen = 999)
HighSamp <- sample(51:75, 10, replace = TRUE)
MidSamp <- sample(36:50, 25, replace = TRUE)
LowSamp <- sample(11:35, 30, replace = TRUE)
MinSamp <- sample(1:10, 35, replace = TRUE)
Samp1 <- sample(c(MinSamp, LowSamp, MidSamp, HighSamp), 20000, replace=TRUE)/1000
Samp2 <- sample(c(MinSamp, LowSamp, MidSamp, HighSamp), 100000, replace=TRUE)/1000
Samp3 <- sample(c(MinSamp, LowSamp, MidSamp, HighSamp), 1000000, replace=TRUE)/1000
平等检查/验证
JoeTest <- VariableMean(Samp1, 0.032)
OPTest <- activityduration(Samp1)
FoehnTest <- activityduration2(Samp1, 0.032)
length(JoeTest)
[1] 5466
length(OPTest)
[1] 5464
tail(JoeTest)
[1] 19966 19967 19971 19993 19999 20000
tail(OPTest) ## OP's algo doesn't handle the end case
[1] 19960 19961 19966 19967 19971 19993
mean(Samp1[19999:20000])
[1] 0.065 ## > 0.032
all(JoeTest[1:length(OPTest)]==OPTest) ## testing equality expect for the end
[1] TRUE
all(JoeTest==FoehnTest)
[1] TRUE
## Ensuring mean of intervals is greater than 0.032
TestMean <- sapply(seq.int(1,length(JoeTest),2), function(x) mean(Samp1[JoeTest[x]:JoeTest[x+1L]]))
all(TestMean >= 0.032)
[1] TRUE
<强>基准强>
microbenchmark(Joseph=VariableMean(Samp1, 0.032),
Foehn=activityduration2(Samp1, 0.032),
Tensibai=Find_indexes(Samp1, 0.032),
OPAlgo=activityduration(Samp1), times = 10)
Unit: milliseconds
expr min lq mean median uq max neval
Joseph 18.191671 19.027055 20.362151 20.917034 21.325900 22.214652 10
Foehn 6.848098 7.238491 8.079705 7.829212 9.083315 9.794315 10
Tensibai 140.924588 142.171712 149.936844 143.188952 148.294031 198.626850 10
OPAlgo 122.381933 123.829385 129.934586 128.347027 136.496846 143.782135 10
microbenchmark(Joseph=VariableMean(Samp2, 0.032),
Foehn=activityduration2(Samp2, 0.032),
Tensibai=Find_indexes(Samp2, 0.032),
OPAlgo=activityduration(Samp2), times = 10)
Unit: milliseconds
expr min lq mean median uq max neval
Joseph 95.38979 99.82943 106.67638 101.45689 102.99117 154.21767 10
Foehn 36.63334 37.75115 39.00842 38.97406 39.97898 41.26387 10
Tensibai 709.57490 725.15861 740.39442 737.45620 747.31374 803.22536 10
OPAlgo 994.43310 996.61208 1025.54683 1030.84784 1046.03234 1063.52655 10
system.time(VariableMean(Samp3, 0.032))
user system elapsed
0.98 0.00 1.00
system.time(activityduration2(Samp3, 0.032))
user system elapsed
0.37 0.00 0.37
system.time(activityduration(Samp3))
user system elapsed
51.37 0.42 51.82
system.time(Find_indexes(Samp3, 0.032))
user system elapsed
7.69 0.00 7.72
在我的机器上,由@Foehn提供的算法速度最快,而且速度比我快(比我快3倍)。 @Tensibai,@ Foehn和我的算法似乎都很好地扩展并且在大数据集中是稳定的(从基准测试中可以看出时间范围(即最小和最大时间之间的差异))。
答案 1 :(得分:1)
Find_indexes <- function(input,target = 0.032) {
l <- length(input)
starts <- which(input >= target)
seqs <- c(0,diff(starts))
contiguousIdx <- which(seqs == 1)
MStarts <- starts[-contiguousIdx]
ranges <- vector('list',length(MStarts))
current <- 1
it <- 0
for (i in MStarts) {
fidx <- i
i <- i + 1
if (i > l) i <- l
while (mean(input[fidx:i]) >= target) {
i <- i + 1
if (i > l) break
}
ranges[[current]] <- fidx:(i - 1)
current <- current + 1
}
ranges
}
set.seed(123)
qinput <- c(0.012,0.02,0.032,0.045,0.026,0.06,0.01)
largeinput <- sample(qinput,1e6,replace = TRUE)
library(microbenchmark)
microbenchmark(Find_indexes(largeinput,0.032),times=3)
这个想法是尽可能地限制循环,所以首先我们搜索输入等于或大于0.032的条目,然后用diff
和wich
搜索'连续'条目(索引只是+1或前任)并建立一个起点向量。
接下来,我们遍历这些起始点并构建索引列表,而从起点到实际位置的平均值仍然是> = target
(默认为0.032)
该函数返回一个索引列表,如果你只需要第一个和最后一个索引,你可以将结果传递给lapply
并使用函数ranges
Benchrck对1e6载体的结果:
Unit: seconds
expr min lq mean median uq max neval
result <- Find_indexes(largeinput, 0.032) 14.24063 14.25262 14.40224 14.2646 14.48304 14.70147 3
我的机器上仍然长达14秒,但这听起来比你实际上要好。 (我没有解决你的解决方案,抱歉)。
有一个缺点,它记录重叠范围。
输出:
> head(largeinput,10)
[1] 0.032 0.060 0.032 0.010 0.010 0.012 0.045 0.010 0.045 0.045
> head(result)
[[1]]
[1] 1 2 3 4
[[2]]
[1] 7
[[3]]
[1] 9 10 11 12 13 14
[[4]]
[1] 12 13 14
[[5]]
[1] 19
[[6]]
[1] 27 28 29
> head(lapply(result,range))
[[1]]
[1] 1 4
[[2]]
[1] 7 7
[[3]]
[1] 9 14
[[4]]
[1] 12 14
[[5]]
[1] 19 19
[[6]]
[1] 27 29
答案 2 :(得分:1)
这是另一种算法,可以产生与@Joseph Wood相同的结果。
activityduration <- function(input, th) {
epsilon <- 2*.Machine$double.eps
a <- input - th
s <- 0; l <- 1; r <- 1; f <- F;
n <- length(input)
res <- vector(mode = "integer", length = 2 * n)
j <- 0
for (i in 1:n) {
s <- s + a[i]
if (s < 0 - epsilon) {
if (f) {
j <- j + 1
res[c(2 * j - 1, 2 * j)] <- c(l, r)
f <- F
} else {
l <- i + 1
}
s <- 0
} else {
r <- i
if (!f) {
f <- T
l <- i
}
}
}
if (f) {
j <- j + 1
res[c(2 * j - 1, 2 * j)] <- c(l, r)
}
return(res[res > 0])
}
测试原始示例
print(activityduration(c(0.012,0.02,0.032,0.045,0.026,0.06,0.01), 0.032))
[1] 3 7
print(activityduration(c(0,0.08,0.08,0.031,0.031,-0.1), 0.032))
[1] 2 5
测试@Joseph Wood的数据
set.seed(1313)
options(scipen = 999)
HighSamp <- sample(51:75, 10, replace = TRUE)
MidSamp <- sample(36:50, 25, replace = TRUE)
LowSamp <- sample(11:35, 30, replace = TRUE)
MinSamp <- sample(1:10, 35, replace = TRUE)
Samp1 <- sample(c(MinSamp, LowSamp, MidSamp, HighSamp), 20000, replace=TRUE)/1000
Samp2 <- sample(c(MinSamp, LowSamp, MidSamp, HighSamp), 100000, replace=TRUE)/1000
Samp3 <- sample(c(MinSamp, LowSamp, MidSamp, HighSamp), 1000000, replace=TRUE)/1000
JoeTest <- VariableMean(Samp1, 0.032)
SomeTest <- activityduration(Samp1, 0.032)
all(JoeTest == SomeTest)
[1] TRUE
效果测试
library("microbenchmark")
microbenchmark(Joseph=VariableMean(Samp1, 0.032), SomeAlgo=activityduration(Samp1, 0.032), times = 10)
Unit: milliseconds
expr min lq mean median uq max neval
Joseph 38.94056 39.54052 40.59358 40.41387 41.83913 42.14377 10
SomeAlgo 38.14466 38.53188 39.47474 38.91653 40.24965 41.72669 10
microbenchmark(Joseph=VariableMean(Samp2, 0.032), SomeAlgo=activityduration(Samp2, 0.032), times = 10)
Unit: milliseconds
expr min lq mean median uq max neval
Joseph 201.9639 212.5006 226.1548 217.6033 238.1169 266.1831 10
SomeAlgo 194.1691 200.7253 203.0191 203.6269 205.4802 211.1224 10
system.time(VariableMean(Samp3, 0.032))
user system elapsed
2.12 0.01 2.16
system.time(activityduration(Samp3, 0.032))
user system elapsed
2.08 0.02 2.10
<强>讨论强>
这个算法有一个速度增加,虽然非常温和;
2.算法的核心是避免直接计算均值,而是计算累积和是否改变其符号。