Speed of growing a variable-length vector in R

Asked: 2016-09-01 05:53:10

Tags: r performance

I have the following code, whose purpose is to take a single numeric data frame column and create a list in which every two vector elements give the start and end indices of a stretch of the data frame whose mean exceeds 0.032.

示例:

Input: [0.012,0.02,0.032,0.045,0.026,0.06,0.01]
Output: [3,5,6,6]

mean(input[3:5]) > 0.032
mean(input[6:6]) > 0.032

A slightly more complex example:

Input: [0,0.08,0.08,0.031,0.031,-0.1]
Output: [2,5]
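As a quick sanity check, the claimed intervals can be verified directly in R:

```r
input <- c(0.012, 0.02, 0.032, 0.045, 0.026, 0.06, 0.01)
mean(input[3:5])   # about 0.0343, above 0.032
mean(input[6:6])   # 0.06, above 0.032

input2 <- c(0, 0.08, 0.08, 0.031, 0.031, -0.1)
mean(input2[2:5])  # 0.0555, above 0.032
```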

So I can't simply identify the items above 0.032; as far as I can tell, I need to iterate over every index (hence the while loop).

It runs perfectly well for "small" data frames, but I am trying to run it on a data frame of 2,000,000 rows, if not more.

My problem is that it becomes extremely slow once I reach a large number of rows. Specifically, it shoots through values 0-100000 but slows down dramatically after that.

activityduration <- function(input)
{
    datum <- as.matrix(input)
    len <- length(datum)
    times <- c()
    i <- 1
    while (i < len)
    {
        if (i >= len)
        {
            break
        }
        i <- i + 1
        if (datum[i] < 0.032)
        {
            next
        }
        else
        {
            vect <- c(datum[i])
            x <- i
            while (mean(vect) >= 0.032) {
                print(i)
                if (i == len)
                {
                    break
                }
                i <- i + 1
                boolean <- TRUE
                vect <- c(datum[x:i])
            }
            if (i == len)
            {
                break
            }
            if (boolean)
            {
                times <- c(times, c(x, i - 1))
                boolean <- FALSE
            }
        }
    }
    return(times)
}

What I believe to be the problem: I keep growing the vector vect in the second while loop (in some of my data vect can reach length = 10000). This means I am repeatedly resizing vect, which causes the slowdown.
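The cost of growing a vector this way can be illustrated with a small sketch (the size below is arbitrary, chosen for illustration): every `c(vect, x)` copies the entire vector, making the loop quadratic in its length, while writing into a preallocated buffer is linear.

```r
grow <- function(n) {
    vect <- c()
    for (i in seq_len(n)) vect <- c(vect, i)  # full copy on every append
    vect
}
prealloc <- function(n) {
    vect <- numeric(n)                        # allocate once
    for (i in seq_len(n)) vect[i] <- i        # write in place
    vect
}
system.time(grow(50000))      # quadratic: noticeably slow
system.time(prealloc(50000))  # linear: near-instant
```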

What I have tried as a fix: originally input (a data frame) was accessed directly as a data frame; I changed it to a matrix for a substantial speed gain.

I replaced the inner loop with:

{
    newVal <- c(datum[i])
    x <- i
    n <- 0
    meanValue <- 0
    while ((meanValue * n + newVal) >= (0.032 * (n + 1))) {
        print(i)
        if (i == len)
        {
            break
        }
        meanValue <- (meanValue * n + newVal) / (n + 1)
        n <- n + 1
        i <- i + 1
        newVal <- datum[i]
    }
}

This eliminates the need for the vector while performing the same computation, but it led to an even bigger slowdown, most likely due to the sheer number of operations performed.

I also tried: initializing the vector vect with 700000 elements so it never needs to grow, but to do that I needed to change:

mean(vect) >= 0.032 to sum(vect)/n >= 0.032 or mean(vect[!vect == 0]) >= 0.032, which led to an even bigger slowdown.
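One way to preallocate without changing the mean computation is to track a fill count separately and average only the filled prefix, instead of filtering out zeros on every test. A minimal sketch (not the original code; the buffer size and values are illustrative):

```r
buf <- numeric(1000)            # preallocated once; size is an assumption
n <- 0L
for (val in c(0.05, 0.04, 0.06)) {
    n <- n + 1L
    buf[n] <- val               # no reallocation on append
}
sum(buf[seq_len(n)]) / n        # mean of the filled portion only
```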

Does anyone know how I can improve the speed?

3 Answers:

Answer 0 (score: 2)

Try this:

VariableMean <- function(v, Lim) {
    options(scipen = 999)
    s <- which(v >= Lim)
    Len <- length(v)
    stInd <- c(s[1L], s[which(diff(s) > 1L)+1L])
    size <- length(stInd)
    myIndex <- vector(mode="integer", length = 2*size)
    bContinue <- FALSE
    epsilon <- 2*.Machine$double.eps  ## added to account for double precision

    i <- r <- 1L; j <- stInd[i]
    while (i < size) {
        k <- stInd[i+1L]-1L
        temp <- j:k
        myMeans <- cumsum(v[temp])/(1:length(temp))
        myEnd <- temp[max(which(myMeans >= (Lim-epsilon)))]
        i <- i+1L
        if (myEnd+1L < stInd[i]) {
            myIndex[2L*(r-1L)+1L] <- j; myIndex[2L*r] <- myEnd
            j <- stInd[i]
            r <- r+1L
            bContinue <- FALSE
        } else {
            bContinue <- TRUE
        }
    }

    if (!bContinue) {j <- stInd[size]}
    temp <- j:Len
    mySums <- cumsum(v[temp])
    myEnd <- temp[max(which(mySums >= Lim*(1:length(temp))))]
    myIndex[2L*(r-1L)+1L] <- j; myIndex[2L*r] <- myEnd

    myIndex[which(myIndex > 0L)]
}

VariableMean(c(0.012,0.02,0.032,0.045,0.026,0.06,0.01), 0.032)
[1] 3 7

VariableMean(c(0,0.08,0.08,0.031,0.031,-0.1), 0.032)
[1] 2 5
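The key vectorized step in this answer is `cumsum(v[temp])/(1:length(temp))`, which computes the running mean of every prefix of a window in a single pass instead of calling `mean` once per iteration. A minimal illustration:

```r
v <- c(0.045, 0.026, 0.06, 0.01)
cumsum(v) / seq_along(v)
# equivalent to sapply(seq_along(v), function(k) mean(v[1:k]))
```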

Below are some benchmarks and tests (no equality comparison against the algorithm provided by @Tensibai, since the two don't do the same thing (i.e. @Tensibai's algorithm allows overlaps)):

Test data

set.seed(1313)
options(scipen = 999)
HighSamp <- sample(51:75, 10, replace = TRUE)
MidSamp <- sample(36:50, 25, replace = TRUE)
LowSamp <- sample(11:35, 30, replace = TRUE)
MinSamp <- sample(1:10, 35, replace = TRUE)
Samp1 <- sample(c(MinSamp, LowSamp, MidSamp, HighSamp), 20000, replace=TRUE)/1000
Samp2 <- sample(c(MinSamp, LowSamp, MidSamp, HighSamp), 100000, replace=TRUE)/1000
Samp3 <- sample(c(MinSamp, LowSamp, MidSamp, HighSamp), 1000000, replace=TRUE)/1000

Equality checks / validation

JoeTest <- VariableMean(Samp1, 0.032)
OPTest <- activityduration(Samp1)
FoehnTest <- activityduration2(Samp1, 0.032)

length(JoeTest)
[1] 5466
length(OPTest)
[1] 5464

tail(JoeTest)
[1] 19966 19967 19971 19993 19999 20000
tail(OPTest)     ## OP's algo doesn't handle the end case
[1] 19960 19961 19966 19967 19971 19993

mean(Samp1[19999:20000])
[1] 0.065   ## > 0.032

all(JoeTest[1:length(OPTest)]==OPTest)  ## testing equality except for the end
[1] TRUE

all(JoeTest==FoehnTest)
[1] TRUE

## Ensuring mean of intervals is greater than 0.032
TestMean <- sapply(seq.int(1,length(JoeTest),2), function(x) mean(Samp1[JoeTest[x]:JoeTest[x+1L]]))
all(TestMean >= 0.032)
[1] TRUE

Benchmarks

microbenchmark(Joseph=VariableMean(Samp1, 0.032),
                 Foehn=activityduration2(Samp1, 0.032),
                 Tensibai=Find_indexes(Samp1, 0.032),
                 OPAlgo=activityduration(Samp1), times = 10)
Unit: milliseconds
    expr        min         lq       mean     median         uq        max neval
  Joseph  18.191671  19.027055  20.362151  20.917034  21.325900  22.214652    10
   Foehn   6.848098   7.238491   8.079705   7.829212   9.083315   9.794315    10
Tensibai 140.924588 142.171712 149.936844 143.188952 148.294031 198.626850    10
  OPAlgo 122.381933 123.829385 129.934586 128.347027 136.496846 143.782135    10

microbenchmark(Joseph=VariableMean(Samp2, 0.032),
                 Foehn=activityduration2(Samp2, 0.032),
                 Tensibai=Find_indexes(Samp2, 0.032),
                 OPAlgo=activityduration(Samp2), times = 10)
Unit: milliseconds
    expr       min        lq       mean     median         uq        max neval
  Joseph  95.38979  99.82943  106.67638  101.45689  102.99117  154.21767    10
   Foehn  36.63334  37.75115   39.00842   38.97406   39.97898   41.26387    10
Tensibai 709.57490 725.15861  740.39442  737.45620  747.31374  803.22536    10
  OPAlgo 994.43310 996.61208 1025.54683 1030.84784 1046.03234 1063.52655    10

system.time(VariableMean(Samp3, 0.032))
 user  system elapsed 
0.98    0.00    1.00 

system.time(activityduration2(Samp3, 0.032))
 user  system elapsed 
0.37    0.00    0.37 

system.time(activityduration(Samp3))
 user  system elapsed 
51.37    0.42   51.82

system.time(Find_indexes(Samp3, 0.032))
user  system elapsed 
7.69    0.00    7.72

On my machine, the algorithm provided by @Foehn is the fastest (about 3x faster than mine). The algorithms from @Tensibai, @Foehn, and myself all appear to scale well and remain stable on large datasets (as can be seen from the timing ranges in the benchmarks, i.e. the difference between the minimum and maximum times).

Answer 1 (score: 1)

Find_indexes <- function(input,target = 0.032) {
  l <- length(input)
  starts <- which(input >= target)
  seqs <- c(0,diff(starts))
  contiguousIdx <- which(seqs == 1)
  MStarts <- starts[-contiguousIdx]
  ranges <- vector('list',length(MStarts))
  current <- 1
  it <- 0
  for (i in MStarts) {
    fidx <- i
    i <- i + 1
    if (i > l) i <- l
    while (mean(input[fidx:i]) >= target) {
      i <- i + 1
      if (i > l) break
    }
    ranges[[current]] <- fidx:(i - 1)
    current <- current + 1
  }
  ranges
}

set.seed(123)
qinput <- c(0.012,0.02,0.032,0.045,0.026,0.06,0.01)
largeinput <- sample(qinput,1e6,replace = TRUE)

library(microbenchmark)
microbenchmark(Find_indexes(largeinput,0.032),times=3)

The idea is to limit the loops as much as possible, so first we search for the entries of the input equal to or greater than 0.032, then with diff and which we find the 'contiguous' entries (whose index is just +1 of their predecessor's) and build a vector of starting points.
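On the first example input, this start-point construction works out as follows (values recomputed here for illustration):

```r
v <- c(0.012, 0.02, 0.032, 0.045, 0.026, 0.06, 0.01)
starts <- which(v >= 0.032)           # 3 4 6
seqs <- c(0, diff(starts))            # 0 1 2
MStarts <- starts[-which(seqs == 1)]  # 3 6: index 4 is dropped as contiguous
```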

Next, we loop over these starting points and build a list of indices for as long as the mean from the starting point to the current position is still >= target (0.032 by default).

The function returns a list of indices; if you only need the first and last index of each range, you can pass the result to lapply with the function range.

Benchmark results on the 1e6 vector:

Unit: seconds
                                      expr      min       lq     mean  median       uq      max neval
 result <- Find_indexes(largeinput, 0.032) 14.24063 14.25262 14.40224 14.2646 14.48304 14.70147     3

Still up to 14 seconds on my machine, but that sounds better than what you currently have. (I didn't time your solution, sorry.)

One drawback: it records overlapping ranges.

Output:

> head(largeinput,10)
 [1] 0.032 0.060 0.032 0.010 0.010 0.012 0.045 0.010 0.045 0.045


> head(result)
[[1]]
[1] 1 2 3 4

[[2]]
[1] 7

[[3]]
[1]  9 10 11 12 13 14

[[4]]
[1] 12 13 14

[[5]]
[1] 19

[[6]]
[1] 27 28 29

> head(lapply(result,range))
[[1]]
[1] 1 4

[[2]]
[1] 7 7

[[3]]
[1]  9 14

[[4]]
[1] 12 14

[[5]]
[1] 19 19

[[6]]
[1] 27 29

Answer 2 (score: 1)

Here is another algorithm that produces the same results as @Joseph Wood's.

activityduration <- function(input, th) {
    epsilon <- 2*.Machine$double.eps
    a <- input - th
    s <- 0; l <- 1; r <- 1; f <- F;
    n <- length(input)
    res <- vector(mode = "integer", length = 2 * n)
    j <- 0
    for (i in 1:n) {
        s <- s + a[i]
        if (s < 0 - epsilon) {
            if (f) {
                j <- j + 1
                res[c(2 * j - 1, 2 * j)] <- c(l, r)
                f <- F
            } else {
                l <- i + 1
            }
            s <- 0
        } else {
            r <- i
            if (!f) { 
                f <- T
                l <- i
            }
        }
    }
    if (f) {
        j <- j + 1
        res[c(2 * j - 1, 2 * j)] <- c(l, r)
    }
    return(res[res > 0])
}

Testing the original examples

print(activityduration(c(0.012,0.02,0.032,0.045,0.026,0.06,0.01), 0.032))
[1] 3 7
print(activityduration(c(0,0.08,0.08,0.031,0.031,-0.1), 0.032))
[1] 2 5

Testing with @Joseph Wood's data

set.seed(1313)
options(scipen = 999)
HighSamp <- sample(51:75, 10, replace = TRUE)
MidSamp <- sample(36:50, 25, replace = TRUE)
LowSamp <- sample(11:35, 30, replace = TRUE)
MinSamp <- sample(1:10, 35, replace = TRUE)
Samp1 <- sample(c(MinSamp, LowSamp, MidSamp, HighSamp), 20000, replace=TRUE)/1000
Samp2 <- sample(c(MinSamp, LowSamp, MidSamp, HighSamp), 100000, replace=TRUE)/1000
Samp3 <- sample(c(MinSamp, LowSamp, MidSamp, HighSamp), 1000000, replace=TRUE)/1000


JoeTest <- VariableMean(Samp1, 0.032)
SomeTest <- activityduration(Samp1, 0.032)

all(JoeTest == SomeTest)
[1] TRUE

Performance tests

library("microbenchmark")
microbenchmark(Joseph=VariableMean(Samp1, 0.032), SomeAlgo=activityduration(Samp1, 0.032), times = 10)
Unit: milliseconds
     expr      min       lq     mean   median       uq      max neval
   Joseph 38.94056 39.54052 40.59358 40.41387 41.83913 42.14377    10
 SomeAlgo 38.14466 38.53188 39.47474 38.91653 40.24965 41.72669    10
microbenchmark(Joseph=VariableMean(Samp2, 0.032), SomeAlgo=activityduration(Samp2, 0.032), times = 10)
Unit: milliseconds
     expr      min       lq     mean   median       uq      max neval
   Joseph 201.9639 212.5006 226.1548 217.6033 238.1169 266.1831    10
 SomeAlgo 194.1691 200.7253 203.0191 203.6269 205.4802 211.1224    10

system.time(VariableMean(Samp3, 0.032))
   user  system elapsed 
   2.12    0.01    2.16 
system.time(activityduration(Samp3, 0.032))
   user  system elapsed 
   2.08    0.02    2.10 

Discussion

1. This algorithm gives a speed increase, though a very modest one.
2. The core of the algorithm is to avoid computing the mean directly, and instead to track whether the cumulative sum (of the input minus the threshold) changes sign.
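The equivalence the algorithm relies on: mean(v[i:j]) >= th exactly when sum(v[i:j] - th) >= 0, so subtracting the threshold up front turns the mean test into a sign test on a running sum. On the first example interval:

```r
v <- c(0.012, 0.02, 0.032, 0.045, 0.026, 0.06, 0.01)
a <- v - 0.032
cumsum(a[3:5])  # approximately 0, 0.013, 0.007 -- never negative,
                # so every prefix mean of v[3:5] stays >= 0.032
```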