Question

在我第一次使用R的尝试中，我编写了两个功能，这些功能并不是非常高性能，如果我可以获得一些关于如何使它们更高性能（矢量化）的提示，我将不胜感激。这两个功能最后都带有“测试用例”。

第一个函数采用两个时间序列xts对象x和y，并返回一个系列，其中包含有关x高于/低于y的天数的数据。

require('xts')
require('quantmod')

countDaysBelowOrAbove <- function(x, y) {
    x <- try.xts(x, error=as.matrix)
    y <- try.xts(y, error=as.matrix)

    if(is.xts(x) && is.xts(y)) {
        xy <- cbind(x,y)
    } else {
        xy <- cbind( as.vector(x), as.vector(y) )
    }

    # Count NAs, ensure they're only at beginning of data, then remove.
    xNAs <- sum( is.na(x) )
    yNAs <- sum( is.na(y) )
    NAs <- max( xNAs, yNAs )
    if( NAs > 0 ) {
        if( any( is.na(xy[-(1:NAs),]) ) ) stop("Series contain non-leading NAs")
    }

    resultDaysLower <- x
    resultDaysHigher <- x
    resultDaysLower[!is.na(resultDaysLower)]<-0
    resultDaysHigher[!is.na(resultDaysHigher)]<-0

    series<-cbind(xy, resultDaysLower, resultDaysHigher)
    colnames(series) <- c(names(xy), "cumDaysLower", "cumDaysHigher")

    daysLower = 0
    daysHigher = 0

    for (i in 1:NROW(xy)) {
        if (!(is.na(series[,1][i]) | is.na(series[,2][i]))) {
            if (series[,1][i] >= series[,2][i]) {
                daysLower = 0
                daysHigher = daysHigher + 1
            }
            else {
                daysHigher = 0
                daysLower = daysLower + 1
            }
        }
        else {
            daysLower = 0
            daysHigher = 0
        }
        series$cumDaysLower[i] = daysLower
        series$cumDaysHigher[i] = daysHigher                
    }
    return(series)
}

getSymbols("SPY", from='2005-01-01')
SPYclose = Cl(SPY)

getSymbols("QQQQ", from='2005-01-01')
QQQQclose = Cl(QQQQ)

testData = countDaysBelowOrAbove(SPYclose, QQQQclose)

第二个功能我将非常感谢有关性能优化的帮助如下。该函数将xts对象系列作为参数，使用表示间隔长度的xts对象来计算指定时间的系列最小值。该函数返回计算出的最小系列，其中包含指定窗口的最小计算长度。

minimumWithVaryingLength<-function(series, lengths) {
    series <- try.xts(series, error=as.matrix)
    lengths <- try.xts(lengths, error=as.matrix)

    if(is.xts(series) && is.xts(lengths)) {
        serieslengths <- cbind(series,lengths)
    } else {
        serieslengths <- cbind( as.vector(series), as.vector(lengths) )
    }

    # Count NAs, ensure they're only at beginning of data, then remove.
    seriesNAs <- sum( is.na(series) )
    lengthsNAs <- sum( is.na(lengths) )
    NAs <- max( seriesNAs, lengthsNAs )
    if( NAs > 0 ) {
        if( any( is.na(serieslengths[-(1:NAs),]) ) ) stop("Series contain non-leading NAs")
    }

    result <- series
    result[!is.na(result)]<-0

    for (i in 1:NROW(serieslengths)) {  
        if (lengths[i] > 0) {
            result[i] <- runMin(series, n=lengths[i], cumulative=FALSE)[i]
        }
        else {
            result[i] <- 0
        }
    }

    return(result)
}

getSymbols("SPY", from='2005-01-01')
SPYclose = Cl(SPY)

getSymbols("QQQQ", from='2005-01-01')
QQQQclose = Cl(QQQQ)

numDaysBelow = countDaysBelowOrAbove(SPYclose, QQQQclose)
test = minimumWithVaryingLength(SPYclose, numDaysBelow)

提前感谢您的帮助。

亲切的问候，萨莫。

Answer 1

对于第一个功能，您要查找系列x低于/高于y的累计期数。为此，您可以使用从CumCount()构建的此便捷函数cummax。首先是一些样本数据：

set.seed(1)
x <- sample(1:5,20,T)
y <- sample(1:5,20,T)

CumCount <- function(x) {
  z <- cumsum(x)
  z - cummax(z*(!x))
}

CumLow = CumCount(x<y)
CumHigh = CumCount(x>y)

对于第二次计算，您尝试在x期间找到累积最小 x < y值在每个时间段中。为此，rle函数非常有用（“run-length-encoding”）。

# runs equals the length of each phase (x < y or x > y)
runs <- rle(CumLow > 0)$lengths
# starts is the number of periods prior to each phase...
starts <- c(0,cumsum(runs)[-length(runs)]) 
#... which we use to build "blocks", a list of indices of each phase.
blocks <- mapply( function(x,y) x+y, starts, lapply(runs,seq))
# now apply the cummin function within each block:
# (remember to mask it by CumLow > 0 -- 
#   we only want to do this within the x<y phase)
BlockCumMin <- unlist(sapply(blocks, function(blk) cummin(x[blk]))) * (CumLow > 0)

现在我们把它们放在一起：

  > cbind(x,y, CumLow, CumHigh, BlockCumMin)

      x y CumLow CumHigh BlockCumMin
 [1,] 3 4      1       0           3
 [2,] 4 2      0       1           0
 [3,] 2 2      0       0           0
 [4,] 2 5      1       0           2
 [5,] 4 4      0       0           0
 [6,] 2 2      0       0           0
 [7,] 4 1      0       1           0
 [8,] 1 3      1       0           1
 [9,] 2 5      2       0           1
[10,] 1 3      3       0           1
[11,] 2 5      4       0           1
[12,] 1 4      5       0           1
[13,] 4 2      0       1           0
[14,] 5 3      0       2           0
[15,] 4 1      0       3           0
[16,] 4 1      0       4           0
[17,] 3 4      1       0           3
[18,] 3 1      0       1           0
[19,] 5 3      0       2           0
[20,] 4 4      0       0           0

请注意，此问题与this question

有关

更新。 对于您有series向量的更一般情况，lengths向量（长度与{{1相同） }}），并且您希望生成一个名为series的结果，其中BlockMins是位于BlockMins[i]结尾的lengths[i] series块的最小值，您可以做到以下几点。由于长度是任意的，因此不再是累积最小值;对于每个i，您必须获取i length[i]元素的最小值，其结尾位置为series：

Answer 2

如果您没有处理时间序列设备，如果您有两个向量x和y，并希望“返回一个包含多少天x高于/低于y的数据的系列”，只需比较它们：

# Make up some data
x <- seq(100)
y <- x[sample(x)]
# Compare
x.greater <- sum(x>y)
x.lesser <- sum(x<y)

这个的关键是当你对逻辑矢量求和时，例如（x> y），R将TRUE变为1，将FALSE变为0。

是否可以优化（矢量化）这两个函数以获得更好的性能

2 个答案: