Question

I'm hoping someone can help.

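I have a dataset of roughly 400K rows and have had to use several nested loops, so the code runs very slowly in R. I have read about vectorization, but I can't see how to apply it here. I'll give a similar example so you have some context.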

Say I have a set of data as follows:

Name | TimeStamp (UNIX time)
A      15
B      16
C      16
A      20
D      21

Basically, names can repeat and time never decreases (although there can be multiple entries at a particular time).

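This is how I have written the code so far (I'm inexperienced with R and with coding in general, so please forgive any code that looks non-standard):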

dataIn <- read.csv('above table in csv format..') 
timeStamp <- as.numeric(dataIn$TimeStamp)
names <- as.vector(dataIn$Name)

#Unique list of names
uNames <- unique(names)
n <- length(uNames)

#Matrix data structure
countMatrix <- matrix(0, ncol = n, nrow = n)

threshold <- 5

#This is where it starts to take very long to run (as in: days)
for (i in seq_along(uNames)) {
  print(i)  # progress indicator
  currentName <- uNames[i]
  for (j in seq_along(timeStamp)) {
    currentTime <- timeStamp[j]
    if (names[j] == currentName) {
      futureIndex <- j
      tempLog <- character()  # names already counted in this window
      # Check the index bound first so timeStamp is never read past its end
      while (futureIndex <= length(timeStamp) &&
             (timeStamp[futureIndex] - currentTime) <= threshold) {
        futureName <- names[futureIndex]
        futureNameIndex <- match(futureName, uNames)
        # Count each following name at most once per window
        if (!(futureName %in% tempLog)) {
          tempLog <- c(futureName, tempLog)
          countMatrix[i, futureNameIndex] <- countMatrix[i, futureNameIndex] + 1
        }
        futureIndex <- futureIndex + 1
      }
    }
  }
}

#This bit is almost instantaneous
for (row in seq_along(uNames)) {
  # The diagonal holds how many times the row's own name occurs
  standardNameNumber <- countMatrix[row, row]
  countMatrix[row, ] <- countMatrix[row, ] / standardNameNumber
}
print('normalisation complete..')
colnames(countMatrix) <- uNames
df <- as.data.frame(countMatrix)

Just to briefly explain what the code does (it isn't entirely clear from the code itself):

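I want to answer this question: given that some name (say A) has been called, what is the probability that another name (say B) is called within some threshold? My idea is to answer this for A and then repeat the process for every other name.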

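Does anyone know how to make this code faster? Right now it takes days to run on about 800k rows, and my full dataset is much larger. I know that looping over a large dataset many times can be very time-consuming, but there must be some way to speed this up. Any help would be much appreciated.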
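One observation that may help: the outer loop over unique names rescans every row once per name, yet each row only ever updates the matrix row of its own name. A minimal sketch of a single-pass version is below. The function name `countFollowers` is made up for illustration, and it assumes `timeStamp` is already sorted in non-decreasing order so that base R's `findInterval` can locate the end of each time window.

```r
# Hypothetical helper (not part of the original script).
# Assumes timeStamp is sorted in non-decreasing order and threshold >= 0.
countFollowers <- function(names, timeStamp, threshold) {
  uNames <- unique(names)
  id <- match(names, uNames)  # integer code of each row's name

  # last[j] = index of the last row whose time is <= timeStamp[j] + threshold
  last <- findInterval(timeStamp + threshold, timeStamp)

  counts <- matrix(0, nrow = length(uNames), ncol = length(uNames),
                   dimnames = list(uNames, uNames))
  for (j in seq_along(timeStamp)) {
    # Distinct names seen in the window starting at row j (row j included)
    followers <- unique(id[j:last[j]])
    counts[id[j], followers] <- counts[id[j], followers] + 1
  }
  counts
}

# Usage on the sample table, with a threshold of 1
sampleNames <- c("A", "B", "C", "A", "D")
sampleTimes <- c(15, 16, 16, 20, 21)
rawCounts <- countFollowers(sampleNames, sampleTimes, threshold = 1)
```

This still loops over rows, but only once, and the inner `while` scan is replaced by a precomputed window end. The result holds raw counts; dividing each row by its diagonal entry gives the probabilities.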

So for the table given at the start, the output 'countMatrix' would be (using a threshold of 1 second):

    A    B    C    D
A   1    0.5  0.5  0.5
B   0    1    1    0
C   0    0    1    1
D   0    0    0    1

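The table is read with rows as the event and columns as the event that follows. So, reading the first row: given that A has been called, the probability that B is called within 1 second is 0.5, and the same holds for C and D. The probability that a name is called within 1 second of itself is obviously 1. Note that repeats are not counted: for example, if B is called at both 15 and 16 seconds, we only care that it was called at least once within the time threshold.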
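To make the "at least once per window" rule concrete, here is a rough vectorized sketch that avoids explicit loops. It expands every (row, later row in window) pair, removes duplicate (row, following name) pairs so that repeats count once, then tabulates and normalizes by the diagonal. The function name `followProbability` is hypothetical, and the sketch assumes `timeStamp` is sorted in non-decreasing order. The expanded pair vectors hold one entry per row in every window, so memory use grows with the window size.

```r
# Hypothetical helper (not part of the original script).
# Assumes timeStamp is sorted in non-decreasing order and threshold >= 0.
followProbability <- function(names, timeStamp, threshold) {
  stopifnot(!is.unsorted(timeStamp), threshold >= 0)
  uNames <- unique(names)
  k <- length(uNames)
  id <- match(names, uNames)  # integer code of each row's name
  n <- length(timeStamp)

  # last[j] = index of the last row inside the window that starts at row j
  last <- findInterval(timeStamp + threshold, timeStamp)
  lens <- last - seq_len(n) + 1L

  # Every (window start, row inside that window) pair
  from <- rep.int(seq_len(n), lens)
  to <- from + sequence(lens) - 1L

  # Encode (start row, following name) as one number and drop duplicates,
  # so a name repeated inside a window is counted once
  key <- unique((from - 1) * k + (id[to] - 1))
  rowName <- id[key %/% k + 1]
  follower <- key %% k + 1

  # Tabulate (row name, following name) cells in column-major order
  cell <- (follower - 1) * k + rowName
  counts <- matrix(tabulate(cell, nbins = k * k), nrow = k,
                   dimnames = list(uNames, uNames))

  # Divide each row by its own diagonal entry
  counts / diag(counts)
}

probs <- followProbability(c("A", "B", "C", "A", "D"),
                           c(15, 16, 16, 20, 21), threshold = 1)
```

Under the forward-scan rule used in the loop, the sample data with a threshold of 1 gives 0 for C followed by D (D comes 5 seconds after C), so that cell is worth double-checking against the intended definition.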

I hope this makes it clearer.

How can I make an R script faster?

0 answers: