Question

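I have a JxK data frame M, and I want to compute the following.

1. For each row j, the minimum of M[j, k]
2. For each column k, the minimum of M[j, k]

Let the values that satisfy the first item be vector A_j, and those that satisfy the second be vector A_k. I then need two more vectors. Let vector C be the vector sort(c(A_j, A_k)).

3. A vector of the same length as A_j, where element i is the index of A_j[i] in the combined and sorted vector C.
4. A vector of the same length as A_k, where element i is the index of A_k[i] in the combined and sorted vector C.

For both rank vectors above, ties should all receive the first index at which the value appears in vector C. That is, if A_j[i] and A_j[i + 1] are equal, then elements i and i + 1 of the vector satisfying item 3 should both equal the first position of A_j[i] in the sorted vector C.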
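To make the tie rule concrete, here is a small sketch with made-up values (the numbers are illustrative assumptions, not taken from the real data). It relies on the fact that base R's match() returns the position of the first match.

```r
# Toy values chosen for illustration only
A_j <- c(3, 1, 3)
A_k <- c(1, 2)

# Combined and sorted vector C: 1 1 2 3 3
C <- sort(c(A_j, A_k))

# match() reports the first position of each value in C,
# so tied values share the same index
rank_j <- match(A_j, C)  # 4 1 4
rank_k <- match(A_k, C)  # 1 3
```

Both 3s in A_j map to position 4, the first place 3 occurs in C.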

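As usual, this is not hard to do inefficiently. In practice, however, the data frame is very large, so inefficient solutions fail.

As a proof of concept, one solution is as follows: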

# Create the dataframe
set.seed(1)
df <- data.frame(matrix(rnorm(50, 8, 2), 10)) # A 10x5 data frame

# Calculate 1 and 2
A.j <- apply(df, 1, min) 
A.k <- apply(df, 2, min)

# Calculate 3 and 4
C <- sort(unname(c(A.j, A.k)))

A.j.indices <- apply(df, 1, function(x) which(x == min(x)))
A.k.indices <- apply(df, 2, function(x) which(x == min(x)))

vec3out <- c()
vec4out <- c()

for(j in 1:nrow(df)){
   rank <- which(C == A.j[j])[1] 
   vec3out <- c(vec3out, rank)
}

for(k in 1:ncol(df)){
   rank <- which(C == A.k[k])[1] 
   vec4out <- c(vec4out, rank)
}

Answer 1

For starters, you should use a matrix. Data frames are less efficient (see "Should I use a data.frame or a matrix?"). Then we can use the apply function.

Let M be your data.frame coerced to a matrix.

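# Coerce the data.frame to a matrix
M <- as.matrix(M)

# Note: which.min returns the position of each minimum, not its value
minByRow <- apply(M, MARGIN=1, FUN=which.min)
minByCol <- apply(M, MARGIN=2, FUN=which.min)

# Combine and sort, then look up the first matching position of each element
combinedSorted <- sort(c(minByRow, minByCol))

byRowOutput <- match(minByRow, combinedSorted)
byColOutput <- match(minByCol, combinedSorted)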
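One caveat: which.min gives the position of each minimum, whereas the question ranks the minimum values themselves (A.j and A.k in its proof of concept). If values are what you want, a minimal sketch is to swap in min and keep the same sort and match steps. The helper name rank_minima is hypothetical and only used for illustration here.

```r
# Hypothetical helper (not part of the original answer): rank row and
# column minima by value, with ties mapped to their first position in C.
rank_minima <- function(M) {
  M <- as.matrix(M)
  row_min <- unname(apply(M, MARGIN = 1, FUN = min))  # minimum of each row
  col_min <- unname(apply(M, MARGIN = 2, FUN = min))  # minimum of each column
  C <- sort(c(row_min, col_min))                      # combined, sorted minima
  list(
    row_min  = row_min,
    col_min  = col_min,
    row_rank = match(row_min, C),  # first index of each row minimum in C
    col_rank = match(col_min, C)   # first index of each column minimum in C
  )
}

# Same toy data as in the question
set.seed(1)
df <- data.frame(matrix(rnorm(50, 8, 2), 10))
res <- rank_minima(df)
```

Because the overall minimum of the matrix is both a row minimum and a column minimum, a tie always exists, and match() assigns index 1 to it in both outputs.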

Here are the results for 1 million observations of 100 variables:

M <- matrix(data=rnorm(100000000), nrow=1000000, ncol=100)


system.time({
  minByRow <- apply(M, MARGIN=1, FUN=which.min)
  minByCol <- apply(M, MARGIN=2, FUN=which.min)

  combinedSorted <- sort(c(minByRow, minByCol))

  byRowOutput <- match(minByRow, combinedSorted)
  byColOutput <- match(minByCol, combinedSorted)
})

   user  system elapsed 
   7.37    0.46    7.93
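With a million rows, the row-wise apply calls its function once per row, so it is likely where much of the time goes. If you need row minimum values, one base-R alternative that may be faster is pmin applied across the columns. This is a sketch on a small made-up matrix; no timing is claimed, so it is worth measuring on your own data.

```r
# Sketch only: row minima via pmin over columns instead of apply over rows.
set.seed(1)
M <- matrix(rnorm(5000), nrow = 1000, ncol = 5)

# One vector per column; pmin then takes the element-wise (per-row) minimum
cols <- lapply(seq_len(ncol(M)), function(k) M[, k])
rowMinFast <- do.call(pmin, cols)

# Reference result from the apply-based approach
rowMinApply <- apply(M, MARGIN = 1, FUN = min)

# Check that both approaches agree
all.equal(rowMinFast, rowMinApply)
```

Column minima only need one call per column, so apply(M, MARGIN = 2, FUN = min) is usually fine there.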

Efficiently extract the minimum value and index of each column and row in a data frame, then rank by value
