以最少的计算时间遍历每一行和每一列

时间:2019-12-17 20:18:49

标签: r foreach data.table apply

我有一个包含79003行和97列的数据集。我的数据集如下所示:

col1    col2          1  2         3          4         5        6        7        8         
str_11  str_44 0.2064191 0 0.6061358 0.92798677 2.7899374 1.098612 1.395511 0.000000 
str_11  str_22 0.9044563 0 1.7917595 0.00000000 1.1412787 1.504077 1.008228 0.000000 
str_11  str_18 0.8266786 0 0.5389965 0.81676114 0.2787134 0.000000 3.295837 0.000000 
str_11  str_1 0.8176492 0 5.0673306 4.45461768 0.8664189 6.549293 1.686399 2.079442 

我正在尝试遍历每一行和每一列。我要计算列方向的最小值和最大值,然后执行以下计算:

for (i in 1:nrow(log_trans2)){
    for (j in 3:ncol(log_trans2)){
        log_trans2[i, j] = log_trans2[i, ..j] -
          min(log_trans2[i, 3:ncol(log_trans2)]) / 
          (max(log_trans2[i, 3:ncol(log_trans2)]) - min(log_trans2[i, 3:ncol(log_trans2)]))
       }}

我在收到错误

后添加了..j
  

[.data.table(log_trans2,i,j)中的错误:j(第二个参数   [...]内是单个符号,但未找到列名“ j”。   也许您打算使用DT [,..j]。与data.frame的区别是   在常见问题解答1.1中进行了仔细的解释并进行了解释

,但执行时间更长(例如数小时)。如何使用foreach或Apply函数减少计时?

公式:

=(r-min(col))/(max(col)-min(col))

预期结果将是

    col1    col2    1   2   3   4   5   6   7   8
   Str_11   Str_44  0.029847796820572   0   0.080259104746805   0.11295123566895    0.405795371744574   0.138441206009843   0.167481921848205   0
   Str_11   Str_22  0.130782597207229   0   0.237248831160936   0   0.165998575836442   0.189535736027761   0.121002270272179   0
   Str_11   Str_18  0.119536094709514   0   0.071369116220582   0.099413248590107   0.040538762756246   0   0.39554907557078    0
   Str_11   Str_1   0.118230460268521   0   0.670970792433184   0.54220015449667    0.126020313332003   0.825306647460321   0.202392768285567   0.251126401405454

2 个答案:

答案 0 :(得分:2)

这是避免循环的一种方法:

#Exclude columns which are not required for calculation
temp <- as.matrix(df[, -c(1:2)])
#Get column-wise minimum
min_vals <- matrixStats::colMins(temp)
#Get column-wise maximum
max_vals <- matrixStats::colMaxs(temp)
#Subtract minimum value of column from each element
s1 <- sweep(temp, 2, min_vals, `-`)
#Divide it by max - min
sweep(s1, 2, (max_vals - min_vals), `/`)

答案 1 :(得分:2)

也许您可以将以下代码与base R一起使用。

这个想法是首先获取每列的范围,然后将这些列作为可操纵的矩阵。我相信这比使用for循环(来自矩阵处理的好处)要快得多,即

r <- apply(d,2,range)
df[-c(1:2)] <- t((t(df[-c(1:2)]) - r[1,])/as.vector(diff(r)))

如此

> df
     ol1   col2         1   2          3         4         5         6         7 8
1 str_11 str_44 0.0000000 NaN 0.01482649 0.2083202 1.0000000 0.1677451 0.1692960 0
2 str_11 str_22 1.0000000 NaN 0.27664986 0.0000000 0.3434840 0.2296549 0.0000000 0
3 str_11 str_18 0.8885766 NaN 0.00000000 0.1833516 0.0000000 0.0000000 1.0000000 0
4 str_11  str_1 0.8756412 NaN 1.00000000 1.0000000 0.2340315 1.0000000 0.2964541 1

数据

> dput(df)
structure(list(ol1 = structure(c(1L, 1L, 1L, 1L), .Label = "str_11", class = "factor"), 
    col2 = structure(4:1, .Label = c("str_1", "str_18", "str_22", 
    "str_44"), class = "factor"), `1` = c(0.2064191, 0.9044563, 
    0.8266786, 0.8176492), `2` = c(0L, 0L, 0L, 0L), `3` = c(0.6061358, 
    1.7917595, 0.5389965, 5.0673306), `4` = c(0.92798677, 0, 
    0.81676114, 4.45461768), `5` = c(2.7899374, 1.1412787, 0.2787134, 
    0.8664189), `6` = c(1.098612, 1.504077, 0, 6.549293), `7` = c(1.395511, 
    1.008228, 3.295837, 1.686399), `8` = c(0, 0, 0, 2.079442)), class = "data.frame", row.names = c(NA, 
-4L))