Is there an elegant way to calculate the values of composite keys from a combination of "primary key values" in a data set?

Asked: 2017-08-10 13:13:24

Tags: r dplyr

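NOTE: This is not a time series problem, which could easily be solved with zoo (or at least, I cannot figure out how to "zoo" this problem :( )

I have a data set with a number of "key columns", and values associated with the combinations where only one key column is set. The values for rows where multiple key columns are set can be calculated from the "one key column set" rows.

Using conventional programming techniques this is straightforward (but cumbersome), as shown below. I am hoping there is a better, more elegant way to do this in R.

In this example I have three keys. For a composite key such as [1,1,0], I would calculate the value from the two primary keys Val[1,0,0] and Val[0,1,0]. With the simple mean used in this example, that is mean(c(2, 5)) = 3.5.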
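To make the arithmetic concrete, here is a minimal sketch of that single calculation in base R. The variable names (val_100, val_010, val_110) are made up for illustration; the two input values are the ones listed for [1,0,0] and [0,1,0] in the data below.

```r
# Primary-key values taken from the example data
val_100 <- 2  # Val for key combination [1,0,0]
val_010 <- 5  # Val for key combination [0,1,0]

# The values must be combined into one vector first:
# mean(2, 5) would treat 5 as the `trim` argument and return 2
val_110 <- mean(c(val_100, val_010))
val_110
```

This gives 3.5 for the composite key [1,1,0].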

library(dplyr)  # provides %>%, filter(), union(), left_join() and re-exports tribble()

myMatrix <- tribble(
  ~`1`, ~`2`, ~`3`, ~Val,
  0,0,0,1,
  1,0,0,2,
  2,0,0,2,
  0,1,0,5,
  1,1,0,NA,
  2,1,0,NA,
  0,2,0,6,
  1,2,0,NA,
  2,2,0,NA,
  0,0,1,1,
  1,0,1,NA,
  2,0,1,NA,
  0,1,1,NA,
  1,1,1,NA,
  2,1,1,NA,
  0,2,1,NA,
  1,2,1,NA,
  2,2,1,NA
  )

#Filter for NA in the Val col
tmpNARows <- myMatrix %>% filter(is.na(Val)) %>% select(-Val)
#Take the 
tmpFirstRow <- TRUE
for (myR in 1:nrow(tmpNARows)) {
  #For each row in the NA table
  tmpMyNARow<-tmpNARows[myR,]
  tmpFirstElement <- TRUE
  for (myC in 1:ncol(tmpMyNARow)) {
    #find the records that make up this one's parts 
    #ignore columns with value 0
    if (0 != tmpMyNARow[myC]) { 
      #Make Base Record for lookup
      tmpMyBaseRow <- tmpMyNARow
      for (myC2 in 1:ncol(tmpMyNARow)) {
        if (myC2!=myC) { tmpMyBaseRow[myC2] <- 0 }
      }
      if(tmpFirstElement == TRUE) {
        #Make a new Base table
        tmpMyBaseTable <- tmpMyBaseRow       
        tmpFirstElement <- FALSE
      } else {
        #Append the Base row to the Base table
        tmpMyBaseTable <- union(tmpMyBaseTable, tmpMyBaseRow)
      }
    }
  }
  #Calculate the mean and store in as Val
  tmpVal <- (left_join(tmpMyBaseTable, myMatrix) %>% summarise(mean(Val)))[[1]]
  tmpMyNARowWithVal <- tmpMyNARow %>% mutate(Val = tmpVal)
  if (tmpFirstRow == TRUE) {
    tmpMyResultMatrix <- tmpMyNARowWithVal
    tmpFirstRow <- FALSE
  } else {
    tmpMyResultMatrix <- union(tmpMyResultMatrix,tmpMyNARowWithVal)
  }
}
#filter for non NA
tmpNonNARows <- myMatrix %>% filter(!is.na(Val))
#Add the calculated rows
myCalculatedMatrix <- union(tmpNonNARows, tmpMyResultMatrix)

#lets have a look
myCalculatedMatrix
#the (1,1,0) element is indeed 3.5 so it appears to be working.

The expected result should look like this:

myCalculatedMatrix %>% arrange_all()
# A tibble: 18 x 4
     `1`   `2`   `3`      Val
   <dbl> <dbl> <dbl>    <dbl>
 1     0     0     0 1.000000
 2     0     0     1 1.000000
 3     0     1     0 5.000000
 4     0     1     1 3.000000
 5     0     2     0 6.000000
 6     0     2     1 3.500000
 7     1     0     0 2.000000
 8     1     0     1 1.500000
 9     1     1     0 3.500000
10     1     1     1 2.666667
11     1     2     0 4.000000
12     1     2     1 3.000000
13     2     0     0 2.000000
14     2     0     1 1.500000
15     2     1     0 3.500000
16     2     1     1 2.666667
17     2     2     0 4.000000
18     2     2     1 3.000000

1 Answer:

Answer 0 (score: 1)

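Although the question is explicitly tagged dplyr, I started with a data.table solution, which I hope is more "elegant". At least it avoids the nested for loops.

Edit: I have added a dplyr/tidyr version of the data.table approach.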

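The OP has a data set with a number of "key columns" and values associated with the combinations where only one key column is set. Then there is a second data set where multiple key columns are set and the values are missing. The task is to calculate the missing values from the "one key column set" rows of the first data set.

Unfortunately, the given data myMatrix contains a mix of both data sets, which adds to the complexity of the problem.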
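To illustrate the two data sets hidden in myMatrix, the following sketch separates them with dplyr. The helper column nKeysSet and the names tagged, primaryRows and compositeRows are hypothetical and appear neither in the question nor in the solutions below; the data is repeated so that the snippet runs on its own.

```r
library(dplyr)

# Example data from the question (several rows per line to save space)
myMatrix <- tribble(
  ~`1`, ~`2`, ~`3`, ~Val,
  0,0,0,1,   1,0,0,2,   2,0,0,2,
  0,1,0,5,   1,1,0,NA,  2,1,0,NA,
  0,2,0,6,   1,2,0,NA,  2,2,0,NA,
  0,0,1,1,   1,0,1,NA,  2,0,1,NA,
  0,1,1,NA,  1,1,1,NA,  2,1,1,NA,
  0,2,1,NA,  1,2,1,NA,  2,2,1,NA
)

# Count how many key columns are set (non-zero) in each row
tagged <- myMatrix %>%
  mutate(nKeysSet = (`1` != 0) + (`2` != 0) + (`3` != 0))

# "First data set": exactly one key column set, Val is known
primaryRows <- tagged %>% filter(nKeysSet == 1)

# "Second data set": several key columns set, Val is missing
compositeRows <- tagged %>% filter(nKeysSet > 1)
```

The all-zero row (0,0,0) belongs to neither group, which is why it needs no lookup in the solutions below.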

data.table solution

library(data.table)

# convert to data.table, add column with row numbers for subsequent join
DT <- data.table(myMatrix)[, rn := .I]
# reshape from wide to long format, 
# rename column using a self-explanatory name
DT_long <- melt(DT, id.vars = c("rn", "Val"), na.rm  = TRUE, value.name = "key")
# extract primary keys
primary_keys <- DT_long[!is.na(Val) & key > 0]
primary_keys
   rn Val variable key
1:  2   2        1   1
2:  3   2        1   2
3:  4   5        2   1
4:  7   6        2   2
5: 10   1        3   1
# right join to keep all rows in DT_long
result <- primary_keys[DT_long, on = c("variable", "key")][
  # calculate new Val by aggregating row-wise
  , .(calcVal = mean(c(Val, i.Val), na.rm = TRUE)), by = .( rn = i.rn)]        
result
    rn  calcVal
 1:  1 1.000000
 2:  2 2.000000
 3:  3 2.000000
 4:  4 5.000000
 5:  5 3.500000
 6:  6 3.500000
 7:  7 6.000000
 8:  8 4.000000
 9:  9 4.000000
10: 10 1.000000
11: 11 1.500000
12: 12 1.500000
13: 13 3.000000
14: 14 2.666667
15: 15 2.666667
16: 16 3.500000
17: 17 3.000000
18: 18 3.000000
# join calculated values with original table, remove row numbers as no longer needed
result <- result[DT, on = "rn"][, rn := NULL][]

# beautify result for easier comparison
result[, setcolorder(.SD, c(names(myMatrix), "calcVal"))][, setorderv(.SD, names(.SD))]
    1 2 3 Val  calcVal
 1: 0 0 0   1 1.000000
 2: 0 0 1   1 1.000000
 3: 0 1 0   5 5.000000
 4: 0 1 1  NA 3.000000
 5: 0 2 0   6 6.000000
 6: 0 2 1  NA 3.500000
 7: 1 0 0   2 2.000000
 8: 1 0 1  NA 1.500000
 9: 1 1 0  NA 3.500000
10: 1 1 1  NA 2.666667
11: 1 2 0  NA 4.000000
12: 1 2 1  NA 3.000000
13: 2 0 0   2 2.000000
14: 2 0 1  NA 1.500000
15: 2 1 0  NA 3.500000
16: 2 1 1  NA 2.666667
17: 2 2 0  NA 4.000000
18: 2 2 1  NA 3.000000

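Note that the data.table code above was written to explain the processing steps. Rewriting it with more chaining would make it more concise, because some of the intermediate results could be skipped.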
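As a rough sketch of what such a chained version might look like: the data is rebuilt inline so the snippet is self-contained, and the name calc is made up for this illustration. It assumes the same logic as the step-by-step code above and only drops the primary_keys and result intermediates.

```r
library(data.table)

# Rebuild the example data; column `1` varies fastest, as in myMatrix
DT <- data.table(
  `1` = rep(0:2, times = 6),
  `2` = rep(rep(0:2, each = 3), times = 2),
  `3` = rep(0:1, each = 9),
  Val = c(1, 2, 2, 5, NA, NA, 6, NA, NA, 1, rep(NA, 8))
)[, rn := .I]

# Long format: one row per (rn, key column)
DT_long <- melt(DT, id.vars = c("rn", "Val"), value.name = "key")

calc <- DT_long[!is.na(Val) & key > 0][   # primary keys
  DT_long, on = c("variable", "key")][    # right join keeps all rows of DT_long
  , .(calcVal = mean(c(Val, i.Val), na.rm = TRUE)), by = .(rn = i.rn)][
  DT, on = "rn"][                         # bring back the original key columns
  , rn := NULL][]
calc
```

Whether this reads better than the step-by-step version is a matter of taste; the intermediate tables are easier to inspect when something goes wrong.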

dplyr/tidyr solution

The following code is a "translation" of the data.table solution:

library(dplyr)
library(tidyr)

tmpMatrix <- myMatrix %>% 
  mutate(rn = row_number()) 
tmpLong <- tmpMatrix  %>% 
  gather(Col, Keys, -Val, -rn) %>% 
  print()
tmpPrimKeys <- tmpLong %>% 
  filter(!is.na(Val) & Keys > 0) %>% 
  select(-rn) %>% 
  print()   
tmpLong %>% 
  left_join(tmpPrimKeys, by = c("Col", "Keys")) %>% 
  group_by(rn) %>% 
  summarise(calcVal = mean(c(Val.x, Val.y), na.rm = TRUE)) %>% 
  inner_join(tmpMatrix, by = "rn") %>% 
  select(num_range("", 1:3), Val, calcVal) %>% 
  arrange_all()
# A tibble: 18 x 5
     `1`   `2`   `3`   Val  calcVal
   <dbl> <dbl> <dbl> <dbl>    <dbl>
 1     0     0     0     1 1.000000
 2     0     0     1     1 1.000000
 3     0     1     0     5 5.000000
 4     0     1     1    NA 3.000000
 5     0     2     0     6 6.000000
 6     0     2     1    NA 3.500000
 7     1     0     0     2 2.000000
 8     1     0     1    NA 1.500000
 9     1     1     0    NA 3.500000
10     1     1     1    NA 2.666667
11     1     2     0    NA 4.000000
12     1     2     1    NA 3.000000
13     2     0     0     2 2.000000
14     2     0     1    NA 1.500000
15     2     1     0    NA 3.500000
16     2     1     1    NA 2.666667
17     2     2     0    NA 4.000000
18     2     2     1    NA 3.000000
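In current tidyr, gather() is superseded by pivot_longer(). The following is a sketch of the same pipeline using pivot_longer(), assuming tidyr 1.0.0 or later. The name calcResult is made up for this illustration, and the data is repeated so the snippet runs on its own.

```r
library(dplyr)
library(tidyr)

# Example data from the question (several rows per line to save space)
myMatrix <- tribble(
  ~`1`, ~`2`, ~`3`, ~Val,
  0,0,0,1,   1,0,0,2,   2,0,0,2,
  0,1,0,5,   1,1,0,NA,  2,1,0,NA,
  0,2,0,6,   1,2,0,NA,  2,2,0,NA,
  0,0,1,1,   1,0,1,NA,  2,0,1,NA,
  0,1,1,NA,  1,1,1,NA,  2,1,1,NA,
  0,2,1,NA,  1,2,1,NA,  2,2,1,NA
)

tmpMatrix <- myMatrix %>%
  mutate(rn = row_number())

# Long format: one row per (rn, key column)
tmpLong <- tmpMatrix %>%
  pivot_longer(-c(Val, rn), names_to = "Col", values_to = "Keys")

# Rows with a known Val and a non-zero key serve as the lookup table
tmpPrimKeys <- tmpLong %>%
  filter(!is.na(Val), Keys > 0) %>%
  select(-rn)

calcResult <- tmpLong %>%
  left_join(tmpPrimKeys, by = c("Col", "Keys")) %>%
  group_by(rn) %>%
  summarise(calcVal = mean(c(Val.x, Val.y), na.rm = TRUE)) %>%
  inner_join(tmpMatrix, by = "rn") %>%
  select(`1`, `2`, `3`, Val, calcVal) %>%
  arrange(`1`, `2`, `3`)
calcResult
```

The join and aggregation steps are unchanged; only the reshaping call and the explicit column selection and sorting differ from the code above.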