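I have a data set with many "key columns", plus values that are associated only with combinations in which a single key column is set. The values for rows in which several key columns are set can be calculated from the "one key column set" rows.
Using conventional programming techniques this is straightforward (but tedious), as shown below. I am hoping there is a better, more elegant way to do it in R.
In this example I have three keys. For a combined key such as [1,1,0], I calculate the value from the two primary keys Val[1,0,0] and Val[0,1,0]. With the simple mean used in this example, that is the mean of 2 and 5, i.e. 3.5.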
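The rule described above can be sketched in a few lines of base R. The variable names (val_100 and so on) are made up for illustration; the numbers are the Val entries of the three single-key rows with key value 1 in the example data.

```r
# Values of the three single-key rows from the example data
val_100 <- 2  # Val for keys [1,0,0]
val_010 <- 5  # Val for keys [0,1,0]
val_001 <- 1  # Val for keys [0,0,1]

# A combined key takes the mean of the single-key values it is built from
val_110 <- mean(c(val_100, val_010))           # keys [1,1,0] -> 3.5
val_111 <- mean(c(val_100, val_010, val_001))  # keys [1,1,1] -> 2.666667

val_110
val_111
```

Note that the values have to be wrapped in c(); mean(2, 5) would only average its first argument.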
library(tibble)
library(dplyr)
myMatrix <- tribble(
~`1`, ~`2`, ~`3`, ~Val,
0,0,0,1,
1,0,0,2,
2,0,0,2,
0,1,0,5,
1,1,0,NA,
2,1,0,NA,
0,2,0,6,
1,2,0,NA,
2,2,0,NA,
0,0,1,1,
1,0,1,NA,
2,0,1,NA,
0,1,1,NA,
1,1,1,NA,
2,1,1,NA,
0,2,1,NA,
1,2,1,NA,
2,2,1,NA
)
#Filter for NA in the Val col
tmpNARows <- myMatrix %>% filter(is.na(Val)) %>% select(-Val)
#Take the
tmpFirstRow <- TRUE
for (myR in 1:nrow(tmpNARows)) {
  #For each row in the NA table
  tmpMyNARow <- tmpNARows[myR,]
  tmpFirstElement <- TRUE
  for (myC in 1:ncol(tmpMyNARow)) {
    #find the records that make up this one's parts
    #ignore columns with value 0
    if (0 != tmpMyNARow[myC]) {
      #Make Base Record for lookup
      tmpMyBaseRow <- tmpMyNARow
      for (myC2 in 1:ncol(tmpMyNARow)) {
        if (myC2 != myC) { tmpMyBaseRow[myC2] <- 0 }
      }
      if (tmpFirstElement == TRUE) {
        #Make a new Base table
        tmpMyBaseTable <- tmpMyBaseRow
        tmpFirstElement <- FALSE
      } else {
        #Append the Base row to the Base table
        tmpMyBaseTable <- union(tmpMyBaseTable, tmpMyBaseRow)
      }
    }
  }
  #Calculate the mean and store in as Val
  tmpVal <- (left_join(tmpMyBaseTable, myMatrix) %>% summarise(mean(Val)))[[1]]
  tmpMyNARowWithVal <- tmpMyNARow %>% mutate(Val = tmpVal)
  if (tmpFirstRow == TRUE) {
    tmpMyResultMatrix <- tmpMyNARowWithVal
    tmpFirstRow <- FALSE
  } else {
    tmpMyResultMatrix <- union(tmpMyResultMatrix, tmpMyNARowWithVal)
  }
}
#filter for non NA
tmpNonNARows <- myMatrix %>% filter(!is.na(Val))
#Add the calculated rows
myCalculatedMatrix <- union(tmpNonNARows, tmpMyResultMatrix)
#lets have a look
myCalculatedMatrix
#the (1,1,0) element is indeed 3.5 so it appears to be working.
The expected result should look like this:
myCalculatedMatrix %>% arrange_all()
# A tibble: 18 x 4
     `1`   `2`   `3`      Val
   <dbl> <dbl> <dbl>    <dbl>
 1     0     0     0 1.000000
 2     0     0     1 1.000000
 3     0     1     0 5.000000
 4     0     1     1 3.000000
 5     0     2     0 6.000000
 6     0     2     1 3.500000
 7     1     0     0 2.000000
 8     1     0     1 1.500000
 9     1     1     0 3.500000
10     1     1     1 2.666667
11     1     2     0 4.000000
12     1     2     1 3.000000
13     2     0     0 2.000000
14     2     0     1 1.500000
15     2     1     0 3.500000
16     2     1     1 2.666667
17     2     2     0 4.000000
18     2     2     1 3.000000
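Answer 0 (score: 1)
Although this question is explicitly tagged dplyr, I started with a data.table solution, which I hope is "more elegant"; at least it avoids the nested for loops.
Edit: I have added a dplyr / tidyr version of the data.table approach.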
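The OP's data set contains many "key columns" and values that are associated only with combinations in which a single key column is set. Then there is a second data set in which several key columns are set and the values are missing. The task is to compute the missing values from the "one key column set" rows of the first data set.
Unfortunately, the given data myMatrix contains a mix of both data sets, which adds to the complexity of the problem.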
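To make the "two data sets in one table" point concrete, here is a minimal base R sketch on a made-up miniature of myMatrix. The names mini, k1, k2, known and to_fill are assumptions for illustration only and do not appear in the original code.

```r
# Hypothetical miniature of myMatrix with two key columns (k1, k2 are assumed names)
mini <- data.frame(
  k1  = c(0, 1, 0, 1),
  k2  = c(0, 0, 1, 1),
  Val = c(1, 2, 5, NA)
)

# Rows that already carry a value: the all-zero row and the rows
# in which only one key column is set
known <- mini[!is.na(mini$Val), ]

# Rows in which several key columns are set and the value still has to be computed
to_fill <- mini[is.na(mini$Val), ]

nrow(known)    # 3
nrow(to_fill)  # 1
```

The solutions below do not split the table like this; they keep a single table and tell the two kinds of rows apart by filtering on Val and on the key value.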
data.table solution
library(data.table)
# convert to data.table, add column with row numbers for subsequent join
DT <- data.table(myMatrix)[, rn := .I]
# reshape from wide to long format,
# rename column using a self-explanatory name
DT_long <- melt(DT, id.vars = c("rn", "Val"), na.rm = TRUE, value.name = "key")
# extract primary keys
primary_keys <- DT_long[!is.na(Val) & key > 0]
primary_keys
   rn Val variable key
1:  2   2        1   1
2:  3   2        1   2
3:  4   5        2   1
4:  7   6        2   2
5: 10   1        3   1
# right join to keep all rows in DT_long
result <- primary_keys[DT_long, on = c("variable", "key")][
# calculate new Val by aggregating row-wise
, .(calcVal = mean(c(Val, i.Val), na.rm = TRUE)), by = .( rn = i.rn)]
result
    rn  calcVal
 1:  1 1.000000
 2:  2 2.000000
 3:  3 2.000000
 4:  4 5.000000
 5:  5 3.500000
 6:  6 3.500000
 7:  7 6.000000
 8:  8 4.000000
 9:  9 4.000000
10: 10 1.000000
11: 11 1.500000
12: 12 1.500000
13: 13 3.000000
14: 14 2.666667
15: 15 2.666667
16: 16 3.500000
17: 17 3.000000
18: 18 3.000000
# join calculated values with original table, remove row numbers as no longer needed
result <- result[DT, on = "rn"][, rn := NULL][]
# beautify result for easier comparison
result[, setcolorder(.SD, c(names(myMatrix), "calcVal"))][, setorderv(.SD, names(.SD))]
    1 2 3 Val  calcVal
 1: 0 0 0   1 1.000000
 2: 0 0 1   1 1.000000
 3: 0 1 0   5 5.000000
 4: 0 1 1  NA 3.000000
 5: 0 2 0   6 6.000000
 6: 0 2 1  NA 3.500000
 7: 1 0 0   2 2.000000
 8: 1 0 1  NA 1.500000
 9: 1 1 0  NA 3.500000
10: 1 1 1  NA 2.666667
11: 1 2 0  NA 4.000000
12: 1 2 1  NA 3.000000
13: 2 0 0   2 2.000000
14: 2 0 1  NA 1.500000
15: 2 1 0  NA 3.500000
16: 2 1 1  NA 2.666667
17: 2 2 0  NA 4.000000
18: 2 2 1  NA 3.000000
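Note that the data.table code above was written to explain the processing steps. Rewriting it with more chaining would make it more concise, because some intermediate results could be skipped.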
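As a rough sketch of what such a more chained version might look like: the block below rebuilds the example data directly as a data.table so that it is self-contained, then does the filter, join and aggregation in one chain. The names calc and out are made up for this sketch; the steps themselves are the ones from the code above.

```r
library(data.table)

# Same 18 rows as myMatrix in the question, built directly as a data.table
DT <- data.table(
  `1` = rep(0:2, times = 6),
  `2` = rep(rep(0:2, each = 3), times = 2),
  `3` = rep(0:1, each = 9),
  Val = c(1, 2, 2, 5, NA, NA, 6, NA, NA, 1, rep(NA, 8))
)
DT[, rn := .I]

# reshape to long format once, as both sides of the join need it
DT_long <- melt(DT, id.vars = c("rn", "Val"), na.rm = TRUE, value.name = "key")

# primary keys, right join and row-wise aggregation in a single chain
calc <- DT_long[!is.na(Val) & key > 0][DT_long, on = c("variable", "key")][
  , .(calcVal = mean(c(Val, i.Val), na.rm = TRUE)), by = .(rn = i.rn)]

# attach the calculated values to the original rows and drop the helper column
out <- calc[DT, on = "rn"][, rn := NULL][]

out[`1` == 1 & `2` == 1 & `3` == 0, calcVal]  # 3.5 for the [1,1,0] row
```

Only primary_keys is skipped as a named intermediate here; DT_long is kept because it is used twice.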
dplyr / tidyr solution
The following code is a "translation" of the data.table solution:
library(dplyr)
library(tidyr)
tmpMatrix <- myMatrix %>%
  mutate(rn = row_number())
tmpLong <- tmpMatrix %>%
  gather(Col, Keys, -Val, -rn) %>%
  print()
tmpPrimKeys <- tmpLong %>%
  filter(!is.na(Val) & Keys > 0) %>%
  select(-rn) %>%
  print()
tmpLong %>%
  left_join(tmpPrimKeys, by = c("Col", "Keys")) %>%
  group_by(rn) %>%
  summarise(calcVal = mean(c(Val.x, Val.y), na.rm = TRUE)) %>%
  inner_join(tmpMatrix, by = "rn") %>%
  select(num_range("", 1:3), Val, calcVal) %>%
  arrange_all()
# A tibble: 18 x 5
     `1`   `2`   `3`   Val  calcVal
   <dbl> <dbl> <dbl> <dbl>    <dbl>
 1     0     0     0     1 1.000000
 2     0     0     1     1 1.000000
 3     0     1     0     5 5.000000
 4     0     1     1    NA 3.000000
 5     0     2     0     6 6.000000
 6     0     2     1    NA 3.500000
 7     1     0     0     2 2.000000
 8     1     0     1    NA 1.500000
 9     1     1     0    NA 3.500000
10     1     1     1    NA 2.666667
11     1     2     0    NA 4.000000
12     1     2     1    NA 3.000000
13     2     0     0     2 2.000000
14     2     0     1    NA 1.500000
15     2     1     0    NA 3.500000
16     2     1     1    NA 2.666667
17     2     2     0    NA 4.000000
18     2     2     1    NA 3.000000