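Look up (and replace) values in a data frame based on a lookup table (with multiple columns)

Date: 2013-05-20 20:32:09

Tags: r

Update below.

Original:

I am trying to find the most elegant (simple and concise) way to replace the values of certain columns based on matching two columns in another data frame.

This is the table containing the columns whose values I want to replace (based on the values they contain).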

> cost.table
  Identifier Phase.0.Difficulty Phase.1.Complexity Phase.2.Complexity Phase.3.Complexity Phase.4.Complexity Phase.5.Complexity
1        FS1                Low                Low                Low             Medium             Medium               High
2        FS2               High               High               High             Medium             Medium             Medium
3        FS3               High                Low                Low               High               High               High
4        FS4               High             Medium             Medium             Medium             Medium             Medium
5        FS5               High             Medium             Medium               High             Medium             Medium
  Phase.6.Complexity Transaction.Feasibility Approach
1               High                  Medium        B
2             Medium                  Medium        I
3               High                  Medium        B
4             Medium                  Medium        I
5             Medium                  Medium        B

Below is the lookup table I would like to use to find the correct replacement values.

> cost.approach.difficulty
  Approach Difficulty   Phase 0  Phase 1  Phase 2   Phase 3  Phase 4  Phase 5  Phase 6
1        B       High 18102.778 29481.67 29481.67 11822.222 30737.78 21634.67 12768.00
2        B        Low  3860.694 15978.47 11175.69  7448.000 12768.00 11467.56 11467.56
3        B     Medium  5323.694 24974.44 15184.17  9221.333 15368.89 12768.00 12768.00
4        I       High 18102.778 74184.44 29481.67 44747.111 69160.00 45249.56 32245.11
5        I        Low  3860.694 26008.89 11175.69 16551.111 35910.00 16876.22 14275.33
6        I     Medium  5323.694 41156.11 15184.17 22373.556 44776.67 23378.44 16876.22
7       RV       High 18102.778 28373.33 29481.67 44747.111 69160.00 45249.56 32245.11
8       RV        Low  3860.694 14870.14 11175.69 16551.111 44776.67 16876.22 14275.33
9       RV     Medium  5323.694 22757.78 15184.17 22373.556 44776.67 23378.44 16876.22

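I am trying to find a simple solution that looks up the matching 'Approach' and 'Difficulty' values in the cost.approach.difficulty table.

So, for example, in cost.table I want the first row's Phase.0.Difficulty replaced with 3860.694 (because it uses approach 'B' and its difficulty is Low).

Does anyone have an elegant, simple solution for looking up values based on two (or more) columns and replacing values across multiple columns?

Thanks,

Andrew

Update -

Two suggested answers rely on merge. My goal is to find a more concise, elegant solution. This is the best approach I have come up with so far: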

cost.approach.difficulty$Phase.0[match(paste(cost.table$Approach, cost.table$Phase.0.Difficulty), paste(cost.approach.difficulty$Approach, cost.approach.difficulty$Difficulty))]

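The problem with this solution is that I need to know the column names in advance, and it still looks like a hack. Does anyone have a more concise solution?

3 answers:

Answer 0: (score: 4)

If you want this to work for a variable number of columns, I suggest reshaping your cost table and lookup table into a more normalized format.

First, this question would be easier to answer if you provided the data in a reproducible format: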

# Create the example data
cost.table <- data.frame(
  "Identifier" = c("FS1", "FS2",  "FS3",  "FS4",  "FS5"),
  "Phase.0.Difficulty" = c("Low", "High", "High", "High", "High"),
  "Phase.1.Complexity" = c("Low", "High", "Low", "Medium", "Medium"),
  "Phase.2.Complexity" = c("Low", "High", "Low", "Medium", "Medium"),
  "Phase.3.Complexity" = c("Medium", "Medium", "High", "Medium", "High"),
  "Phase.4.Complexity" = c("Medium", "Medium", "High", "Medium", "Medium"),
  "Phase.5.Complexity" = c("High", "Medium", "High", "Medium", "Medium"),
  "Phase.6.Complexity" = c("High", "Medium", "High", "Medium", "Medium"),
  "Transaction.Feasibility" = c("Medium", "Medium", "Medium", "Medium", "Medium"),
  "Approach" = c("B", "I", "B", "I", "B"),
  stringsAsFactors = FALSE)

cost.approach.difficulty <- data.frame(
  "Approach" = c("B", "B", "B", "I", "I", "I", "RV", "RV", "RV"),
  "Difficulty" = c("High", "Low", "Medium", "High", "Low", "Medium", "High", "Low", "Medium"),
  "Phase.0" = c(18102.778, 3860.694, 5323.694, 18102.778, 3860.694, 5323.694, 18102.778, 3860.694, 5323.694),
  "Phase.1" = c(29481.67,15978.47, 24974.44, 74184.44, 26008.89, 41156.11, 28373.33, 14870.14, 22757.78),
  "Phase.2" = c(29481.67, 11175.69, 15184.17, 29481.67, 11175.69, 15184.17, 29481.67, 11175.69, 15184.17),
  "Phase.3" = c(11822.222, 7448, 9221.333, 44747.111, 16551.111, 22373.556, 44747.111, 16551.111, 22373.556),
  "Phase.4" = c(30737.78, 12768, 15368.89, 69160, 35910, 44776.67, 69160, 44776.67, 44776.67),
  "Phase.5" = c(21634.67, 11467.56, 12768, 45249.56, 16876.22, 23378.44, 45249.56, 16876.22, 23378.44),
  "Phase.6" = c(12768, 11467.56, 12768, 32245.11, 14275.33, 16876.22, 32245.11, 14275.33, 16876.22),
  stringsAsFactors = FALSE)

After recreating the example data, I used the melt function (its data frame method, melt.data.frame) from the reshape2 package:

# Reshape the data
library(reshape2)

cost.table <- melt(cost.table, id.vars = c("Identifier", "Approach"), 
  value.name = "Size")
cost.table$Phase <- gsub("(\\w+\\.\\d+)\\.(\\w+)", "\\1", 
  as.character(cost.table$variable), perl = TRUE)
cost.table$Type <- gsub("(\\w+\\.\\d+)\\.(\\w+)", "\\2", 
  as.character(cost.table$variable), perl = TRUE)

head(cost.table)

  Identifier Approach           variable Size   Phase       Type
1        FS1        B Phase.0.Difficulty  Low Phase.0 Difficulty
2        FS2        I Phase.0.Difficulty High Phase.0 Difficulty
3        FS3        B Phase.0.Difficulty High Phase.0 Difficulty
4        FS4        I Phase.0.Difficulty High Phase.0 Difficulty
5        FS5        B Phase.0.Difficulty High Phase.0 Difficulty
6        FS1        B Phase.1.Complexity  Low Phase.1 Complexity

cost.approach.difficulty <- melt(cost.approach.difficulty, 
  id.vars = c("Difficulty", "Approach"), variable.name = "Phase")
cost.approach.difficulty$Phase <- as.character(cost.approach.difficulty$Phase)
cost.approach.difficulty$Type <- "Difficulty"
colnames(cost.approach.difficulty)[
  colnames(cost.approach.difficulty) == "Difficulty"] <- "Size"

head(cost.approach.difficulty)

    Size Approach   Phase     value       Type
1   High        B Phase.0 18102.778 Difficulty
2    Low        B Phase.0  3860.694 Difficulty
3 Medium        B Phase.0  5323.694 Difficulty
4   High        I Phase.0 18102.778 Difficulty
5    Low        I Phase.0  3860.694 Difficulty
6 Medium        I Phase.0  5323.694 Difficulty

Once both tables are in a normalized format, you can call merge:

cost.table.filled <- merge(cost.table, cost.approach.difficulty, 
  by = c("Approach", "Size", "Phase", "Type"), all.x = TRUE, all.y = FALSE)

Then, for columns that have no lookup values, you can put the original values back (otherwise you end up with a bunch of NAs):

cost.table.filled$value[is.na(cost.table.filled$value)] <- 
  cost.table.filled$Size[is.na(cost.table.filled$value)]

Then you can dcast the whole thing back into the original format:

cost.table.final <- dcast(cost.table.filled, Identifier + Approach ~ Phase + Type,
  value.var = "value")

head(cost.table.final)

  Identifier Approach Phase.0_Difficulty Phase.1_Complexity Phase.2_Complexity Phase.3_Complexity Phase.4_Complexity Phase.5_Complexity Phase.6_Complexity Transaction.Feasibility_Transaction.Feasibility
1        FS1        B           3860.694                Low                Low             Medium             Medium               High               High                                          Medium
2        FS2        I          18102.778               High               High             Medium             Medium             Medium             Medium                                          Medium
3        FS3        B          18102.778                Low                Low               High               High               High               High                                          Medium
4        FS4        I          18102.778             Medium             Medium             Medium             Medium             Medium             Medium                                          Medium
5        FS5        B          18102.778             Medium             Medium               High             Medium             Medium             Medium                                          Medium

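To replace all the columns, I would melt each lookup table and then rbind them together into one lookup table. That way you only need to call merge once and do not have to worry about replacing the NAs.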
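As a rough sketch of that idea, the snippet below stacks two lookup tables in long format and merges once. Everything in it is a toy assumption: the tables difficulty.lookup and complexity.lookup, their values, and the helper to.long are hypothetical, and base R's stack() stands in for melt so that the example runs without extra packages.

```r
# Toy lookup tables (hypothetical names and values, for illustration only)
difficulty.lookup <- data.frame(
  Approach = c("B", "B", "I", "I"),
  Size     = c("High", "Low", "High", "Low"),
  Phase.0  = c(100, 10, 200, 20),
  stringsAsFactors = FALSE)

complexity.lookup <- data.frame(
  Approach = c("B", "B", "I", "I"),
  Size     = c("High", "Low", "High", "Low"),
  Phase.1  = c(300, 30, 400, 40),
  stringsAsFactors = FALSE)

# Hypothetical helper: turn one wide lookup table into long format
to.long <- function(lookup, type) {
  phase.cols <- grep("^Phase\\.", names(lookup), value = TRUE)
  long <- stack(lookup[phase.cols])  # columns: values, ind
  data.frame(
    Approach = rep(lookup$Approach, times = length(phase.cols)),
    Size     = rep(lookup$Size, times = length(phase.cols)),
    Phase    = as.character(long$ind),
    Type     = type,
    value    = long$values,
    stringsAsFactors = FALSE)
}

# Stack the long lookup tables row-wise into a single lookup table
all.lookups <- rbind(
  to.long(difficulty.lookup, "Difficulty"),
  to.long(complexity.lookup, "Complexity"))

# A cost table that is already in long format (toy data)
items.long <- data.frame(
  Identifier = c("FS1", "FS1", "FS2", "FS2"),
  Approach   = c("B", "B", "I", "I"),
  Phase      = c("Phase.0", "Phase.1", "Phase.0", "Phase.1"),
  Type       = c("Difficulty", "Complexity", "Difficulty", "Complexity"),
  Size       = c("Low", "High", "High", "Low"),
  stringsAsFactors = FALSE)

# One merge fills every row, so no NAs need patching afterwards
filled <- merge(items.long, all.lookups,
  by = c("Approach", "Size", "Phase", "Type"), all.x = TRUE)
filled <- filled[order(filled$Identifier, filled$Phase), ]
filled
```

Because every Type now has entries in the combined lookup, the merged value column stays numeric instead of mixing numbers with leftover labels.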

Answer 1: (score: 0)

In this case, merge should do the trick:

cost.table <- merge(
  x = cost.table,
  y = cost.approach.difficulty[c("Approach", "Difficulty", "Phase.0")],
  by.x = c("Phase.0.Difficulty", "Approach"),
  by.y = c("Difficulty", "Approach"), sort = FALSE
)
cost.table$Phase.0.Difficulty <- NULL
names(cost.table)[names(cost.table) == "Phase.0"] <- "Phase.0.Difficulty"

cost.table
  Approach Identifier Phase.1.Complexity Phase.2.Complexity Phase.3.Complexity Phase.4.Complexity Phase.5.Complexity Phase.6.Complexity Transaction.Feasibility Phase.0.Difficulty
1        B        FS1                Low                Low             Medium             Medium               High               High                  Medium           3860.694
2        I        FS2               High               High             Medium             Medium             Medium             Medium                  Medium          18102.778
3        I        FS4             Medium             Medium             Medium             Medium             Medium             Medium                  Medium          18102.778
4        B        FS3                Low                Low               High               High               High               High                  Medium          18102.778
5        B        FS5             Medium             Medium               High             Medium             Medium             Medium                  Medium          18102.778

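Answer 2: (score: 0)

The simplest answer seems to be:

  • Use paste to combine the lookup columns
  • Use match to find the row numbers in the lookup table

The code below performs the multi-column lookup in a single line.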

cost.approach.difficulty$Phase.0[match(
  paste(cost.table$Approach, cost.table$Phase.0.Difficulty),
  paste(cost.approach.difficulty$Approach, cost.approach.difficulty$Difficulty))]
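To make the one-liner easier to follow, here is the same paste and match idea broken into steps on toy data. The tables lookup and items and their values are made up for illustration; the explicit sep = "|" is an optional choice of mine, not part of the original code.

```r
# Toy data (hypothetical values, for illustration only)
lookup <- data.frame(
  Approach   = c("B", "B", "I", "I"),
  Difficulty = c("High", "Low", "High", "Low"),
  Phase.0    = c(100, 10, 200, 20),
  stringsAsFactors = FALSE)

items <- data.frame(
  Approach           = c("B", "I", "B"),
  Phase.0.Difficulty = c("Low", "High", "High"),
  stringsAsFactors = FALSE)

# 1. Build one composite key per row in each table.
#    An explicit separator helps avoid accidental collisions between keys.
item.key   <- paste(items$Approach, items$Phase.0.Difficulty, sep = "|")
lookup.key <- paste(lookup$Approach, lookup$Difficulty, sep = "|")

# 2. match() returns, for each item key, the row number in the lookup table
#    (NA when there is no matching row).
rows <- match(item.key, lookup.key)

# 3. Index the value column with those row numbers.
phase0.cost <- lookup$Phase.0[rows]
phase0.cost
```

Unlike merge, this keeps the rows in their original order, because match returns one position per input row.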

To iterate over multiple columns, a for loop works fine.
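A minimal sketch of such a loop, again on toy data, might look like this. It assumes that every column to replace is named like Phase.N.Something, that the lookup column for it is named Phase.N, and that the same lookup table applies to both Difficulty and Complexity levels; the names lookup, items, and result are hypothetical.

```r
# Toy data (hypothetical values, for illustration only)
lookup <- data.frame(
  Approach   = c("B", "B", "I", "I"),
  Difficulty = c("High", "Low", "High", "Low"),
  Phase.0    = c(100, 10, 200, 20),
  Phase.1    = c(300, 30, 400, 40),
  stringsAsFactors = FALSE)

items <- data.frame(
  Identifier         = c("FS1", "FS2", "FS3"),
  Phase.0.Difficulty = c("Low", "High", "High"),
  Phase.1.Complexity = c("Low", "High", "Low"),
  Approach           = c("B", "I", "B"),
  stringsAsFactors = FALSE)

# The lookup key is the same for every column, so build it once
lookup.key <- paste(lookup$Approach, lookup$Difficulty, sep = "|")

# Find the columns to replace by pattern rather than by hard-coded name
phase.cols <- grep("^Phase\\.[0-9]+\\.", names(items), value = TRUE)

result <- items
for (col in phase.cols) {
  # "Phase.0.Difficulty" -> "Phase.0", the matching lookup column
  value.col <- sub("^(Phase\\.[0-9]+)\\..*$", "\\1", col)
  rows <- match(paste(items$Approach, items[[col]], sep = "|"), lookup.key)
  result[[col]] <- lookup[[value.col]][rows]
}
result
```

Selecting the columns with grep removes the need to know the column names in advance, as long as the naming pattern holds.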

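Unfortunately, I was hoping for a native solution that takes a vector of columns and combines them for the lookup, but I have not found one yet. I will check other packages to see whether such a function exists.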