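I have a dataset with many columns, but I will only mention the ones needed for this operation and provide a makeshift dataset (I also believe you don't need the ID information; I included it only to make things easier to follow).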
Business Division | Local Claim ID | CMB
GC | 123 | **Y**
GC | 124 | N
NAC | 125 | N
NAC | 126 | N
NAC | 127 | **Y**
GC | 128 | N
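I want to get rid of the CMB column, and wherever its original value is Y, replace the Business Division value with CMB. Basically, I want the table to look as follows (Business Division now has 3 classes):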
Business Division | Local Claim ID
**CMB** | 123
GC | 124
NAC | 125
NAC | 126
**CMB** | 127
GC | 128
Here is the dput output to reproduce my data:
structure(list(Business.Division = c("CMB", "GC", "NAC", "NAC",
"CMB", "GC"), Local.Claim.ID = 123:128, CMB = c("Y", "N",
"N", "N", "Y", "N")), .Names = c("Business.Division", "Local.Claim.ID",
"CMB"), row.names = c(NA, -6L), class = "data.frame")
Answer 0 (score: 4)
If you want to both evaluate only the relevant rows and update in place, I would go with data.table
library(data.table)
setDT(df)[CMB == "Y", Business.Division := "CMB"][, CMB := NULL]
# Business.Division Local.Claim.ID
# 1: CMB 123
# 2: GC 124
# 3: NAC 125
# 4: NAC 126
# 5: CMB 127
# 6: GC 128
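One point worth keeping in mind with the approach above: `setDT()` converts `df` to a data.table by reference, so the original object itself is changed. The following is a minimal sketch, assuming you want to keep the original data frame intact; the small data set and the name `dt` are hypothetical and only mirror the question's columns.

```r
library(data.table)

# Hypothetical small data set mirroring the question's columns
df <- data.frame(Business.Division = c("GC", "GC", "NAC"),
                 Local.Claim.ID = 123:125,
                 CMB = c("Y", "N", "Y"),
                 stringsAsFactors = FALSE)

# Work on a deep copy so that `df` itself is left untouched
dt <- setDT(copy(df))

# Same two steps as above: overwrite flagged rows, then drop the CMB column
dt[CMB == "Y", Business.Division := "CMB"][, CMB := NULL]
```

After this, `dt` holds the updated table while `df` still has its original values and its CMB column.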
Answer 1 (score: 3)
Try:
library(dplyr)
df %>%
  mutate(Business.Division = replace(Business.Division, which(CMB == 'Y'), 'CMB')) %>%
  select(-CMB)
Which gives:
# Business.Division Local.Claim.ID
#1 CMB 123
#2 GC 124
#3 NAC 125
#4 NAC 126
#5 CMB 127
#6 GC 128
**Benchmark**
Updated to add a benchmark of the suggestions:
df <- data.frame(Business.Division = sample(c("GC", "NAC"), 10e6, replace = TRUE),
                 Local.Claim.ID = sample(100:199, 10e6, replace = TRUE),
                 CMB = sample(c("Y", "N"), 10e6, replace = TRUE),
                 stringsAsFactors = FALSE)
library(dplyr)
library(data.table)
library(microbenchmark)
mbm <- microbenchmark(
  me = mutate(df, Business.Division = replace(Business.Division, which(CMB == "Y"), "CMB")),
  stevensp = (df$Business.Division <- ifelse(df$CMB == "Y", "CMB", df$Business.Division)),
  mts = (df$Business.Division[which(df$CMB == "Y")] = "CMB"),
  david1 = setDT(df)[CMB == "Y", Business.Division := "CMB"],
  david2 = setkey(setDT(df), CMB)[.("Y"), Business.Division := "CMB"],
  times = 10
)
David's approaches are considerably faster:
> mbm
Unit: milliseconds
expr min lq mean median uq max neval cld
me 496.79251 556.70752 592.35165 608.23875 634.88809 661.33805 10 b
stevensp 3449.53516 3518.47649 3585.91006 3572.62433 3681.19332 3718.06284 10 c
mts 591.22479 654.01000 661.02210 661.41281 679.53060 719.74752 10 b
david1 58.67554 62.15468 66.85337 62.31426 62.99337 92.49148 10 a
david2 86.04280 89.42500 117.76540 89.61656 89.79652 232.45398 10 a
Answer 2 (score: 2)
You could use ifelse on its own (no need for any other packages):
df$Business.Division = ifelse(df$CMB == "Y", "CMB", df$Business.Division)
As DavidArenburg pointed out in the comments below, this is not very efficient for large data. But if your data isn't very big, this is a nice, simple approach.
Answer 3 (score: 1)
Or (works only if Business.Division is not a factor)
df$Business.Division[which(df$CMB == "Y")] = "CMB"
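The factor caveat matters because assigning a value that is not among a factor's existing levels produces NA with a warning. Below is a minimal sketch of two possible workarounds, assuming the column was read in as a factor; the data and the names `df_chr` and `df_fct` are hypothetical and not part of the original question.

```r
# Hypothetical data in which Business.Division is a factor
df_chr <- data.frame(Business.Division = factor(c("GC", "GC", "NAC")),
                     CMB = c("Y", "N", "Y"),
                     stringsAsFactors = FALSE)
df_fct <- df_chr

# Option 1: convert the factor to character, then assign as usual
df_chr$Business.Division <- as.character(df_chr$Business.Division)
df_chr$Business.Division[df_chr$CMB == "Y"] <- "CMB"

# Option 2: keep the factor, but register "CMB" as a level before assigning
levels(df_fct$Business.Division) <- c(levels(df_fct$Business.Division), "CMB")
df_fct$Business.Division[df_fct$CMB == "Y"] <- "CMB"
```

Option 1 is simpler if nothing downstream relies on the column being a factor; Option 2 preserves the factor type at the cost of one extra line.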