I have some data:
df <- data.frame(client = c('123','124','125','126','127','128','129','130','131','132'),
                 CN_SCORE = rnorm(10), VN_SCORE = rnorm(10), CS_SCORE = rnorm(10),
                 code = c('CN',NA,'VN','CS','PO','CS',NA,'BE','VN','CN'))
which looks like this:
client CN_SCORE VN_SCORE CS_SCORE code
1 123 -0.5068107 -0.3046385 0.1605428 CN
2 124 1.3479882 1.0065622 -1.9616174 <NA>
3 125 -0.6053786 -1.7545071 -0.2966574 VN
4 126 0.5240396 0.2735298 1.8139150 CS
5 127 1.3968190 0.3687705 -0.2310896 PO
6 128 0.8715533 0.6128183 -0.7857413 CS
7 129 0.9773130 0.3007104 0.1753607 <NA>
8 130 0.3931267 1.4056442 -1.8190026 BE
9 131 1.1310017 0.9495555 -0.1323718 VN
10 132 -0.3564904 0.2727310 1.5854258 CN
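For each row, I need to set the value of a *_SCORE column to zero if the value in the code column matches the first part of that column's name. The data should look like this: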
client CN_SCORE VN_SCORE CS_SCORE code
1 123 0.00000000 -1.0634683 -0.1879564 CN
2 124 -0.07422132 1.0110481 -1.1999992 <NA>
3 125 -0.82198648 0.0000000 0.6195473 VN
4 126 1.50037013 0.9809830 0.0000000 CS
5 127 0.95236148 0.8130459 0.3088777 PO
6 128 -0.44263511 1.7038295 0.0000000 CS
7 129 -0.36307930 -0.5400340 0.5164958 <NA>
8 130 0.74714432 1.2763654 -0.4331117 BE
9 131 -0.64397662 0.0000000 -0.1199963 VN
10 132 0.00000000 0.5815852 0.6068514 CN
My actual data has about 80 *_SCORE columns.
Is there a simple way to do this?
Thanks
Answer 0: (score: 1)
One possibility involving dplyr and tidyr is:
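library(dplyr)
library(tidyr)

df %>%
  gather(var, val, -c(client, code)) %>%            # wide to long: one row per client/score column
  mutate(val = case_when(is.na(code) ~ val,         # rows with a missing code stay unchanged
                         code == substr(var, 1, 2) ~ 0,
                         TRUE ~ val)) %>%
  spread(var, val)                                  # back to wide format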
client code CN_SCORE CS_SCORE VN_SCORE
1 123 CN 0.0000000 0.05969744 -1.2730816
2 124 <NA> -0.3966455 -0.03788638 0.8005320
3 125 VN -2.3405085 -0.74085810 0.0000000
4 126 CS -1.0002777 0.00000000 1.0621683
5 127 PO 0.5921431 0.11958964 -0.4922398
6 128 CS 1.5583560 0.00000000 0.6772933
7 129 <NA> 1.3697855 0.75409401 1.5662150
8 130 BE 0.1221992 -1.04877408 -0.1939984
9 131 VN 2.5151293 0.33135690 0.0000000
10 132 CN 0.0000000 2.25564140 -0.4702173
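With about 80 *_SCORE columns it may also be possible to skip the reshaping step. The following is only a sketch, assuming dplyr 1.0.0 or later (where across() and cur_column() are available). The four-row df, the seed, and the name res are made up for illustration and are not part of the original answer.

```r
library(dplyr)

set.seed(1)  # hypothetical seed, only to make the sketch reproducible
df <- data.frame(
  client = c("123", "124", "125", "126"),
  CN_SCORE = rnorm(4), VN_SCORE = rnorm(4), CS_SCORE = rnorm(4),
  code = c("CN", NA, "VN", "PO"),
  stringsAsFactors = FALSE
)

res <- df %>%
  mutate(across(
    ends_with("_SCORE"),
    # cur_column() is the name of the column currently being processed;
    # the !is.na(code) guard leaves rows with a missing code untouched
    ~ if_else(!is.na(code) & code == sub("_SCORE$", "", cur_column()), 0, .x)
  ))
res
```

Unlike substr(var, 1, 2), stripping the _SCORE suffix does not assume that every prefix is exactly two characters long.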
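Answer 1: (score: 1)
Here is an idea via dplyr and tidyr. We convert to long format, extract the first part of each column name with a simple regular expression, and compare it with code. Once that is done, we spread back to wide format, i.e.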
library(dplyr)
library(tidyr)
df %>%
gather(var, val, -c(client, code)) %>%
mutate(val = replace(val, sub('_.*', '', var) == code, 0)) %>%
spread(var, val)
client code CN_SCORE CS_SCORE VN_SCORE
1 123 CN 0.0000000 0.2828444 -0.75224398
2 124 <NA> -0.5815069 -0.1053807 -0.03881512
3 125 VN -0.4489411 -1.3682422 0.00000000
4 126 CS -2.4349032 0.0000000 0.75258368
5 127 PO -1.7483976 1.3793556 -0.59094268
6 128 CS 0.2732683 0.0000000 -0.98756547
7 129 <NA> -0.9394162 -1.5184852 -0.20126150
8 130 BE -0.8731287 -0.2340674 -0.68192984
9 131 VN 0.3726439 2.1826383 0.00000000
10 132 CN 0.0000000 3.0400324 -0.33033666
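To see why this works without an explicit NA check, here is a small base R sketch of the two pieces involved. The vectors var, code and val are made-up stand-ins for the long-format columns. Comparing against an NA code yields NA, and in a replacement with a single value an NA index selects nothing, so those rows keep their original value.

```r
# Made-up stand-ins for three rows of the long-format data
var  <- c("CN_SCORE", "VN_SCORE", "CS_SCORE")
code <- c("CN", NA, "VN")
val  <- c(1.5, 2.5, 3.5)

prefix <- sub("_.*", "", var)  # drop the underscore and everything after it
hit    <- prefix == code       # TRUE, NA, FALSE
out    <- replace(val, hit, 0) # only the TRUE position is replaced
out
```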
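Answer 2: (score: 1)
Using base R, a vectorized approach is to build a row/column index matrix and use it to replace values in the data frame. We remove everything after the underscore from the column names and match code against the result to get the column indices. To get the row indices, we find the code values that are not NA and that are present in cols.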
cols <- sub("_.*", "", names(df))
inds <- which(!is.na(df$code) & df$code %in% cols)
df[cbind(inds, match(df$code[inds],cols))] <- 0
df
# client CN_SCORE VN_SCORE CS_SCORE code
#1 123 0.0000000 -0.23627957 -1.3108015 CN
#2 124 -1.0959963 -0.19717589 1.9972134 <NA>
#3 125 0.0377884 0.00000000 0.6007088 VN
#4 126 0.3104807 0.08473729 0.0000000 CS
#5 127 0.4365235 0.75405379 -0.6111659 PO
#6 128 -0.4583653 -0.49929202 0.0000000 CS
#7 129 -1.0633261 0.21444531 2.1988103 <NA>
#8 130 1.2631852 -0.32468591 1.3124130 BE
#9 131 -0.3496504 0.00000000 -0.2651451 VN
#10 132 0.0000000 -0.89536336 0.5431941 CN
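The key step is indexing with a two-column matrix, where each row of the index is one (row, column) pair. A minimal sketch on a made-up matrix m, unrelated to the question's data:

```r
m <- matrix(1:12, nrow = 3)  # 3 rows, 4 columns, filled column by column

# Each row of idx addresses one cell: (row 1, col 2) and (row 3, col 4)
idx <- cbind(c(1, 3), c(2, 4))

picked <- m[idx]  # values of exactly those two cells
m[idx] <- 0L      # overwrite only those two cells
m
```

In the answer above, inds plays the role of the first column of idx and the match() result plays the role of the second.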
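Answer 3: (score: 1)
If you vectorize grepl over its pattern argument, you get a matrix that is TRUE where the first part of a column name is found in code and FALSE otherwise. Negating that matrix and multiplying it by the original columns gives the desired result.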
cols <- grep('_SCORE', names(df), value = TRUE)
df[cols] <- df[cols]*!Vectorize(grepl, 'pattern')(substr(cols, 1, 2), df$code)
df
# client CN_SCORE VN_SCORE CS_SCORE code
# 1 123 0.0000000 -0.1119434 1.02750890 CN
# 2 124 0.3511996 0.2970757 1.11384814 <NA>
# 3 125 -0.1495255 0.0000000 1.29628327 VN
# 4 126 -0.3645585 0.3932262 0.00000000 CS
# 5 127 -0.2272243 1.4857947 -2.12265618 PO
# 6 128 -0.1615514 0.1449268 0.00000000 CS
# 7 129 0.5020869 1.6921847 0.01622139 <NA>
# 8 130 0.6160465 0.4361738 1.62195307 BE
# 9 131 -2.8887592 0.0000000 -0.68922501 VN
# 10 132 0.0000000 -0.5525893 -0.13748636 CN
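Note that grepl does pattern matching, so a pattern such as CN would also match a longer code that merely contains CN. If the codes must equal the prefixes exactly, the same kind of mask can be built with outer() and ==. This is a sketch under that assumption, not the answer's own method. The four-row df, the seed, and the names orig, prefixes, mask and scores are made up for illustration.

```r
set.seed(42)  # hypothetical seed, only to make the sketch reproducible
df <- data.frame(
  client = c("123", "124", "125", "126"),
  CN_SCORE = rnorm(4), VN_SCORE = rnorm(4), CS_SCORE = rnorm(4),
  code = c("CN", NA, "VN", "PO"),
  stringsAsFactors = FALSE
)
orig <- df  # copy kept only to compare before and after

cols     <- grep("_SCORE$", names(df), value = TRUE)
prefixes <- sub("_SCORE$", "", cols)

# mask[i, j] is TRUE when the code in row i equals the prefix of score column j
mask <- outer(df$code, prefixes, "==")
mask[is.na(mask)] <- FALSE  # rows with a missing code never match

scores <- as.matrix(df[cols])
scores[mask] <- 0           # zero out the matching cells
df[cols] <- as.data.frame(scores)
df
```

Assigning through the mask, rather than multiplying by it, also avoids turning a non-finite score into NaN.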