Change values based on a match between the column name and the value in a second column

Date: 2019-09-16 12:05:09

Tags: r

I have some data:

df <- data.frame(client = c('123','124','125','126','127','128','129','130','131','132'),
                 CN_SCORE = rnorm(10), VN_SCORE = rnorm(10), CS_SCORE = rnorm(10),
                 code = c('CN',NA,'VN','CS','PO','CS',NA,'BE','VN','CN'))

which looks like this:

   client   CN_SCORE   VN_SCORE   CS_SCORE code
1     123 -0.5068107 -0.3046385  0.1605428   CN
2     124  1.3479882  1.0065622 -1.9616174 <NA>
3     125 -0.6053786 -1.7545071 -0.2966574   VN
4     126  0.5240396  0.2735298  1.8139150   CS
5     127  1.3968190  0.3687705 -0.2310896   PO
6     128  0.8715533  0.6128183 -0.7857413   CS
7     129  0.9773130  0.3007104  0.1753607 <NA>
8     130  0.3931267  1.4056442 -1.8190026   BE
9     131  1.1310017  0.9495555 -0.1323718   VN
10    132 -0.3564904  0.2727310  1.5854258   CN

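I need to set the value of a *_SCORE column to zero whenever the value of the code column on that row matches the first part of that column's name (the part before the underscore). Rows where code is NA, or where code matches no column, should be left unchanged. The data should look like this: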

   client    CN_SCORE   VN_SCORE   CS_SCORE code
1     123  0.00000000 -1.0634683 -0.1879564   CN
2     124 -0.07422132  1.0110481 -1.1999992 <NA>
3     125 -0.82198648  0.0000000  0.6195473   VN
4     126  1.50037013  0.9809830  0.0000000   CS
5     127  0.95236148  0.8130459  0.3088777   PO
6     128 -0.44263511  1.7038295  0.0000000   CS
7     129 -0.36307930 -0.5400340  0.5164958 <NA>
8     130  0.74714432  1.2763654 -0.4331117   BE
9     131 -0.64397662  0.0000000 -0.1199963   VN
10    132  0.00000000  0.5815852  0.6068514   CN

My actual data has about 80 *_SCORE columns.

Is there a simple way to do this?

Thanks

4 Answers:

Answer 0 (score: 1)

One possibility involving dplyr and tidyr could be:

library(dplyr)
library(tidyr)

df %>%
 gather(var, val, -c(client, code)) %>%
 mutate(val = case_when(is.na(code) ~ val,
                        code == substr(var, 1, 2) ~ 0,
                        TRUE ~ val)) %>%
 spread(var, val)

   client code   CN_SCORE    CS_SCORE   VN_SCORE
1     123   CN  0.0000000  0.05969744 -1.2730816
2     124 <NA> -0.3966455 -0.03788638  0.8005320
3     125   VN -2.3405085 -0.74085810  0.0000000
4     126   CS -1.0002777  0.00000000  1.0621683
5     127   PO  0.5921431  0.11958964 -0.4922398
6     128   CS  1.5583560  0.00000000  0.6772933
7     129 <NA>  1.3697855  0.75409401  1.5662150
8     130   BE  0.1221992 -1.04877408 -0.1939984
9     131   VN  2.5151293  0.33135690  0.0000000
10    132   CN  0.0000000  2.25564140 -0.4702173
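
The same reshape, compare, reshape idea can also be sketched with pivot_longer() and pivot_wider(), which tidyr (version 1.0.0 and later) offers as the successors to gather() and spread(). This is only an illustration under assumptions: the small data frame below uses made-up fixed values instead of rnorm() so the result is reproducible, and the name res is hypothetical.

```r
library(dplyr)
library(tidyr)

# Small deterministic stand-in for the question's data (values are made up)
df <- data.frame(client   = c("123", "124", "125", "126"),
                 CN_SCORE = c(1.5, -0.2, 0.7, 2.1),
                 VN_SCORE = c(0.3, 1.1, -0.9, 0.4),
                 CS_SCORE = c(-1.0, 0.6, 0.8, -0.5),
                 code     = c("CN", NA, "VN", "PO"),
                 stringsAsFactors = FALSE)

res <- df %>%
  # Long format: one row per client / score-column pair
  pivot_longer(ends_with("_SCORE"), names_to = "var", values_to = "val") %>%
  # Zero the value when the column prefix equals code; NA codes never match
  mutate(val = if_else(!is.na(code) & code == sub("_.*", "", var), 0, val)) %>%
  # Back to wide format, one column per original score column
  pivot_wider(names_from = var, values_from = val)

res
```

Selecting columns with ends_with("_SCORE") means the roughly 80 score columns mentioned in the question would not have to be listed by hand.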

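Answer 1 (score: 1)

Here is an idea via dplyr and tidyr. We convert to long format, extract the first part of each column name with a simple regular expression, and compare it with code. Once that is done, we spread back to wide format: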

library(dplyr)
library(tidyr)

df %>% 
 gather(var, val, -c(client, code)) %>% 
 mutate(val = replace(val, sub('_.*', '', var) == code, 0)) %>% 
 spread(var, val)
   client code   CN_SCORE   CS_SCORE    VN_SCORE
1     123   CN  0.0000000  0.2828444 -0.75224398
2     124 <NA> -0.5815069 -0.1053807 -0.03881512
3     125   VN -0.4489411 -1.3682422  0.00000000
4     126   CS -2.4349032  0.0000000  0.75258368
5     127   PO -1.7483976  1.3793556 -0.59094268
6     128   CS  0.2732683  0.0000000 -0.98756547
7     129 <NA> -0.9394162 -1.5184852 -0.20126150
8     130   BE -0.8731287 -0.2340674 -0.68192984
9     131   VN  0.3726439  2.1826383  0.00000000
10    132   CN  0.0000000  3.0400324 -0.33033666
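
To see what the mutate() step does, here is a minimal sketch of the comparison on plain vectors. The vectors var, code and val are made-up stand-ins for the long-format columns, and prefix and hit are hypothetical helper names.

```r
# Made-up stand-ins for the long-format columns
var  <- c("CN_SCORE", "VN_SCORE", "CS_SCORE", "CN_SCORE")
code <- c("CN", "CN", NA, "VN")
val  <- c(1.5, 0.3, -1.0, 0.7)

# Drop the underscore and everything after it to get the column prefix
prefix <- sub("_.*", "", var)

# Element-wise comparison; the result is NA wherever code is NA
hit <- prefix == code

# replace() assigns 0 only where hit is TRUE; with a single replacement
# value, NA entries in the logical index are simply not selected
out <- replace(val, hit, 0)
out
```

Only the first element is zeroed here, because it is the only position where the prefix equals code; the NA code leaves its value untouched.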

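Answer 2 (score: 1)

Using base R, a vectorized approach is to build a row/column index matrix and use it to replace the values in the data frame. We remove everything after the underscore in the column names and match the code values against the result to get the column index. To get the row index, we find the non-NA values of code that are present in cols.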

cols <- sub("_.*", "", names(df))
inds <- which(!is.na(df$code) & df$code %in% cols)
df[cbind(inds, match(df$code[inds],cols))] <- 0

df
#   client   CN_SCORE    VN_SCORE   CS_SCORE code
#1     123  0.0000000 -0.23627957 -1.3108015   CN
#2     124 -1.0959963 -0.19717589  1.9972134 <NA>
#3     125  0.0377884  0.00000000  0.6007088   VN
#4     126  0.3104807  0.08473729  0.0000000   CS
#5     127  0.4365235  0.75405379 -0.6111659   PO
#6     128 -0.4583653 -0.49929202  0.0000000   CS
#7     129 -1.0633261  0.21444531  2.1988103 <NA>
#8     130  1.2631852 -0.32468591  1.3124130   BE
#9     131 -0.3496504  0.00000000 -0.2651451   VN
#10    132  0.0000000 -0.89536336  0.5431941   CN
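
The index matrix is the key step, so here is a small sketch that builds it separately and prints it before assigning. The data frame uses made-up fixed values, and idx is a hypothetical name for the two-column matrix that the answer constructs inline.

```r
# Small deterministic stand-in for the question's data (values are made up)
df <- data.frame(client   = c("123", "124", "125", "126"),
                 CN_SCORE = c(1.5, -0.2, 0.7, 2.1),
                 VN_SCORE = c(0.3, 1.1, -0.9, 0.4),
                 CS_SCORE = c(-1.0, 0.6, 0.8, -0.5),
                 code     = c("CN", NA, "VN", "PO"),
                 stringsAsFactors = FALSE)

# Column prefixes: "client" "CN" "VN" "CS" "code"
cols <- sub("_.*", "", names(df))

# Rows whose code is non-NA and names one of the prefixes (here rows 1 and 3)
inds <- which(!is.na(df$code) & df$code %in% cols)

# One (row, column) pair per cell to overwrite
idx <- cbind(row = inds, col = match(df$code[inds], cols))
idx

# A two-column numeric matrix indexes individual cells of the data frame
df[idx] <- 0
df
```

Each row of idx addresses exactly one cell, so the assignment touches one value per matching row regardless of how many score columns there are.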

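Answer 3 (score: 1)

If we vectorize grepl over its pattern argument, we get a matrix that is TRUE where the first part of the column name is found in code and FALSE otherwise. Negating the matrix and multiplying it by the original columns gives the desired result.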

cols <- grep('_SCORE', names(df), value = TRUE)
df[cols] <- df[cols]*!Vectorize(grepl, 'pattern')(substr(cols, 1, 2), df$code)

df
#    client   CN_SCORE   VN_SCORE    CS_SCORE code
# 1     123  0.0000000 -0.1119434  1.02750890   CN
# 2     124  0.3511996  0.2970757  1.11384814 <NA>
# 3     125 -0.1495255  0.0000000  1.29628327   VN
# 4     126 -0.3645585  0.3932262  0.00000000   CS
# 5     127 -0.2272243  1.4857947 -2.12265618   PO
# 6     128 -0.1615514  0.1449268  0.00000000   CS
# 7     129  0.5020869  1.6921847  0.01622139 <NA>
# 8     130  0.6160465  0.4361738  1.62195307   BE
# 9     131 -2.8887592  0.0000000 -0.68922501   VN
# 10    132  0.0000000 -0.5525893 -0.13748636   CN
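
A variation on the same mask idea, offered as a sketch rather than as this answer's method: build the logical matrix with outer() and an exact == comparison, then assign zero through the mask instead of multiplying. The data frame uses made-up fixed values, and prefixes and mask are hypothetical names.

```r
# Small deterministic stand-in for the question's data (values are made up)
df <- data.frame(client   = c("123", "124", "125", "126"),
                 CN_SCORE = c(1.5, -0.2, 0.7, 2.1),
                 VN_SCORE = c(0.3, 1.1, -0.9, 0.4),
                 CS_SCORE = c(-1.0, 0.6, 0.8, -0.5),
                 code     = c("CN", NA, "VN", "PO"),
                 stringsAsFactors = FALSE)

cols     <- grep("_SCORE$", names(df), value = TRUE)
prefixes <- sub("_SCORE$", "", cols)

# rows x score-columns matrix: TRUE where code equals the column prefix
mask <- outer(as.character(df$code), prefixes, "==")

# Comparisons against an NA code give NA; treat them as "no match"
mask[is.na(mask)] <- FALSE

# Overwrite only the matching cells
df[cols][mask] <- 0
df
```

Assigning through the mask rather than multiplying keeps NA or infinite scores in non-matching cells as they are and still zeroes matching cells that hold such values, since NA * 0 is NA and Inf * 0 is NaN in R. The == comparison also matches whole codes only, whereas grepl treats the prefix as a regular expression and matches it anywhere in code.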