How do I select just one variable from several that share the same prefix?

Time: 2019-04-03 14:32:15

Tags: r join

This follows on from my previous question, "How do I return multiple columns without consider Na values and group by other columns name in R?"

Mexico_01 <- c(1,2,5,1,NA,1)
Mexico_02 <- c(3,NA,2,0,4,1)
Argentina_01 <- c(2,1,5,2,NA,2)
Argentina_02 <- c(2,3,NA,2,2,2)
Italy <- c(NA,10,10,10,NA,10)
Spain_01 <- c(2,NA,4,6,8,11)
Spain_02 <- c(3,4,NA,11,11,11)
England <- c(5,NA,10,NA,NA,12)
Germany <- c(1,NA,NA,NA,NA,10)
Data_Risk <- data.frame(Mexico_01, Mexico_02, Argentina_01, Argentina_02,
                        Italy, Spain_01, Spain_02, England, Germany)

# load the packages before calling as.data.table()
library(data.table)
library(magrittr)

Data_Risk <- as.data.table(Data_Risk)
# one (row, col) pair for every non-missing cell
all_variable <- as.data.table(which(!is.na(Data_Risk), arr.ind = TRUE))
all_variable[, .(colnm = names(Data_Risk)[col],
                 col = paste0('var', order(col))), by = row] %>%
  dcast(row ~ col, value.var = 'colnm')

which gives

   row      var1         var2         var3         var4     var5     var6     var7
1:   1 Mexico_01    Mexico_02 Argentina_01 Argentina_02 Spain_01 Spain_02  England
2:   2 Mexico_01 Argentina_01 Argentina_02        Italy Spain_02     <NA>     <NA>
3:   3 Mexico_01    Mexico_02 Argentina_01        Italy Spain_01  England     <NA>
4:   4 Mexico_01    Mexico_02 Argentina_01 Argentina_02    Italy Spain_01 Spain_02
5:   5 Mexico_02 Argentina_02     Spain_01     Spain_02     <NA>     <NA>     <NA>
6:   6 Mexico_01    Mexico_02 Argentina_01 Argentina_02    Italy Spain_01 Spain_02
      var8    var9
1: Germany    <NA>
2:    <NA>    <NA>
3:    <NA>    <NA>
4:    <NA>    <NA>
5:    <NA>    <NA>
6: England Germany

In this case I only need one entry for all the variables that share a prefix. For example, instead of mexico_01 or mexico_02, select just mexico.
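To make the requirement concrete, here is a minimal base R sketch of collapsing suffixed names to their prefix. It assumes the suffix always has the form `_<digits>`, as in the column names above. The vector `cols` and the variable `countries` are illustrative names, not part of the original code.

```r
# Column names as they might appear in one row of the wide table above
cols <- c("Mexico_01", "Mexico_02", "Argentina_01", "Italy", "Spain_02")

# Drop a trailing "_<digits>" suffix; names without a suffix are left unchanged
prefix <- sub("_[0-9]+$", "", cols)

# Keep each prefix once, in order of first appearance
countries <- unique(prefix)
print(countries)
```

For this vector the result is "Mexico", "Argentina", "Italy", "Spain". The original capitalization is kept.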

So the final table should look like this:

var1      var2         var3       var4       var5       var6
mexico    argentina    england    germany    null       null
mexico    argentina    italy      null       null       null
mexico    argentina    italy      spain      england    null
mexico    argentina    italy      spain      null       null
spain     null         null       null       null       null
mexico    argentina    italy      spain      england    germany

1 answer:

Answer 0 (score: 0)

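We can split the column names with tstrsplit, find the indices of entries that are duplicated on the 'row' and 'V1' columns, set those elements of 'V1' to NA, and then do the dcast.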

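# split e.g. "Mexico_01" into prefix (V1) and suffix (V2); names without "_" get V2 = NA
out[, c("V1", "V2") := tstrsplit(colnm, "_")]
# indices of entries whose (row, V1) pair has already been seen
i1 <- out[, .I[duplicated(.SD)], .SDcols = c('row', 'V1')]
out[i1, V1 := NA_character_]
# within each row, move the non-NA prefixes to the front
out[, V1 := V1[order(is.na(V1))], by = row]
dcast(out, row ~ col, value.var = "V1")[, row := NULL][]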
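As a cross-check, the same idea can be sketched end to end without the intermediate `out` table. This is an alternative under stated assumptions, not the answer's own method: the prefix is everything before the first `_`, prefixes are kept in column order, and empty slots come out as `NA` rather than `null`. The names `idx`, `long`, `wide`, `prefix` and `var` are illustrative.

```r
library(data.table)

# Same data as in the question
Data_Risk <- data.table(
  Mexico_01    = c(1, 2, 5, 1, NA, 1),
  Mexico_02    = c(3, NA, 2, 0, 4, 1),
  Argentina_01 = c(2, 1, 5, 2, NA, 2),
  Argentina_02 = c(2, 3, NA, 2, 2, 2),
  Italy        = c(NA, 10, 10, 10, NA, 10),
  Spain_01     = c(2, NA, 4, 6, 8, 11),
  Spain_02     = c(3, 4, NA, 11, 11, 11),
  England      = c(5, NA, 10, NA, NA, 12),
  Germany      = c(1, NA, NA, NA, NA, 10)
)

# One (row, col) pair per non-missing cell
idx <- as.data.table(which(!is.na(Data_Risk), arr.ind = TRUE))

# Prefix = text before the first "_" (the whole name if there is no "_")
idx[, prefix := tstrsplit(names(Data_Risk)[col], "_", fixed = TRUE, keep = 1L)[[1L]]]

# Keep each prefix once per row, then number the survivors within the row
long <- unique(idx[, .(row, prefix)])
long[, var := paste0("var", seq_len(.N)), by = row]

# Back to wide form: one row per observation, one column per slot
wide <- dcast(long, row ~ var, value.var = "prefix")
print(wide)
```

Because duplicates are dropped before numbering, the result has only as many `var` columns as the largest number of distinct prefixes in any row. That is six here, so there are no all-`NA` trailing columns to remove afterwards.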

Data

out <-  all_variable [, .(colnm = names(Data_Risk)[col], 
         col = paste0('var',  order(col))) , by = row]