By matching

Time: 2016-08-11 19:52:53

Tags: r

I am having trouble rearranging some data.

The original data is:

structure(list(id = 1:3, artery.1 = structure(c(1L, 1L, 2L), .Label = c("a", 
"b"), class = "factor"), artery.2 = structure(c(1L, NA, 2L), .Label = c("b", 
"c"), class = "factor"), artery.3 = structure(c(1L, NA, 2L), .Label = c("c", 
"d"), class = "factor"), artery.4 = structure(c(NA, NA, 1L), .Label = "e", class = "factor"), artery.5 = structure(c(NA, NA, 1L), .Label = "f", class = "factor"), 
diameter.1 = c(3L, 2L, 1L), diameter.2 = c(2L, NA, 2L), diameter.3 = c(3L, 
NA, 3L), diameter.4 = c(NA, NA, 4L), diameter.5 = c(NA, NA, 
5L)), .Names = c("id", "artery.1", "artery.2", "artery.3", 
"artery.4", "artery.5", "diameter.1", "diameter.2", "diameter.3", 
"diameter.4", "diameter.5"), class = "data.frame", row.names = c(NA, 
-3L))

#   id artery.1 artery.2 artery.3 artery.4 artery.5 diameter.1 diameter.2 diameter.3 diameter.4 diameter.5
# 1  1        a        b        c     <NA>     <NA>          3          2          3         NA         NA
# 2  2        a     <NA>     <NA>     <NA>     <NA>          2         NA         NA         NA         NA
# 3  3        b        c        d        e        f          1          2          3          4          5

I would like to get to this:

structure(list(id = 1:3, a = c(3L, 2L, NA), b = c(2L, NA, 1L), 
c = c(3L, NA, 2L), d = c(NA, NA, 3L), e = c(NA, NA, 4L), 
f = c(NA, NA, 5L)), .Names = c("id", "a", "b", "c", "d", 
"e", "f"), class = "data.frame", row.names = c(NA, -3L))

#   id  a  b  c  d  e  f
# 1  1  3  2  3 NA NA NA
# 2  2  2 NA NA NA NA NA
# 3  3 NA  1  2  3  4  5

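Basically, a through f represent arteries, and the numeric values are the corresponding diameters. Each row represents one patient.

Is there a neat way to rearrange this data frame?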

4 Answers:

Answer 0: (score: 3)

Using the tidyr and dplyr packages.

library(dplyr)
library(tidyr)

new.df <- gather(df, variable, value, artery.1:diameter.5) %>% 
    separate(variable, c('variable', 'num')) %>% 
    spread(variable, value) %>% 
    subset(!is.na(artery)) %>%
    mutate(diameter = as.numeric(diameter)) %>% 
    select(-num) %>% 
    spread(artery, diameter)

Output:

  id  a  b  c  d  e  f
1  1  3  2  3 NA NA NA
2  2  2 NA NA NA NA NA
3  3 NA  1  2  3  4  5
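
In tidyr 1.0.0 and later, gather and spread are superseded by pivot_longer and pivot_wider. The following is a minimal sketch of the same idea with the newer functions, assuming tidyr 1.0.0 or later is installed. The data frame is rebuilt here with character columns instead of factors so that the example is self-contained; the names long and wide are illustrative only.

```r
library(tidyr)

# Same data as in the question, with character columns instead of factors
df <- data.frame(
  id = 1:3,
  artery.1 = c("a", "a", "b"),
  artery.2 = c("b", NA, "c"),
  artery.3 = c("c", NA, "d"),
  artery.4 = c(NA, NA, "e"),
  artery.5 = c(NA, NA, "f"),
  diameter.1 = c(3L, 2L, 1L),
  diameter.2 = c(2L, NA, 2L),
  diameter.3 = c(3L, NA, 3L),
  diameter.4 = c(NA, NA, 4L),
  diameter.5 = c(NA, NA, 5L),
  stringsAsFactors = FALSE
)

# Split each column name at the dot: the first part becomes a column
# (artery / diameter), the second part is the measurement number
long <- pivot_longer(df, cols = -id,
                     names_to = c(".value", "num"),
                     names_sep = "\\.")

# Drop rows without an artery, and drop the measurement number
long <- long[!is.na(long$artery), c("id", "artery", "diameter")]

# One column per artery, holding the matching diameter
wide <- pivot_wider(long, names_from = artery, values_from = diameter)
print(wide)
```

The result is a tibble rather than a plain data frame; as.data.frame(wide) converts it if needed.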

Answer 1: (score: 2)

Use a melt / dcast combination from data.table, with patterns to select the measure variables by regular expression inside melt:

library(data.table) #v>=1.9.6
dcast(melt(setDT(df), 
           id = "id", 
           measure = patterns("artery", "diameter")),
      id ~ value1, 
      sum, 
      value.var = "value2", 
      subset = .(!is.na(value2)), 
      fill = NA)
#    id  a  b  c  d  e  f
# 1:  1  3  2  3 NA NA NA
# 2:  2  2 NA NA NA NA NA
# 3:  3 NA  1  2  3  4  5

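As you can see, both melt and dcast are very flexible: you can use regular expressions, specify a subset, pass multiple functions, and specify how missing values are filled.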

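Answer 2: (score: 1)

You can use xtabs together with reshape from base R. Use the latter to convert the data to long format and the former to cross-tabulate it, summing diameter for each id / artery pair. Note that combinations with no data show up as 0 rather than NA: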

xtabs(diameter ~ id + artery, reshape(df, varying = 2:11, sep = '.', dir = "long"))

#   artery
#id  a b c d e f
#  1 3 2 3 0 0 0
#  2 2 0 0 0 0 0
#  3 0 1 2 3 4 5
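
If the result should be a data frame with NA in the empty cells, as in the desired output, the table can be converted afterwards. This is only a sketch under one strong assumption: no real diameter is 0, because a true zero cannot be told apart from an empty cell after xtabs. The data frame is rebuilt with character columns so that the example is self-contained, and the names long, tab and wide are illustrative.

```r
# Same data as in the question, with character columns instead of factors
df <- data.frame(
  id = 1:3,
  artery.1 = c("a", "a", "b"),
  artery.2 = c("b", NA, "c"),
  artery.3 = c("c", NA, "d"),
  artery.4 = c(NA, NA, "e"),
  artery.5 = c(NA, NA, "f"),
  diameter.1 = c(3L, 2L, 1L),
  diameter.2 = c(2L, NA, 2L),
  diameter.3 = c(3L, NA, 3L),
  diameter.4 = c(NA, NA, 4L),
  diameter.5 = c(NA, NA, 5L),
  stringsAsFactors = FALSE
)

# Long format, then cross-tabulate as in the answer above
long <- reshape(df, varying = 2:11, sep = ".", direction = "long")
tab <- xtabs(diameter ~ id + artery, long)

# Convert the table to a data frame, keeping its two-dimensional layout
wide <- as.data.frame.matrix(tab)

# ASSUMPTION: a 0 always means "no measurement", never a real diameter
wide[wide == 0] <- NA

# Restore id as a regular column (it is stored in the row names)
wide <- cbind(id = as.integer(rownames(wide)), wide)
print(wide)
```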

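Answer 3: (score: 1)

This can be done with two calls to reshape(). First, we convert artery and diameter to long format on id, then widen again using artery as the time variable. To avoid ending up with a column of NAs, we also have to subset out the NA values of artery in the intermediate frame.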

reshape(subset(reshape(df,dir='l',varying=setdiff(names(df),'id'),timevar=NULL),!is.na(artery)),dir='w',timevar='artery');
##     id diameter.a diameter.b diameter.c diameter.d diameter.e diameter.f
## 1.1  1          3          2          3         NA         NA         NA
## 2.1  2          2         NA         NA         NA         NA         NA
## 3.1  3         NA          1          2          3          4          5

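If desired, the diameter. prefix can be removed afterwards. However, one advantage of this solution is that it can preserve multiple column sets, which the xtabs() solution cannot. In that case, the prefix is essential for distinguishing the column sets.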
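
Removing the prefix could look like the sketch below. Here res is a hypothetical stand-in for the reshaped result, cut down to two artery columns to keep the example short.

```r
# Hypothetical stand-in for the result of the two reshape() calls
res <- data.frame(
  id = 1:3,
  diameter.a = c(3L, 2L, NA),
  diameter.b = c(2L, NA, 1L)
)

# Strip a leading "diameter." from every column name that has it
names(res) <- sub("^diameter\\.", "", names(res))
print(names(res))
```

Anchoring the pattern with ^ and escaping the dot means that only a literal leading "diameter." is removed, so the id column is left alone.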