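Split a single column into a multi-column binary matrix

Time: 2015-05-24 12:30:50

Tags: r binary

I have a large dataset in R in which several individuals are listed in a single column, one per row, so each site spans several rows.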

      ID Elevation Year Individual.code
1  Area1      11.0 2009              AA
2  Area1      11.0 2009              AB
3  Area3      79.5 2009              AA
4  Area3      79.5 2009              AC
5  Area3      79.5 2009              AD
6  Area5      57.5 2010              AE
7  Area5      57.5 2010              AB
8  Area7     975.0 2011              AA
9  Area7     975.0 2011              AB

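I would like to split "Individual.code" into a binary matrix (one 0/1 column per individual) without losing the remaining variables, i.e. ID, Elevation and Year: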

#     ID Elevation Year AA AB AC AD AE
#1 Area1      11.0 2009  1  1  0  0  0
#2 Area3      79.5 2009  1  0  1  1  0
#3 Area5      57.5 2010  0  1  0  0  1
#4 Area7     975.0 2011  1  1  0  0  0

3 answers:

Answer 0 (score: 1)

DF <- read.table(text = "      ID Elevation Year Individual.code
                 1  Area1      11.0 2009              AA
                 2  Area1      11.0 2009              AB
                 3  Area3      79.5 2009              AA
                 4  Area3      79.5 2009              AC
                 5  Area3      79.5 2009              AD
                 6  Area5      57.5 2010              AE
                 7  Area5      57.5 2010              AB
                 8  Area7     975.0 2011              AA
                 9  Area7     975.0 2011              AB", header = TRUE)

library(reshape2)
dcast(DF, ID + Elevation + Year ~ Individual.code, 
      fun.aggregate = function(x) as.integer(length(x) > 0))
#     ID Elevation Year AA AB AC AD AE
#1 Area1      11.0 2009  1  1  0  0  0
#2 Area3      79.5 2009  1  0  1  1  0
#3 Area5      57.5 2010  0  1  0  0  1
#4 Area7     975.0 2011  1  1  0  0  0
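
The custom fun.aggregate matters when the same ID/individual pair occurs on more than one row: without it, dcast falls back to length and returns counts rather than 0/1 flags. A minimal sketch of that case follows; DF2 is made-up data (not from the question) in which Area1/AA is duplicated, and value.var is passed explicitly only to silence the guessing message.

```r
library(reshape2)

# Hypothetical data: the Area1/AA row appears twice
DF2 <- data.frame(
  ID              = c("Area1", "Area1", "Area1", "Area3"),
  Elevation       = c(11, 11, 11, 79.5),
  Year            = c(2009L, 2009L, 2009L, 2009L),
  Individual.code = c("AA", "AA", "AB", "AC")
)

# Presence/absence: any number of matching rows collapses to 1
wide <- dcast(DF2, ID + Elevation + Year ~ Individual.code,
              value.var = "Individual.code",
              fun.aggregate = function(x) as.integer(length(x) > 0))

# For comparison: plain counts, where the duplicate shows up as 2
counts <- dcast(DF2, ID + Elevation + Year ~ Individual.code,
                value.var = "Individual.code",
                fun.aggregate = length)
```

Here wide keeps Area1/AA at 1, while counts reports 2 for the same cell.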

Answer 1 (score: 1)

Here is one approach:

dat <- read.table(text = "      ID Elevation Year Individual.code
1  Area1      11.0 2009              AA
2  Area1      11.0 2009              AB
3  Area3      79.5 2009              AA
4  Area3      79.5 2009              AC
5  Area3      79.5 2009              AD
6  Area5      57.5 2010              AE
7  Area5      57.5 2010              AB
8  Area7     975.0 2011              AA
9  Area7     975.0 2011              AB", header = TRUE)

if (!require("pacman")) install.packages("pacman"); library(pacman)
p_load(qdapTools, dplyr)

mtabulate(split(dat[["Individual.code"]], dat[["ID"]])) %>%
    matrix2df("ID") %>%
    left_join(distinct(select(dat, -Individual.code)), .)  

##      ID Elevation Year AA AB AC AD AE
## 1 Area1      11.0 2009  1  1  0  0  0
## 2 Area3      79.5 2009  1  0  1  1  0
## 3 Area5      57.5 2010  0  1  0  0  1
## 4 Area7     975.0 2011  1  1  0  0  0

Answer 2 (score: 1)

You can try dplyr/tidyr (using dat from the previous answer):

library(dplyr)
library(tidyr)
spread(dat, Individual.code, Individual.code) %>% 
                  mutate_each(funs((!is.na(.))+0L), AA:AE)   
#     ID Elevation Year AA AB AC AD AE
#1 Area1      11.0 2009  1  1  0  0  0
#2 Area3      79.5 2009  1  0  1  1  0
#3 Area5      57.5 2010  0  1  0  0  1
#4 Area7     975.0 2011  1  1  0  0  0
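
Since this answer was written, spread() has been superseded by pivot_wider() and mutate_each() has been deprecated. A rough equivalent with the newer verbs might look like the sketch below; it assumes tidyr 1.0.0 or later, and the helper column name present is an arbitrary choice for illustration, not something from the question.

```r
library(dplyr)
library(tidyr)

dat <- read.table(text = "ID Elevation Year Individual.code
Area1  11.0 2009 AA
Area1  11.0 2009 AB
Area3  79.5 2009 AA
Area3  79.5 2009 AC
Area3  79.5 2009 AD
Area5  57.5 2010 AE
Area5  57.5 2010 AB
Area7 975.0 2011 AA
Area7 975.0 2011 AB", header = TRUE)

wide <- dat %>%
  mutate(present = 1L) %>%                       # flag every observed pair
  pivot_wider(names_from  = Individual.code,     # one column per individual
              values_from = present,
              values_fill = list(present = 0L))  # unobserved pairs become 0
```

The result is a tibble with one row per ID/Elevation/Year combination and the columns AA to AE in order of first appearance. This sketch assumes each ID/individual pair occurs only once; duplicated pairs would need a values_fn as well.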

Or you can use reshape from base R:

 res <- reshape(cbind(dat, Col=1), idvar=c('ID', 'Elevation', 'Year'), 
           timevar='Individual.code', direction='wide')
 res[is.na(res)] <- 0
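
Note that reshape names the new columns Col.AA, Col.AB, and so on, and keeps the row names of the first row of each group. If the exact layout from the question is wanted, a small cleanup step can follow. This is a sketch that repeats the reshape call on the question's data so that it runs on its own:

```r
dat <- read.table(text = "ID Elevation Year Individual.code
Area1  11.0 2009 AA
Area1  11.0 2009 AB
Area3  79.5 2009 AA
Area3  79.5 2009 AC
Area3  79.5 2009 AD
Area5  57.5 2010 AE
Area5  57.5 2010 AB
Area7 975.0 2011 AA
Area7 975.0 2011 AB", header = TRUE)

# Same reshape call as above: the constant column Col marks presence
res <- reshape(cbind(dat, Col = 1), idvar = c("ID", "Elevation", "Year"),
               timevar = "Individual.code", direction = "wide")
res[is.na(res)] <- 0

# Drop the "Col." prefix and renumber the rows 1..n
names(res) <- sub("^Col\\.", "", names(res))
rownames(res) <- NULL
```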