Hi, I have a dataset that looks like this:
name1 a b c d
name2 a c e g i
name3 t j i m n z
dput output:
structure(c("name1", "name2", "name3", "a ", "a", "r ", "b", "c", "k ", "c", "e", "l", "d", "t", "o", "e", "j", "m", "", "k", "n"), .Dim = c(3L, 7L), .Dimnames = list(NULL, c("V1", "V2", "V3", "V4", "V5", "V6", "V7")))
I would like to convert it to a matrix like this:
a b c d e g i j m n t z
name1 1 1 1 1 0 0 0 0 0 0 0 0
name2 1 0 1 0 1 1 1 0 0 0 0 0
name3 0 0 0 0 0 0 1 1 1 1 1 1
How can I do this in R?
Answer 0 (score: 3)
## Assuming this is your starting data
dat <- read.table(text="name1 a b c d NA NA\nname2 a c e g i NA\nname3 t j i m n z")
rownames(dat) <- dat$V1
dat$V1 <- NULL
I am assuming your data looks something like the above.
## store the rownames
NM <- rownames(dat) # or NM <- c("name1", "name2", "name3")
## IMPORTANT. Make sure you have characters, not factors.
dat <- sapply(dat, as.character)
cols <- sort(unique(as.character(unlist(dat))))
results <- sapply(cols, function(cl) apply(dat, 1, `%in%`, x=cl))
results[] <- as.numeric(results)
rownames(results) <- NM
results
a b c d e g i j m n t z
name1 1 1 1 1 0 0 0 0 0 0 0 0
name2 1 0 1 0 1 1 1 0 0 0 0 0
name3 0 0 0 0 0 0 1 1 1 1 1 1
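The apply(dat, 1, `%in%`, x = cl) call in this answer relies on R's argument matching: because x is supplied by name, each row that apply() passes in fills the remaining argument of `%in%`, which is table. A minimal sketch on a toy matrix (the name toy and its values are made up for illustration):

```r
# Toy character matrix standing in for `dat` (hypothetical values)
toy <- matrix(c("a", "b",
                "c", "a"), nrow = 2, byrow = TRUE)

# x = "b" is matched by name, so each row becomes the `table` argument;
# every call is effectively "b" %in% <row>
by_apply <- apply(toy, 1, `%in%`, x = "b")

# The same membership test written out explicitly, row by row
by_hand <- c("b" %in% toy[1, ], "b" %in% toy[2, ])

identical(by_apply, by_hand)
```

Only the first row contains "b", so both vectors flag row 1 and not row 2.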
Answer 1 (score: 2)
Here is one way:
qw = function(s) unlist(strsplit(s,'[[:blank:]]+'))
name1 <- qw("a b c d")
name2 <- qw("a c e g i")
name3 <- qw("t j i m n z")
rows <- qw("name1 name2 name3")
cols <- sort(unique(c(name1,name2,name3)))
nr <- length(rows)
nc <- length(cols)
outmat <- matrix(0,nr,nc,dimnames=list(rows,cols))
for (i in rows){
outmat[i,get(i)] <- 1
}
# a b c d e g i j m n t z
# name1 1 1 1 1 0 0 0 0 0 0 0 0
# name2 1 0 1 0 1 1 1 0 0 0 0 0
# name3 0 0 0 0 0 0 1 1 1 1 1 1
The function qw is not required, but it makes your example easier to read.
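The loop in this answer looks up each vector by name with get(). A possible variation, sketched here under the assumption that the inputs can be kept in a named list (the name sets is hypothetical), indexes the list directly and avoids relying on variables in the workspace:

```r
# Same values as in the answer, held in a named list instead of three variables
sets <- list(
  name1 = c("a", "b", "c", "d"),
  name2 = c("a", "c", "e", "g", "i"),
  name3 = c("t", "j", "i", "m", "n", "z")
)

# Column labels: every distinct value, sorted
cols <- sort(unique(unlist(sets, use.names = FALSE)))

# Start from all zeros, labelled by set name and value
outmat <- matrix(0, nrow = length(sets), ncol = length(cols),
                 dimnames = list(names(sets), cols))

# For each set, switch on the columns named after its members
for (nm in names(sets)) {
  outmat[nm, sets[[nm]]] <- 1
}
outmat
```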
Answer 2 (score: 1)
You will get the best speed with matrix indexing. Here is an example (with comments, so you can see what is happening).
## Assuming this is your starting data
dat <- read.table(text="name1 a b c d NA NA\nname2 a c e g i NA\nname3 t j i m n z")
rownames(dat) <- dat$V1
dat$V1 <- NULL
## Convert the data.frame into a single character vector
A <- unlist(lapply(dat, as.character), use.names = FALSE)
## Identify the unique levels
levs <- sort(unique(na.omit(A)))
## Get the index position for the Row/Column combination
## that needs to be recoded as "1"
Rows <- rep(sequence(nrow(dat)), ncol(dat))
Cols <- match(A, levs)
## Create an empty matrix
m <- matrix(0, nrow = nrow(dat), ncol = length(levs),
dimnames = list(rownames(dat), levs))
## Use matrix indexing to replace the relevant values with 1
m[cbind(Rows, Cols)] <- 1L
m
# a b c d e g i j m n t z
# name1 1 1 1 1 0 0 0 0 0 0 0 0
# name2 1 0 1 0 1 1 1 0 0 0 0 0
# name3 0 0 0 0 0 0 1 1 1 1 1 1
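For readers unfamiliar with matrix indexing: when a matrix is subscripted with a two-column matrix, each row of that index is read as one (row, column) pair. A minimal sketch on a toy matrix (the names toy_m and idx, and the labels r1, r2, x, y, z, are made up for illustration):

```r
# A 2 x 3 matrix of zeros with hypothetical labels
toy_m <- matrix(0, nrow = 2, ncol = 3,
                dimnames = list(c("r1", "r2"), c("x", "y", "z")))

# Each row of idx addresses one cell: (1,3), (2,1) and (2,2)
idx <- cbind(c(1, 2, 2), c(3, 1, 2))

# A single assignment sets exactly those three cells
toy_m[idx] <- 1
toy_m
```

All target cells are set in one vectorised assignment rather than in a loop over rows or columns.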
I benchmarked Ricardo's answer, my data.table answer, and the matrix indexing answer after creating a 30000-row version of the initial data.frame. Here are the results:
dat2 <- dat ## A backup
dat <- do.call(rbind, replicate(10000, dat, simplify = FALSE))
dim(dat)
# [1] 30000 6
library(microbenchmark)
microbenchmark(AM(), AMDT(), RS(), times = 10)
# Unit: milliseconds
# expr min lq median uq max neval
# AM() 44.30915 56.21873 57.95815 86.1518 265.3053 10
# AMDT() 231.71928 245.64236 291.19601 376.8983 515.8216 10
# RS() 4414.01127 4698.47293 4731.72877 5484.6185 5726.8092 10
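Matrix indexing clearly wins, but given how compact the data.table syntax is for getting the job done, I would prefer that approach! Great work, @Arun, on porting Hadley's "reshape2" work to data.table!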
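The wrapper functions AM(), AMDT() and RS() used in the benchmark are not shown in this thread. As a rough sketch of how such a comparison could be set up with base R only, the two functions below (index_version and loop_version, both hypothetical names) wrap the matrix indexing and sapply approaches and are timed with system.time() on a smaller 3000-row copy; they are assumptions for illustration, not the original benchmark code:

```r
# Rebuild the starting data, then stack 1000 copies (3000 rows)
dat <- read.table(text = "name1 a b c d NA NA\nname2 a c e g i NA\nname3 t j i m n z",
                  row.names = 1, stringsAsFactors = FALSE)
big <- do.call(rbind, replicate(1000, dat, simplify = FALSE))

# Hypothetical wrapper around the matrix indexing approach
index_version <- function(d) {
  A <- unlist(lapply(d, as.character), use.names = FALSE)
  levs <- sort(unique(A[!is.na(A)]))
  rows <- rep(seq_len(nrow(d)), ncol(d))  # row of each element of A
  cols <- match(A, levs)                  # column of each element (NA if missing)
  keep <- !is.na(cols)                    # skip the NA padding
  m <- matrix(0, nrow = nrow(d), ncol = length(levs),
              dimnames = list(rownames(d), levs))
  m[cbind(rows[keep], cols[keep])] <- 1
  m
}

# Hypothetical wrapper around the sapply/apply approach
loop_version <- function(d) {
  chr <- sapply(d, as.character)
  levs <- sort(unique(as.vector(chr)))    # sort() drops NA
  # Adding 0 turns the logical result into numeric 0/1
  res <- sapply(levs, function(cl) apply(chr, 1, `%in%`, x = cl)) + 0
  rownames(res) <- rownames(d)
  res
}

system.time(a <- index_version(big))
system.time(b <- loop_version(big))
all.equal(a, b)  # both approaches should produce the same matrix
```

Timings depend on the machine and R version, so none are quoted here.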
Here is a "data.table" alternative. It requires at least version 1.8.11 of "data.table".
library(data.table)
library(reshape2)
packageVersion("data.table")
# [1] ‘1.8.11’
melt and cast your data.table:
DT <- data.table(dat, keep.rownames=TRUE)
dcast.data.table(melt(DT, id.vars="rn"), rn ~ value)
# Aggregate function missing, defaulting to 'length'
# rn NA a b c d e g i j m n t z
# 1: name1 0 1 1 1 1 0 0 0 0 0 0 0 0
# 2: name2 0 1 0 1 0 1 1 1 0 0 0 0 0
# 3: name3 0 0 0 0 0 0 0 1 1 1 1 1 1
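The output above keeps a column for NA and relies on the default aggregation. A possible variation, sketched here on the assumption that a recent "data.table" is installed (where melt and dcast have data.table methods, so "reshape2" is not needed), drops the missing values while melting and names the aggregation explicitly:

```r
library(data.table)

# Same starting data as in the answers, with the first column used as row names
dat <- read.table(text = "name1 a b c d NA NA\nname2 a c e g i NA\nname3 t j i m n z",
                  row.names = 1, stringsAsFactors = FALSE)
DT <- data.table(dat, keep.rownames = TRUE)

# na.rm = TRUE discards the NA padding, so no NA column is produced
long <- melt(DT, id.vars = "rn", na.rm = TRUE)

# Count occurrences of each value per row name
wide <- dcast(long, rn ~ value, value.var = "value", fun.aggregate = length)
wide
```

Passing value.var and fun.aggregate explicitly also avoids the "Aggregate function missing" message.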