I want to create dummy variables in this dataset:
DF<-structure(list(A = c(1, 2, 3, 4, 5), B = c("1,3,2", "2,1,3,6",
"3,2,5,1,7", "3,7,4,2,6,5", "4,10,7,3,5,6")), .Names = c("A", "B"),
row.names = c(NA, 5L), class = "data.frame")
> DF
A B
1 1 1,3,2
2 2 2,1,3,6
3 3 3,2,5,1,7
4 4 3,7,4,2,6,5
5 5 4,10,7,3,5,6
The desired output should look like this:
A 1 2 3 4 5 6 7 8 9 10
1 1 1 1 0 0 0 0 0 0 0
2 1 1 1 0 0 1 0 0 0 0
3 1 1 1 0 1 0 1 0 0 0
4 0 1 1 1 1 1 1 0 0 0
5 0 0 1 1 1 1 1 0 0 1
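Is there an efficient way to do something like this? I can use strsplit or ifelse. The original dataset is very large, with many rows (> 10k) and many distinct values in column B (> 15k). The function dummy from the package dummies does not work the way I need.
I also found a similar case: Splitting one column into multiple columns. But in my case the answers from the link above run really slowly, taking up to 15 minutes (on my Dell i7-2630QM, 8Gb, Win7 64 bit, R 2.15.3 64 bit).
Thanks in advance for your answers.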
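Answer 0 (score: 17)
The functions mentioned here have since been moved to a package called "splitstackshape", available on CRAN. The version on CRAN is considerably faster than the original version. Its speed should be similar to that of the direct for loop solution at the end of this answer. See @Ricardo's answer for detailed benchmarks.
Install it, and use concat.split.expanded to get the desired result: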
library(splitstackshape)
concat.split.expanded(DF, "B", fill = 0, drop = TRUE)
# A B_01 B_02 B_03 B_04 B_05 B_06 B_07 B_08 B_09 B_10
# 1 1 1 1 1 0 0 0 0 0 0 0
# 2 2 1 1 1 0 0 1 0 0 0 0
# 3 3 1 1 1 0 1 0 1 0 0 0
# 4 4 0 1 1 1 1 1 1 0 0 0
# 5 5 0 0 1 1 1 1 1 0 0 1
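Original post
Some time ago, I wrote a function that does not just this kind of splitting, but other kinds as well. The function, named concat.split(), can be found here.
The usage, for your example data, would be: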
## Keeping the original column
concat.split(DF, "B", structure="expanded")
# A B B_1 B_2 B_3 B_4 B_5 B_6 B_7 B_8 B_9 B_10
# 1 1 1,3,2 1 1 1 NA NA NA NA NA NA NA
# 2 2 2,1,3,6 1 1 1 NA NA 1 NA NA NA NA
# 3 3 3,2,5,1,7 1 1 1 NA 1 NA 1 NA NA NA
# 4 4 3,7,4,2,6,5 NA 1 1 1 1 1 1 NA NA NA
# 5 5 4,10,7,3,5,6 NA NA 1 1 1 1 1 NA NA 1
## Dropping the original column
concat.split(DF, "B", structure="expanded", drop.col=TRUE)
# A B_1 B_2 B_3 B_4 B_5 B_6 B_7 B_8 B_9 B_10
# 1 1 1 1 1 NA NA NA NA NA NA NA
# 2 2 1 1 1 NA NA 1 NA NA NA NA
# 3 3 1 1 1 NA 1 NA 1 NA NA NA
# 4 4 NA 1 1 1 1 1 1 NA NA NA
# 5 5 NA NA 1 1 1 1 1 NA NA 1
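Recoding NA to 0 has to be done manually. Perhaps I will update the function to add an option to do so, and at the same time implement one of the faster solutions below :)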
temp <- concat.split(DF, "B", structure="expanded", drop.col=TRUE)
temp[is.na(temp)] <- 0
temp
# A B_1 B_2 B_3 B_4 B_5 B_6 B_7 B_8 B_9 B_10
# 1 1 1 1 1 0 0 0 0 0 0 0
# 2 2 1 1 1 0 0 1 0 0 0 0
# 3 3 1 1 1 0 1 0 1 0 0 0
# 4 4 0 1 1 1 1 1 1 0 0 0
# 5 5 0 0 1 1 1 1 1 0 0 1
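Most of the overhead in the concat.split function probably comes from things like converting from a matrix to a data.frame, renaming the columns, and so on. The actual code that does the splitting is (gasp!) a for loop, but test it out and you will find that it performs quite well: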
b = strsplit(DF$B, ",")
ncol = max(as.numeric(unlist(b)))
temp = lapply(b, as.numeric)
## Set up an empty matrix
m = matrix(0, nrow = nrow(DF), ncol = ncol)
## Fill it in
for (i in 1:nrow(DF)) {
m[i, temp[[i]]] = 1
}
## View your result
m
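The loop leaves its result in a bare matrix `m`. The following is a minimal sketch of wrapping it in a function and binding the result back to column A, with zero-padded names similar to the splitstackshape output above. The helper name `dummy_matrix` and the `B_` prefix are illustrative assumptions, not part of the original answer.

```r
# Hypothetical helper (not from the original answer): the for loop in a function.
dummy_matrix <- function(x, sep = ",") {
  pieces <- lapply(strsplit(x, sep, fixed = TRUE), as.integer)
  m <- matrix(0L, nrow = length(x), ncol = max(unlist(pieces)))
  for (i in seq_along(pieces)) {
    m[i, pieces[[i]]] <- 1L          # set the listed columns of row i to 1
  }
  colnames(m) <- sprintf("B_%02d", seq_len(ncol(m)))
  m
}

DF <- data.frame(A = 1:5,
                 B = c("1,3,2", "2,1,3,6", "3,2,5,1,7",
                       "3,7,4,2,6,5", "4,10,7,3,5,6"),
                 stringsAsFactors = FALSE)

# Bind the id column and the indicator columns into one data.frame
out <- cbind(DF["A"], dummy_matrix(DF$B))
out
```

Keeping the indicator part as a matrix until the final `cbind` avoids the data.frame overhead mentioned above for as long as possible.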
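Answer 1 (score: 9)
Benchmarks added below.
Update 2: added benchmarks for @Ananda's solution. Wow, is it fast!!
Added benchmarks for an even larger dataset; @Ananda's solution pulls ahead by an even larger margin.
Original answer:
As shown below, on the OP's small data.frame KnownMax and UnknownMax outperform even the data.table solution. Although, I suspect that with 10e6+ rows the data.table solution would be the fastest. (Feel free to benchmark it by simply modifying the parameters at the very bottom of this post.)
KnownMax
If you know the maximum value in B, then you have a nice two-liner: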
maximum <- 10
results <- t(sapply(strsplit(DF$B, ","), `%in%`, x=1:maximum)) + 0
# [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
# [1,] 1 1 1 0 0 0 0 0 0 0
# [2,] 1 1 1 0 0 1 0 0 0 0
# [3,] 1 1 1 0 1 0 1 0 0 0
# [4,] 0 1 1 1 1 1 1 0 0 0
# [5,] 0 0 1 1 1 1 1 0 0 1
Three lines, if you want to name the columns and rows:
dimnames(results) <- list(seq(nrow(results)), seq(ncol(results)))
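The two-liner works because `sapply` passes each split row as the second argument (`table`) of `%in%`, while `x = 1:maximum` is fixed as the first. A small sketch of what is evaluated for a single row, under that reading:

```r
maximum <- 10
row1 <- strsplit("1,3,2", ",")[[1]]   # character vector "1" "3" "2"

# What sapply evaluates for this row: `%in%`(x = 1:maximum, table = row1)
hit <- 1:maximum %in% row1            # the integers are matched as strings
hit + 0                               # logical -> numeric 0/1
```

Each call yields one vector of length `maximum`. `sapply` stores these as columns, which is why the result is transposed with `t()`.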
UnknownMax
# if you do not know the maximum ahead of time:
splat <- strsplit(DF$B, ",")
maximum <- max(as.numeric(unlist(splat)))
t(sapply(splat, `%in%`, x=1:maximum)) + 0
DT
Per @dickoa's request, here is an option using data.table.
library(data.table)
DT <- data.table(DF)
DT.long <- DT[, list(vals=as.numeric(unlist(strsplit(B, ",")))), by=A]
cols <- DT.long[, max(vals)]
rows <- DT.long[, max(A)]
matrix(as.numeric(DT.long[, (1:cols) %in% vals, by=A]$V1), ncol=cols,
byrow=TRUE, dimnames=list(seq(rows), seq(cols)))
# 1 2 3 4 5 6 7 8 9 10
# 1 1 1 1 0 0 0 0 0 0 0
# 2 1 1 1 0 0 1 0 0 0 0
# 3 1 1 1 0 1 0 1 0 0 0
# 4 0 1 1 1 1 1 1 0 0 0
# 5 0 0 1 1 1 1 1 0 0 1
A similar setup can be done in base R as well.
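One way such a base R version might look is sketched below: build the same long format with `rep()` and `unlist()`, then cross-tabulate it with `table()`. This is an illustration only, not the code that was benchmarked, and the names `long`, `tab` and `res` are made up for the example.

```r
DF <- data.frame(A = 1:5,
                 B = c("1,3,2", "2,1,3,6", "3,2,5,1,7",
                       "3,7,4,2,6,5", "4,10,7,3,5,6"),
                 stringsAsFactors = FALSE)

splat <- strsplit(DF$B, ",", fixed = TRUE)

# Long format: one (A, value) pair per line
long <- data.frame(A    = rep(DF$A, lengths(splat)),
                   vals = as.integer(unlist(splat)))

# Cross-tabulate; explicit factor levels keep rows and columns that never occur
tab <- table(factor(long$A, levels = DF$A),
             factor(long$vals, levels = seq_len(max(long$vals))))

# Collapse counts to 0/1 in case a value is repeated within a row
res <- matrix(as.integer(tab > 0), nrow = nrow(tab), dimnames = dimnames(tab))
res
```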
===
Here are some benchmarks with slightly larger data:
microbenchmark(KnownMax = eval(KnownMax), UnknownMax = eval(UnknownMax),
DT.withAssign = eval(DT.withAssign),
DT.withOutAssign = eval(DT.withOutAssign),
lapply.Dickoa = eval(lapply.Dickoa), apply.SimonO101 = eval(apply.SimonO101),
forLoop.Ananda = eval(forLoop.Ananda), times=50L)
Using the OP's data.frame, where the result is 5 x 10:
Unit: microseconds
expr min lq median uq max neval
KnownMax 106.556 114.692 122.4915 129.406 6427.521 50
UnknownMax 114.470 122.561 128.9780 136.384 158.346 50
DT.withAssign 3000.777 3099.729 3198.8175 3291.284 10415.315 50
DT.withOutAssign 2637.023 2739.930 2814.0585 2903.904 9376.747 50
lapply.Dickoa 7031.791 7315.781 7438.6835 7634.647 14314.687 50
apply.SimonO101 430.350 465.074 487.9505 522.938 7568.442 50
forLoop.Ananda 81.415 91.027 99.7530 104.588 265.394 50
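Using a slightly larger data.frame (below), where the result is 1000 x 100.
lapply.Dickoa was removed, since my edit may have slowed it down and, as it stood, it was crashing.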
Unit: milliseconds
expr min lq median uq max neval
KnownMax 34.83210 35.59068 36.13330 38.15960 52.27746 50
UnknownMax 36.41766 37.17553 38.03075 47.71438 55.57009 50
DT.withAssign 31.95005 32.65798 33.73578 43.71493 50.05831 50
DT.withOutAssign 31.36063 32.08138 32.80728 35.32660 51.00037 50
apply.SimonO101 78.61677 91.72505 95.53592 103.36052 163.14346 50
forLoop.Ananda 13.61827 14.02197 14.18899 14.58777 26.42266 50
An even larger set, where the result is 10,000 x 600:
Unit: milliseconds
expr min lq median uq max neval
KnownMax 1583.5902 1631.6214 1658.6168 1724.9557 1902.3923 50
UnknownMax 1597.1215 1655.9634 1690.7550 1735.5913 1804.2156 50
DT.withAssign 586.4675 641.7206 660.7330 716.0100 1193.4806 50
DT.withOutAssign 587.0492 628.3731 666.3148 717.5575 776.2671 50
apply.SimonO101 1916.6589 1995.2851 2044.9553 2079.6754 2385.1028 50
forLoop.Ananda 163.4549 172.5627 182.6207 211.9153 315.0706 50
Using the following:
library(microbenchmark)
library(data.table)
KnownMax <- quote(t(sapply(strsplit(DF$B, ","), `%in%`, x=1:maximum)) + 0)
UnknownMax <- quote({ splat <- strsplit(DF$B, ","); maximum <- max(as.numeric(unlist(splat))); t(sapply(splat, `%in%`, x=1:maximum)) + 0})
DT.withAssign <- quote({DT <- data.table(DF); DT.long <- DT[, list(vals=as.numeric(unlist(strsplit(B, ",")))), by=A]; cols <- DT.long[, max(vals)]; rows <- DT.long[, max(A)] ; matrix(as.numeric(DT.long[, (1:cols) %in% vals, by=A]$V1), ncol=cols, byrow=TRUE, dimnames=list(seq(rows), seq(cols)))})
DT.withOutAssign <- quote({DT.long <- DT[, list(vals=as.numeric(unlist(strsplit(B, ",")))), by=A]; cols <- DT.long[, max(vals)]; rows <- DT.long[, max(A)] ; matrix(as.numeric(DT.long[, (1:cols) %in% vals, by=A]$V1), ncol=cols, byrow=TRUE, dimnames=list(seq(rows), seq(cols)))})
lapply.Dickoa <- quote({ tmp <- strsplit(DF$B, ","); label <- 1:max(as.numeric(unlist(tmp))); tmp <- lapply(tmp, function(x) as.data.frame(lapply(label, function(y) (x == y)))); unname(t(sapply(tmp, colSums))) })
apply.SimonO101 <- quote({cols <- 1:max( as.numeric( unlist(strsplit(DF$B,",")))); t(apply(DF["B"] , 1 , function(x) ifelse( cols %in% as.numeric( unlist( strsplit(x , ",") ) ) , 1 , 0 ) ) ) })
forLoop.Ananda <- quote({b = strsplit(DF$B, ","); ncol = max(as.numeric(unlist(b))); temp = lapply(b, as.numeric); m = matrix(0, nrow = nrow(DF), ncol = ncol) ; for (i in 1:nrow(DF)) { m[i, temp[[i]]] = 1 }; m })
# slightly modified @Dickoa's algorithm to allow for instances where B is only a single number.
# Instead of using `sapply(.)`, I used `as.data.frame(lapply(.))`; hopefully the simplification step in sapply takes about as long as `as.data.frame`
identical(eval(lapply.Dickoa), eval(UnknownMax))
identical(eval(lapply.Dickoa), unname(eval(apply.SimonO101)))
identical(eval(lapply.Dickoa), eval(KnownMax))
identical(unname(as.matrix(eval(DT.withAssign))), eval(KnownMax))
# ALL TRUE
This is what was used to create the sample data:
# larger data created as follows
set.seed(1)
maximum <- 600
rows <- 10000
DF <- data.frame(A=seq(rows), B=as.character(c(maximum, replicate(rows-1, paste(sample(maximum, sample(20), FALSE), collapse=",")))), stringsAsFactors=FALSE)
DT <- data.table(DF);
DT
Answer 2 (score: 5)
One way to do this using ifelse and strsplit (unless I misunderstood, and you do not want to use them?) is like this....
cols <- 1:max( as.numeric( unlist(strsplit(DF$B,","))))
df <- t(apply(DF["B"] , 1 , function(x) ifelse( cols %in% as.numeric( unlist( strsplit(x , ",") ) ) , 1 , 0 ) ) )
colnames(df) <- cols
df
# 1 2 3 4 5 6 7 8 9 10
#1 1 1 1 0 0 0 0 0 0 0
#2 1 1 1 0 0 1 0 0 0 0
#3 1 1 1 0 1 0 1 0 0 0
#4 0 1 1 1 1 1 1 0 0 0
#5 0 0 1 1 1 1 1 0 0 1
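The idea is that we take the vector of values found in the column of interest, find the max value, and create the vector 1:max(value). We then apply a function to each row to find out which elements of that vector occur among the row's values. We use ifelse to put a 1 where a value is present and a 0 where it is not. The vector we match against is a sequence, so the output columns are already sorted.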
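Since `cols %in% values` is already a logical vector, the `ifelse(..., 1, 0)` step amounts to a type conversion. A short sketch on the second row of the example data, assuming only 0/1 output is wanted:

```r
cols <- 1:10
vals <- as.numeric(unlist(strsplit("2,1,3,6", ",")))

present <- cols %in% vals          # TRUE where the column number occurs in the row
a <- ifelse(present, 1, 0)         # as in the answer
b <- as.numeric(present)           # direct conversion of the logical vector
identical(a, b)
```

The direct conversion skips the extra work `ifelse` does, which may matter when it is called once per row on a large data set.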
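Answer 3 (score: 4)
A little late to the game, but a different strategy uses the fact that a matrix can be indexed by another two-column matrix that specifies the row and column indices to update. So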
f2 <- function(DF) {
b <- strsplit(DF$B, ",", fixed=TRUE)
len <- vapply(b, length, integer(1)) # 'geometry'
b <- as.integer(unlist(b))
midx <- matrix(c(rep(seq_len(nrow(DF)), len), b), ncol=2)
m <- matrix(0L, nrow(DF), max(b))
m[midx] <- 1L
m
}
This uses strsplit(..., fixed=TRUE) and vapply for efficiency and type safety, and as.integer together with 0L and 1L because we really want integer rather than numeric return values.
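Matrix indexing by a two-column matrix can be seen in isolation on a toy example. Each row of the index matrix is read as one (row, column) pair. The names `m` and `idx` here are only for illustration.

```r
m <- matrix(0L, nrow = 2, ncol = 4)

# Each row of idx is one (row, column) position to update
idx <- cbind(c(1, 1, 2),
             c(1, 3, 4))

m[idx] <- 1L   # sets m[1, 1], m[1, 3] and m[2, 4] in a single vectorised step
m
```

In f2, the first column of the index matrix repeats each row number once per value in that row, and the second column holds the values themselves.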
For comparison, here is the original implementation from @AnandaMahto:
f0 <- function(DF) {
b = strsplit(DF$B, ",")
ncol = max(as.numeric(unlist(b)))
temp = lapply(b, as.numeric)
m = matrix(0, nrow = nrow(DF), ncol = ncol)
for (i in 1:nrow(DF)) {
m[i, temp[[i]]] = 1
}
m
}
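This can be made more efficient by using fixed=TRUE and avoiding the double coercion of b, and more robust by coercing to integer and using seq_len(nrow(DF)) to avoid the corner case of a 0-row DF: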
f1 <- function(DF) {
b = lapply(strsplit(DF$B, ",", fixed=TRUE), as.integer)
ncol = max(unlist(b))
m = matrix(0L, nrow = nrow(DF), ncol = ncol)
for (i in seq_len(nrow(DF)))
m[i, b[[i]]] = 1L
m
}
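The 0-row corner case comes from how `1:n` behaves when `n` is 0. A short sketch follows. It only illustrates the loop index; it does not claim that f1 handles empty input in every other respect.

```r
empty <- data.frame(A = numeric(0), B = character(0), stringsAsFactors = FALSE)

bad  <- 1:nrow(empty)          # counts down from 1 to 0, so a loop would run twice
good <- seq_len(nrow(empty))   # empty integer vector, so a loop never runs

bad
length(good)
```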
The for loop is a good candidate for compilation, so
library(compiler)
f1c <- cmpfun(f1)
Then comparing on the 10,000 x 600 data from @RicardoSaporta:
> library(microbenchmark)
> microbenchmark(f0(DF), f1(DF), f1c(DF), f2(DF))
Unit: milliseconds
expr min lq median uq max neval
f0(DF) 170.51388 180.25997 182.45772 188.23811 717.7511 100
f1(DF) 91.53578 97.14909 97.97195 100.24236 447.5900 100
f1c(DF) 79.39194 84.45712 85.71022 87.85763 411.8340 100
f2(DF) 76.45496 81.70307 82.50752 110.83620 398.6093 100
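Both the two-fold speed-up from f0 to f1 and the relative efficiency of the for loop were somewhat surprising to me. The column count in f1 could also be computed without unlisting, as ncol = max(vapply(b, max, integer(1))).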
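The two ways of finding the column count can be compared on a small list. This is a sketch that assumes `b` is a list of integer vectors, as it is in f1:

```r
b <- lapply(strsplit(c("1,3,2", "4,10,7"), ",", fixed = TRUE), as.integer)

via_unlist <- max(unlist(b))                    # flattens the whole list first
via_vapply <- max(vapply(b, max, integer(1)))   # one maximum per row, then the max of those

identical(via_unlist, via_vapply)
```

The `vapply` form avoids building one long flattened copy of all the values. Whether that makes a measurable difference was not benchmarked here.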
Answer 4 (score: 3)
I know there is already a good and quite efficient answer, but we can use another approach to get the same result.
tmp <- strsplit(DF$B, ",")
label <- 1:max(as.numeric(unlist(tmp)))
tmp <- lapply(tmp, function(x)
sapply(label, function(y) (x == y)))
t(sapply(tmp, colSums))
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,] 1 1 1 0 0 0 0 0 0 0
## [2,] 1 1 1 0 0 1 0 0 0 0
## [3,] 1 1 1 0 1 0 1 0 0 0
## [4,] 0 1 1 1 1 1 1 0 0 0
## [5,] 0 0 1 1 1 1 1 0 0 1
We can now benchmark it against @SimonO101's solution (fun2):
require(rbenchmark)
fun1 <- function(DF) {
tmp <- strsplit(DF$B, ",")
label <- 1:max(as.numeric(unlist(tmp)))
tmp <- lapply(tmp, function(x)
sapply(label, function(y) (x == y)))
t(sapply(tmp, colSums))
}
fun2 <- function(DF) {
cols <- 1:max( as.numeric( unlist(strsplit(DF$B,","))))
df <- t(apply(DF["B"] , 1 , function(x) ifelse( cols %in% as.numeric( unlist( strsplit(x , ",") ) ) , 1 , 0 ) ) )
colnames(df) <- cols
df
}
all.equal(fun1(DF),
fun2(DF),
check.attributes = FALSE)
## [1] TRUE
benchmark(fun1(DF),
fun2(DF),
order = "elapsed",
columns = c("test", "elapsed", "relative"),
replications = 5000)
## test elapsed relative
## 1 fun1(DF) 1.870 1.000
## 2 fun2(DF) 2.018 1.079
We can see that there is not much difference.
Suggested edit (RS):
# from:
tmp <- lapply(tmp, function(x)
sapply(label, function(y) (x == y)))
# to:
tmp <- lapply(tmp, function(x)
as.data.frame(lapply(label, function(y) (x == y))))
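The reason for the suggested edit shows up when a row of B holds a single number: `sapply` then simplifies to a plain vector instead of a matrix, and `colSums` has nothing two-dimensional to work on. A small sketch with made-up inputs `one` and `many`:

```r
label <- 1:5
many  <- c("1", "3", "2")   # a row of B with several values
one   <- "3"                # a row of B holding a single number

dim(sapply(label, function(y) many == y))   # a 3 x 5 logical matrix
dim(sapply(label, function(y) one == y))    # NULL: simplified to a plain vector

# The suggested form always yields a two-dimensional object
fixed <- as.data.frame(lapply(label, function(y) one == y))
sums  <- unname(colSums(fixed))
sums
```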
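Answer 5 (score: 3)
OK, this has been bugging me for a while, but I thought it would be a good use of Rcpp. So I wrote a little function to see if I could get something faster than @Ananda's amazing for loop solution. This solution seems to run roughly one and a half times as fast (medians of about 117 ms versus 182 ms in the benchmark at the end of this answer, using the larger sample dataset posted by @RicardoSaporta).
Note: I was attempting this more to teach myself how to use Rcpp and C++ than to provide a useful solution, but all the same...
Our .cpp file...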
#include <Rcpp.h>
#include <algorithm>
#include <sstream>
#include <string>
#include <vector>
using namespace Rcpp;
// [[Rcpp::export]]
NumericMatrix expandR(CharacterVector x) {
  int n = x.size();
  std::vector< std::vector<int> > out;   // one integer vector per input row
  int tmax = 0;
  for (int i = 0; i < n; ++i) {
    std::vector<int> vect;               // vector to hold the split values of row i
    std::string str = as<std::string>(x[i]);
    std::stringstream ss(str);
    int j = 0;
    while (ss >> j) {
      vect.push_back(j);                 // add integer to result vector
      if (ss.peek() == ',')              // split by ',' delim
        ss.ignore();
    }
    if (!vect.empty()) {                 // an empty row has no maximum to read
      int it = *std::max_element(vect.begin(), vect.end());
      if (it > tmax)
        tmax = it;                       // current max value
    }
    out.push_back(vect);
  }
  // Now we construct the matrix. tmax gives us number of columns, n is number of rows;
  NumericMatrix mat(n, tmax);
  for (int i = 0; i < n; ++i) {
    NumericMatrix::Row zzrow = mat(i, _);
    const std::vector<int>& vec = out[i];
    for (std::size_t j = 0; j < vec.size(); ++j) {
      zzrow[vec[j] - 1] = 1;             // don't forget R vs. C++ indexing
    }
  }
  return mat;
}
Using the nominal example from the OP, we can then just do...
require(Rcpp)
## source the function so it is available to use in R
sourceCpp("C:/path/to/file.cpp")
# Call it like any other R function
expandR(DF$B)
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,] 1 1 1 0 0 0 0 0 0 0
[2,] 1 1 1 0 0 1 0 0 0 0
[3,] 1 1 1 0 1 0 1 0 0 0
[4,] 0 1 1 1 1 1 1 0 0 0
[5,] 0 0 1 1 1 1 1 0 0 1
Using the larger dataset provided by @Ricardo and comparing with @Ananda's solution....
require(Rcpp)
require(data.table)
set.seed(1)
maximum <- 600
rows <- 10000
DF <- data.frame(A=seq(rows), B=as.character(c(maximum, replicate(rows-1, paste(sample(maximum, sample(20), FALSE), collapse=",")))), stringsAsFactors=FALSE)
DT <- data.table(DF);
DT
## source in our c code
sourceCpp("C:/Users/sohanlon/Desktop/expandR2.cpp")
forLoop.Ananda <- quote({b = strsplit(DF$B, ","); ncol = max(as.numeric(unlist(b))); temp = lapply(b, as.numeric); m = matrix(0, nrow = nrow(DF), ncol = ncol) ; for (i in 1:nrow(DF)) { m[i, temp[[i]]] = 1 }; m })
rcpp.Simon <- quote({mm = expandR( DT$B )})
require(microbenchmark)
microbenchmark( eval(forLoop.Ananda) , eval(rcpp.Simon) , times = 5L )
Unit: milliseconds
expr min lq median uq max neval
eval(forLoop.Ananda) 173.3024 178.6445 181.5881 218.9619 227.9490 5
eval(rcpp.Simon) 115.8309 116.3876 116.8125 119.1971 125.6504 5
Answer 6 (score: 0)
Not an especially fast solution, but it may be useful for those who prefer the tidyverse:
library(tidyverse)
DF %>%
mutate(B = str_split(B, fixed(","))) %>%
unnest(B) %>%
transmute(A,
var = as.numeric(B),
val = 1) %>%
complete(var = seq(min(var), max(var), 1), nesting(A)) %>%
spread(var, val, fill = 0)
A `1` `2` `3` `4` `5` `6` `7` `8` `9` `10`
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 1 1 1 0 0 0 0 0 0 0
2 2 1 1 1 0 0 1 0 0 0 0
3 3 1 1 1 0 1 0 1 0 0 0
4 4 0 1 1 1 1 1 1 0 0 0
5 5 0 0 1 1 1 1 1 0 0 1
To get more compact column names:
DF %>%
mutate(B = str_split(B, fixed(","))) %>%
unnest(B) %>%
transmute(A,
var = as.numeric(B),
val = 1) %>%
complete(var = seq(min(var), max(var), 1), nesting(A)) %>%
spread(var, val, fill = 0) %>%
rename_at(2:length(.), ~ paste0("Col", 1:length(.)))
A Col1 Col2 Col3 Col4 Col5 Col6 Col7 Col8 Col9 Col10
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 1 1 1 0 0 0 0 0 0 0
2 2 1 1 1 0 0 1 0 0 0 0
3 3 1 1 1 0 1 0 1 0 0 0
4 4 0 1 1 1 1 1 1 0 0 0
5 5 0 0 1 1 1 1 1 0 0 1