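I am trying to take one column of data (D) and turn each of its values into a new column header. I then need to put the corresponding "E" value into that new column. For example: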
A     B   C    D      E
Elm   1.1 Tree AB10_A 1
Oak   1.2 Tree AB10_A 1
Yew   1.3 Tree AB10_B 2
Maple 1.4 Tree AB10_B 1
Ash   1.5 Tree AB10_B 1
Elm   1.6 Tree AB10_C 1
Maple 1.7 Tree AB10_C 1
Ash   1.8 Tree AB10_D 3
Oak   1.9 Tree AB10_E 3
becomes:
A     B   C    AB10_A AB10_B AB10_C AB10_D AB10_E
Elm   1.1 Tree 1
Oak   1.2 Tree 1
Yew   1.3 Tree        2
Maple 1.4 Tree        1
Ash   1.5 Tree        1
Elm   1.6 Tree               1
Maple 1.7 Tree               1
Ash   1.8 Tree                      3
Oak   1.9 Tree                             3
My dataset is very large, with more than 2,000 unique values of D. Any suggestions are welcome! Sorry about the poorly formatted tables...
Answer 0 (score: 3)
You want something like this:
# your data
mydf <-
read.table(text=' A B C D E
Elm 1.1 Tree AB10_A 1
Oak 1.2 Tree AB10_A 1
Yew 1.3 Tree AB10_B 2
Maple 1.4 Tree AB10_B 1
Ash 1.5 Tree AB10_B 1
Elm 1.6 Tree AB10_C 1
Maple 1.7 Tree AB10_C 1
Ash 1.8 Tree AB10_D 3
Oak 1.9 Tree AB10_E 3', header=TRUE, stringsAsFactors=FALSE)
cbind(mydf, model.matrix(~0+D, data=mydf)*mydf$E)
A B C D E DAB10_A DAB10_B DAB10_C DAB10_D DAB10_E
1 Elm 1.1 Tree AB10_A 1 1 0 0 0 0
2 Oak 1.2 Tree AB10_A 1 1 0 0 0 0
3 Yew 1.3 Tree AB10_B 2 0 2 0 0 0
4 Maple 1.4 Tree AB10_B 1 0 1 0 0 0
5 Ash 1.5 Tree AB10_B 1 0 1 0 0 0
6 Elm 1.6 Tree AB10_C 1 0 0 1 0 0
7 Maple 1.7 Tree AB10_C 1 0 0 1 0 0
8 Ash 1.8 Tree AB10_D 3 0 0 0 3 0
9 Oak 1.9 Tree AB10_E 3 0 0 0 0 3
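Basically, model.matrix generates indicator variables corresponding to the unique values of a vector (or of several vectors). You then simply multiply that matrix by the E column to move the relevant values of E into the new dummy columns. Obviously you can rename these variables so that they are not prefixed with "D", but I think this is straightforward and not much of a hassle.
To see what is going on here, look at the output of the model.matrix part before we multiply by E and cbind the result:
> model.matrix(~0+D, data=mydf)
DAB10_A DAB10_B DAB10_C DAB10_D DAB10_E
1 1 0 0 0 0
2 1 0 0 0 0
3 0 1 0 0 0
4 0 1 0 0 0
5 0 1 0 0 0
6 0 0 1 0 0
7 0 0 1 0 0
8 0 0 0 1 0
9 0 0 0 0 1
attr(,"assign")
[1] 1 1 1 1 1
attr(,"contrasts")
attr(,"contrasts")$D
[1] "contr.treatment"
You can see the indicators. The key part is expressing the model as a formula object. In this case, D is converted into indicators. If you leave out the 0+ part, one level of D is treated as the baseline, as in a regression model:
> model.matrix(~D, data=mydf)
(Intercept) DAB10_B DAB10_C DAB10_D DAB10_E
1 1 0 0 0 0
2 1 0 0 0 0
3 1 1 0 0 0
4 1 1 0 0 0
5 1 1 0 0 0
6 1 0 1 0 0
7 1 0 1 0 0
8 1 0 0 1 0
9 1 0 0 0 1
attr(,"assign")
[1] 0 1 1 1 1
attr(,"contrasts")
attr(,"contrasts")$D
[1] "contr.treatment"
Some benchmarks were also run against Ananda's solution.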
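As for the renaming step, here is a minimal sketch, assuming the same small mydf as above. The names ind and out are introduced only for illustration. It strips the leading "D" that model.matrix prepends to each level and keeps only A, B and C from the original data, which is closer to the layout the question asks for (with 0 rather than blank in the unused cells).

```r
# Same example data as in the answer above
mydf <- read.table(text='A B C D E
Elm 1.1 Tree AB10_A 1
Oak 1.2 Tree AB10_A 1
Yew 1.3 Tree AB10_B 2
Maple 1.4 Tree AB10_B 1
Ash 1.5 Tree AB10_B 1
Elm 1.6 Tree AB10_C 1
Maple 1.7 Tree AB10_C 1
Ash 1.8 Tree AB10_D 3
Oak 1.9 Tree AB10_E 3', header=TRUE, stringsAsFactors=FALSE)

# Indicator matrix scaled by E: E is recycled down each column,
# so row i of the matrix is multiplied by E[i]
ind <- model.matrix(~0+D, data=mydf) * mydf$E

# model.matrix prefixes every level with the variable name.
# Assumption: the variable is called "D", so we strip one leading "D".
colnames(ind) <- sub("^D", "", colnames(ind))

# Keep only the identifying columns and append the new value columns
out <- cbind(mydf[c("A", "B", "C")], ind)
print(out)
```

Anchoring the pattern with ^ matters here: a plain sub("D", ...) would also hit a "D" inside a level name whenever the prefix is absent.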
Answer 1 (score: 3)
You can also use reshape (df here is the data frame from the question):
reshape(df, v.names="E", direction="wide", timevar="D", idvar=c("A", "B", "C"))
which produces:
A B C E.AB10_A E.AB10_B E.AB10_C E.AB10_D E.AB10_E
1 Elm 1.1 Tree 1 NA NA NA NA
2 Oak 1.2 Tree 1 NA NA NA NA
3 Yew 1.3 Tree NA 2 NA NA NA
4 Maple 1.4 Tree NA 1 NA NA NA
5 Ash 1.5 Tree NA 1 NA NA NA
6 Elm 1.6 Tree NA NA 1 NA NA
7 Maple 1.7 Tree NA NA 1 NA NA
8 Ash 1.8 Tree NA NA NA 3 NA
9 Oak 1.9 Tree NA NA NA NA 3
Or, using the reshape2 package:
dcast(df, A + B + C ~ D, value.var="E", fill="")
The rows come out in a different order, but the result is essentially the same and the call is simpler.
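The base reshape call can be tidied up a little further. The following is a minimal sketch, assuming the question's data is stored in a data frame called mydf (the name wide is only illustrative). It removes the "E." prefix that reshape adds to the new columns and, optionally, replaces the NA cells with 0.

```r
# Example data from the question
mydf <- read.table(text='A B C D E
Elm 1.1 Tree AB10_A 1
Oak 1.2 Tree AB10_A 1
Yew 1.3 Tree AB10_B 2
Maple 1.4 Tree AB10_B 1
Ash 1.5 Tree AB10_B 1
Elm 1.6 Tree AB10_C 1
Maple 1.7 Tree AB10_C 1
Ash 1.8 Tree AB10_D 3
Oak 1.9 Tree AB10_E 3', header=TRUE, stringsAsFactors=FALSE)

# Long-to-wide: one new column per unique value of D, filled from E
wide <- reshape(mydf, v.names = "E", direction = "wide",
                timevar = "D", idvar = c("A", "B", "C"))

# reshape names the new columns "<v.names>.<value of timevar>";
# drop the "E." prefix so the headers are the bare D values
names(wide) <- sub("^E\\.", "", names(wide))

# Optional: use 0 instead of NA for combinations that did not occur
wide[is.na(wide)] <- 0
print(wide)
```

Whether 0 or NA is the better filler depends on whether a true zero can occur in E. If it can, keeping NA avoids confusing "no value" with "value of zero".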
Answer 2 (score: 2)
I would actually also consider the following manual approach:
myFun <- function(indf, colvar = "D", valvar = "E", fill = 0) {
## Get the unique values in the "colvar" variable
X <- unique(indf[, colvar])
## Create an empty matrix preallocated with whatever you
## desire as the "fill" value
M <- matrix(fill, ncol = length(X), nrow = nrow(indf),
dimnames = list(NULL, X))
## Use matrix indexing to *quickly* replace values in the
## matrix with values from whichever column you specify
M[cbind(sequence(nrow(indf)), match(indf[, colvar], X))] <- indf[, valvar]
M
}
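The function above simply creates an empty matrix with as many columns as there are unique values in the column named by "colvar", and then fills the relevant cells of that matrix with the values from the column named by "valvar":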
cbind(mydf, myFun(mydf))
# A B C D E AB10_A AB10_B AB10_C AB10_D AB10_E
# 1 Elm 1.1 Tree AB10_A 1 1 0 0 0 0
# 2 Oak 1.2 Tree AB10_A 1 1 0 0 0 0
# 3 Yew 1.3 Tree AB10_B 2 0 2 0 0 0
# 4 Maple 1.4 Tree AB10_B 1 0 1 0 0 0
# 5 Ash 1.5 Tree AB10_B 1 0 1 0 0 0
# 6 Elm 1.6 Tree AB10_C 1 0 0 1 0 0
# 7 Maple 1.7 Tree AB10_C 1 0 0 1 0 0
# 8 Ash 1.8 Tree AB10_D 3 0 0 0 3 0
# 9 Oak 1.9 Tree AB10_E 3 0 0 0 0 3
The function above also performs quite well on larger datasets.
## 10K rows, 2K unique values in column "D"
set.seed(1)
bigDf <- data.frame(A = sample(LETTERS, 10000, TRUE),
B = sample(letters, 10000, TRUE),
C = "Tree",
D = sample(2000, 10000, TRUE),
E = sample(5, 10000, TRUE),
ID = 1:10000,
stringsAsFactors = FALSE)
system.time(myFun(bigDf))
# user system elapsed
# 0.303 0.056 0.371
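As a sanity check, the manual function can be compared with the model.matrix approach on the small example. This is only a sketch, and the names manual, viaMM, same and withNA are illustrative. It also shows the fill argument set to NA, which may be closer to the blank cells in the question. One caveat: myFun orders columns by first appearance, whereas model.matrix orders them by sorted factor level, so the columns are aligned by name before comparing.

```r
# The function from the answer above, repeated so the example is self-contained
myFun <- function(indf, colvar = "D", valvar = "E", fill = 0) {
  X <- unique(indf[, colvar])
  M <- matrix(fill, ncol = length(X), nrow = nrow(indf),
              dimnames = list(NULL, X))
  M[cbind(sequence(nrow(indf)), match(indf[, colvar], X))] <- indf[, valvar]
  M
}

mydf <- read.table(text='A B C D E
Elm 1.1 Tree AB10_A 1
Oak 1.2 Tree AB10_A 1
Yew 1.3 Tree AB10_B 2
Maple 1.4 Tree AB10_B 1
Ash 1.5 Tree AB10_B 1
Elm 1.6 Tree AB10_C 1
Maple 1.7 Tree AB10_C 1
Ash 1.8 Tree AB10_D 3
Oak 1.9 Tree AB10_E 3', header=TRUE, stringsAsFactors=FALSE)

manual <- myFun(mydf)
viaMM  <- model.matrix(~0+D, data = mydf) * mydf$E

# Align the manual columns to the sorted level order used by model.matrix,
# then compare cell by cell
same <- all(manual[, levels(factor(mydf$D))] == viaMM)
print(same)

# Same function, but leaving unused cells as NA instead of 0
withNA <- myFun(mydf, fill = NA)
print(withNA)
```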