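Advanced data manipulation in R?

Date: 2014-03-28 10:57:44

Tags: r dataset

I am trying to take one column of data (D) and turn each of its values into a new column header. I then need to put the corresponding "E" value into the new column. For example: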

A      B    C     D       E
Elm    1.1  Tree  AB10_A  1
Oak    1.2  Tree  AB10_A  1
Yew    1.3  Tree  AB10_B  2
Maple  1.4  Tree  AB10_B  1
Ash    1.5  Tree  AB10_B  1
Elm    1.6  Tree  AB10_C  1
Maple  1.7  Tree  AB10_C  1
Ash    1.8  Tree  AB10_D  3
Oak    1.9  Tree  AB10_E  3

becomes:

A      B    C     AB10_A  AB10_B  AB10_C  AB10_D  AB10_E
Elm    1.1  Tree  1
Oak    1.2  Tree  1
Yew    1.3  Tree          2
Maple  1.4  Tree          1
Ash    1.5  Tree          1
Elm    1.6  Tree                  1
Maple  1.7  Tree                  1
Ash    1.8  Tree                          3
Oak    1.9  Tree                                  3

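My dataset is very large, with more than 2000 unique D values. Any suggestions are welcome! Sorry that my tables are so terrible...

3 answers:

Answer 0 (score: 3)

You want something like this: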

# your data
mydf <- 
read.table(text=' A    B      C        D     E
Elm  1.1    Tree    AB10_A  1
Oak  1.2    Tree    AB10_A  1
Yew  1.3    Tree    AB10_B  2
Maple 1.4    Tree    AB10_B  1
Ash  1.5    Tree    AB10_B  1
Elm  1.6    Tree    AB10_C  1
Maple 1.7    Tree    AB10_C  1
Ash  1.8    Tree    AB10_D  3
Oak  1.9    Tree    AB10_E  3', header=TRUE, stringsAsFactors=FALSE)

cbind(mydf, model.matrix(~0+D, data=mydf)*mydf$E)

      A   B    C      D E DAB10_A DAB10_B DAB10_C DAB10_D DAB10_E
1   Elm 1.1 Tree AB10_A 1       1       0       0       0       0
2   Oak 1.2 Tree AB10_A 1       1       0       0       0       0
3   Yew 1.3 Tree AB10_B 2       0       2       0       0       0
4 Maple 1.4 Tree AB10_B 1       0       1       0       0       0
5   Ash 1.5 Tree AB10_B 1       0       1       0       0       0
6   Elm 1.6 Tree AB10_C 1       0       0       1       0       0
7 Maple 1.7 Tree AB10_C 1       0       0       1       0       0
8   Ash 1.8 Tree AB10_D 3       0       0       0       3       0
9   Oak 1.9 Tree AB10_E 3       0       0       0       0       3

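Basically, model.matrix generates indicator variables corresponding to the unique values of a vector (or of several vectors); you then multiply that matrix by the E column to move the relevant values of E into those new dummy columns. Obviously you could rename these variables so that they are not prefixed with "D", but I think this is simple enough and not that big a deal.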
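If the "D" prefix is unwanted, one way to drop it is to strip it from the column names after the multiplication. This is a minimal sketch, assuming a small made-up data frame `toy` (not the question's data) and using `sub()` for the rename; the name `ind` is also just illustrative.

```r
# Hypothetical toy data: D holds the future column names, E the values
toy <- data.frame(D = c("AB10_A", "AB10_B", "AB10_A"),
                  E = c(1, 2, 3),
                  stringsAsFactors = FALSE)

# Indicator matrix (one column per unique D) scaled row-wise by E
ind <- model.matrix(~ 0 + D, data = toy) * toy$E

# model.matrix prefixes each column with the variable name "D";
# remove only that leading character
colnames(ind) <- sub("^D", "", colnames(ind))

result <- cbind(toy, ind)
```

Cells that do not belong to a row's D value stay 0 with this approach, so a genuine 0 in E cannot be told apart from "no value".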

To understand what is happening here, look at the output of the model.matrix part before we multiply by E:

> model.matrix(~0+D, data=mydf)
  DAB10_A DAB10_B DAB10_C DAB10_D DAB10_E
1       1       0       0       0       0
2       1       0       0       0       0
3       0       1       0       0       0
4       0       1       0       0       0
5       0       1       0       0       0
6       0       0       1       0       0
7       0       0       1       0       0
8       0       0       0       1       0
9       0       0       0       0       1
attr(,"assign")
[1] 1 1 1 1 1
attr(,"contrasts")
attr(,"contrasts")$D
[1] "contr.treatment"

You can see the indicators. The key part is expressing the model as a formula object. In this case, D is converted to indicators. If you left out the 0+ part, one level of D would be treated as the baseline, as in a regression model:

> model.matrix(~D, data=mydf)
  (Intercept) DAB10_B DAB10_C DAB10_D DAB10_E
1           1       0       0       0       0
2           1       0       0       0       0
3           1       1       0       0       0
4           1       1       0       0       0
5           1       1       0       0       0
6           1       0       1       0       0
7           1       0       1       0       0
8           1       0       0       1       0
9           1       0       0       0       1
attr(,"assign")
[1] 0 1 1 1 1
attr(,"contrasts")
attr(,"contrasts")$D
[1] "contr.treatment"

I also ran some benchmarks against Ananda's solution.

Answer 1 (score: 3)

You can also use reshape (df here is the data frame from the question):

reshape(df, v.names="E", direction="wide", timevar="D", idvar=c("A", "B", "C"))

which produces:

      A   B    C E.AB10_A E.AB10_B E.AB10_C E.AB10_D E.AB10_E
1   Elm 1.1 Tree        1       NA       NA       NA       NA
2   Oak 1.2 Tree        1       NA       NA       NA       NA
3   Yew 1.3 Tree       NA        2       NA       NA       NA
4 Maple 1.4 Tree       NA        1       NA       NA       NA
5   Ash 1.5 Tree       NA        1       NA       NA       NA
6   Elm 1.6 Tree       NA       NA        1       NA       NA
7 Maple 1.7 Tree       NA       NA        1       NA       NA
8   Ash 1.8 Tree       NA       NA       NA        3       NA
9   Oak 1.9 Tree       NA       NA       NA       NA        3

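Or, using the reshape2 package:

library(reshape2)
dcast(df, A + B + C ~ D, value.var="E", fill="")

The rows come out in a different order, but the result is essentially the same and the call is simpler.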
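The reshape output above names its new columns E.AB10_A, E.AB10_B and so on, while the question asks for the bare D values as headers. A hedged sketch of one way to get there with base R only, assuming a three-row made-up data frame `small` in the question's layout:

```r
# Hypothetical three-row excerpt in the question's layout
small <- data.frame(A = c("Elm", "Oak", "Yew"),
                    B = c(1.1, 1.2, 1.3),
                    C = "Tree",
                    D = c("AB10_A", "AB10_A", "AB10_B"),
                    E = c(1, 1, 2),
                    stringsAsFactors = FALSE)

# Long to wide: one E.<value of D> column per unique D
wide <- reshape(small, v.names = "E", direction = "wide",
                timevar = "D", idvar = c("A", "B", "C"))

# Drop the "E." prefix that reshape adds to the new columns
names(wide) <- sub("^E\\.", "", names(wide))
```

Unlike the model.matrix approach, missing combinations come out as NA rather than 0 here.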

Answer 2 (score: 2)

I would actually also consider the following manual approach:

myFun <- function(indf, colvar = "D", valvar = "E", fill = 0) {

  ## Get the unique values in the "colvar" variable
  X <- unique(indf[, colvar])

  ## Create an empty matrix preallocated with whatever you
  ##   desire as the "fill" value
  M <- matrix(fill, ncol = length(X), nrow = nrow(indf), 
              dimnames = list(NULL, X))

  ## Use matrix indexing to *quickly* replace values in the
  ##   matrix with values from whichever column you specify
  M[cbind(sequence(nrow(indf)), match(indf[, colvar], X))] <- indf[, valvar]
  M
}

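The function above simply creates an empty matrix with as many columns as there are unique values in the column given by "colvar", and then fills the relevant cells of that matrix with the values from the column given by "valvar":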

cbind(mydf, myFun(mydf))
#       A   B    C      D E AB10_A AB10_B AB10_C AB10_D AB10_E
# 1   Elm 1.1 Tree AB10_A 1      1      0      0      0      0
# 2   Oak 1.2 Tree AB10_A 1      1      0      0      0      0
# 3   Yew 1.3 Tree AB10_B 2      0      2      0      0      0
# 4 Maple 1.4 Tree AB10_B 1      0      1      0      0      0
# 5   Ash 1.5 Tree AB10_B 1      0      1      0      0      0
# 6   Elm 1.6 Tree AB10_C 1      0      0      1      0      0
# 7 Maple 1.7 Tree AB10_C 1      0      0      1      0      0
# 8   Ash 1.8 Tree AB10_D 3      0      0      0      3      0
# 9   Oak 1.9 Tree AB10_E 3      0      0      0      0      3
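The step that does the work in myFun is indexing a matrix with a two-column matrix, where each row is a (row, column) pair. A small stand-alone illustration, with made-up names `M` and `idx` and made-up values:

```r
# 3 x 2 matrix pre-filled with 0
M <- matrix(0, nrow = 3, ncol = 2, dimnames = list(NULL, c("x", "y")))

# Each row of idx addresses one cell: (1, 1), (2, 2), (3, 1)
idx <- cbind(1:3, c(1, 2, 1))

# One vectorised assignment fills all three cells at once
M[idx] <- c(10, 20, 30)
```

After the assignment, row 1 and row 3 hold their values in column "x", row 2 holds its value in column "y", and every other cell keeps the fill value.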

The function above also performs reasonably well on larger datasets.

## 10K rows, 2K unique values in column "D"
set.seed(1)
bigDf <- data.frame(A = sample(LETTERS, 10000, TRUE),
                    B = sample(letters, 10000, TRUE),
                    C = "Tree",
                    D = sample(2000, 10000, TRUE),
                    E = sample(5, 10000, TRUE),
                    ID = 1:10000,
                    stringsAsFactors = FALSE)

system.time(myFun(bigDf))
#    user  system elapsed 
#   0.303   0.056   0.371
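To compare the manual approach with the model.matrix approach on your own machine, something along these lines could be used. It is only a sketch: the data is synthetic and smaller than bigDf, the names `d`, `lev`, `M` and `mm` are made up, and no timings are claimed here. Note that D is numeric in this data, so it is wrapped in factor(); otherwise model.matrix would treat it as a single numeric predictor rather than as categories.

```r
set.seed(1)
n <- 2000
d <- data.frame(D = sample(200, n, TRUE),
                E = sample(5, n, TRUE))

# Manual approach: preallocate, then fill by matrix indexing
lev <- sort(unique(d$D))
t_manual <- system.time({
  M <- matrix(0, nrow = n, ncol = length(lev),
              dimnames = list(NULL, as.character(lev)))
  M[cbind(seq_len(n), match(d$D, lev))] <- d$E
})

# model.matrix approach: indicators for factor(D), scaled by E
t_mm <- system.time({
  mm <- model.matrix(~ 0 + factor(D), data = d) * d$E
})

# Both should hold the same numbers, column for column
same <- isTRUE(all.equal(unname(M), unname(mm), check.attributes = FALSE))
```

Printing `t_manual` and `t_mm` shows the timings; `same` confirms that the two results agree before the speeds are compared.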