How do I assign a number to each unique element?

Time: 2015-05-01 19:15:20

Tags: r dataframe

I have a data frame like this:

No.     Alphabet
 1.       A
 2.       B
 3.       A
 4.       A
 5.       C                 
 6.       B
 7.       C

Now I want to add a new column, Outcome, that assigns a number to each unique element. The final table would be:

No.     Alphabet   Outcome
 1.       A           1
 2.       B           2
 3.       A           1
 4.       A           1    
 5.       C           3                     
 6.       B           2 
 7.       C           3

How can I achieve this in R?

3 Answers:

Answer 0 (score: 5)

You can use as.numeric(factor(.)), like this:

> Letter <- c("A", "A", "B", "C", "B", "A")
> as.numeric(factor(Letter))
[1] 1 1 2 3 2 1

You can assign the result to a column with the standard mydf$outcome <- etc, or with whichever method you prefer.
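As a minimal sketch of that assignment applied to the data from the question (the data frame name `mydf` and the column name `No` are assumptions, since the question does not name its data frame):

```r
# Rebuild the question's data frame; `mydf` and `No` are assumed names.
mydf <- data.frame(No = 1:7,
                   Alphabet = c("A", "B", "A", "A", "C", "B", "C"),
                   stringsAsFactors = FALSE)

# Convert to factor, then take the underlying level codes as numbers.
mydf$Outcome <- as.numeric(factor(mydf$Alphabet))

print(mydf$Outcome)
# [1] 1 2 1 1 3 2 3
```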

Answer 1 (score: 4)

You could also use data.table:

library(data.table)
setDT(df1)[, Outcome:= .GRP, Alphabet][]
#    No. Alphabet Outcome
#1:   1        A       1
#2:   2        B       2
#3:   3        A       1
#4:   4        A       1
#5:   5        C       3
#6:   6        B       2
#7:   7        C       3

Benchmarks

library(fastmatch)
set.seed(24)
df2 <- data.frame(No = 1:1e7, Alphabet= sample(LETTERS, 1e7, 
            replace=TRUE), stringsAsFactors=FALSE)
df3 <- copy(df2)
Ananda <- function() {transform(df2, 
             outcome = as.numeric(factor(df2$Alphabet)))}
Brodie <- function() {transform(df2, outcome=match(Alphabet, Alphabet))}
Brodie2 <- function(){transform(df2, outcome=fmatch(Alphabet, Alphabet))}

akrun <- function() {setDT(df3)[, Outcome:= .GRP, Alphabet][]}

library(microbenchmark)
microbenchmark(Ananda(), Brodie(), Brodie2(), akrun(), 
                    unit='relative', times=20L)
# Unit: relative
#    expr      min       lq     mean   median       uq      max neval cld
# Ananda() 4.957064 5.150724 4.427514 4.971581 3.336064 4.622502    20   c
# Brodie() 4.473689 5.074105 4.838985 5.383722 4.641304 4.383919    20   c
#Brodie2() 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000    20 a  
#  akrun() 1.609863 2.047646 1.665557 1.949590 1.331554 1.290921    20  b 


 system.time(akrun())
 #  user  system elapsed 
 # 0.197   0.005   0.202 

 system.time(Brodie2())
 #  user  system elapsed 
 # 0.081   0.014   0.095 
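One caveat when comparing the benchmarked functions: `match(x, x)` returns the position of each value's first occurrence, so it gives each unique element its own number, but not the consecutive 1, 2, 3 shown in the question. A small base R sketch on the question's letters (the variable names here are illustrative, not from the answer):

```r
x <- c("A", "B", "A", "A", "C", "B", "C")

# Position of the first occurrence of each value.
pos <- match(x, x)
print(pos)
# [1] 1 2 1 1 5 2 5

# Renumber those positions consecutively, in order of first appearance.
consecutive <- match(pos, unique(pos))
print(consecutive)
# [1] 1 2 1 1 3 2 3
```

If only distinct identifiers are needed, the first form is enough; the second step matters only when the numbers must run 1, 2, 3 without gaps.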

Answer 2 (score: 2)

Let's assume your data frame is named dat. Then you can do:

dat$Outcome <- as.numeric(as.factor(dat$Alphabet))
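Note that factor levels are sorted by default, so the factor approach numbers the elements in sorted order rather than in order of first appearance. The two happen to coincide for the question's data, but not in general. A short sketch with a vector that starts with "B" (the vector and variable names are illustrative assumptions), using `match(x, unique(x))` as one possible way to number by first appearance:

```r
x <- c("B", "A", "B", "C")

# Numbered by sorted factor level: A = 1, B = 2, C = 3.
by_level <- as.numeric(as.factor(x))
print(by_level)
# [1] 2 1 2 3

# Numbered by order of first appearance: B = 1, A = 2, C = 3.
by_first <- match(x, unique(x))
print(by_first)
# [1] 1 2 1 3
```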