I have a data.frame with 2 million rows. One column is an alphanumeric ID that repeats within the column, with a unique count of 300000.
>head(df$ID)
ID
AB00153232de
AB00153232de
AB00153232de
AB00155532gh
AB00155532gh
AB00158932ij
>df$ID<-factor(df$ID)
When I try to print this factor variable, I get something like this:
>df$ID
[1] AB00153232de AB00153232de AB00153232de AB00155532gh AB00155532gh AB00158932ij
320668 Levels: AB00153232de AB00155532gh AB00158932ij.....
Isn't the factor stored as a numeric vector? If so, why does it print like this?
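Answer 0 (score: 1)
Use unclass on the factor variable. It returns the underlying integer codes and keeps the factor levels as an attribute of the new variable, so you can still use them if you need them later.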
df1$ID
# [1] AB00153232de AB00153232de AB00153232de AB00155532gh AB00155532gh AB00158932ij
# Levels: AB00153232de AB00155532gh AB00158932ij
unclass(df1$ID)
# [1] 1 1 1 2 2 3
# attr(,"levels")
# [1] "AB00153232de" "AB00155532gh" "AB00158932ij"
Data:
df1 <- structure(list(ID = structure(c(1L, 1L, 1L, 2L, 2L, 3L),
.Label = c("AB00153232de", "AB00155532gh", "AB00158932ij"), class = "factor")),
.Names = "ID", row.names = c(NA, -6L), class = "data.frame")
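To make the point above concrete, here is a minimal sketch (the toy vector `ids` is a hypothetical stand-in for the question's column, not the real data). It checks that a factor is stored as integer codes and that the labels can be recovered from the `levels` attribute that unclass keeps.

```r
# Hypothetical toy data mirroring the ID column from the question
ids <- factor(c("AB00153232de", "AB00153232de", "AB00153232de",
                "AB00155532gh", "AB00155532gh", "AB00158932ij"))

# A factor is an integer vector with a "levels" attribute and class "factor"
storage <- typeof(ids)

# unclass() drops the class but keeps the levels attribute
codes <- unclass(ids)

# Map the integer codes back to their labels via the levels attribute
labels <- attr(codes, "levels")[codes]

print(storage)
print(identical(labels, as.character(ids)))
```

Printing a factor shows the labels only because the print method for factors looks them up from the levels; the stored values are still the integer codes.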
Answer 1 (score: 0)
Use as.integer(df$ID) instead.
Example:
R> ex <- as.factor(LETTERS)
R> ex
[1] A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
Levels: A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
R> str(ex)
Factor w/ 26 levels "A","B","C","D",..: 1 2 3 4 5 6 7 8 9 10 ...
R> as.integer(ex)
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26
R>
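Applied to a data frame, the same idea might look like the sketch below. The data frame `df_demo` and the column name `ID_code` are hypothetical names chosen for illustration. Unlike unclass, as.integer returns a plain integer vector with no levels attribute, so keep the original factor column around if you need to translate codes back to IDs.

```r
# Hypothetical small data frame standing in for the 2-million-row one
df_demo <- data.frame(
  ID = c("AB00153232de", "AB00153232de", "AB00155532gh", "AB00158932ij"),
  stringsAsFactors = FALSE
)
df_demo$ID <- factor(df_demo$ID)

# Store the integer codes in a new column (ID_code is an assumed name)
df_demo$ID_code <- as.integer(df_demo$ID)

# as.integer() strips all attributes, including the levels
has_no_attributes <- is.null(attributes(df_demo$ID_code))

# The codes can still be translated back using the factor's levels
recovered <- levels(df_demo$ID)[df_demo$ID_code]

print(df_demo)
print(has_no_attributes)
```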