The dataset contains three variables: id, sex, and grade (a factor).
mydata <- data.frame(id=c(1,1,1,1,1,2,2,2,2,3,3,3,3,4,4,4,4,4), sex=c(1,1,1,1,1,0,0,0,0,0,0,0,0,1,1,1,1,1),
grade=c("a","b","c","d","e", "x","y","y","x", "q","q","q","q", "a", "a", "a", NA, "b"))
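For each ID, I need to find out how many unique grades we have, and then create a new column (called N) that records this count. For example, for ID = 1 we have five unique values of "grade", so N = 5; for ID = 2 we have two unique values of "grade", so N = 2; for ID = 4 we have two unique values of "grade" (ignoring NA), so N = 2.
The final dataset is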
mydata <- data.frame(id=c(1,1,1,1,1,2,2,2,2,3,3,3,3,4,4,4,4,4), sex=c(1,1,1,1,1,0,0,0,0,0,0,0,0,1,1,1,1,1),
grade=c("a","b","c","d","e", "x","y","y","x", "q","q","q","q", "a", "a", "a", NA, "b"))
mydata$N <- c(5,5,5,5,5,2,2,2,2,1,1,1,1,2,2,2,2,2)
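The counting rule described above can be checked by hand for a single id. A minimal base R sketch, using the grades listed for id 4 in the question; the helper name `count_unique_non_na` is hypothetical and introduced here only for illustration:

```r
# Grades recorded for id 4 in the question's data (one value is missing)
grade_id4 <- c("a", "a", "a", NA, "b")

# Hypothetical helper: number of distinct non-missing values
count_unique_non_na <- function(x) length(unique(x[!is.na(x)]))

count_unique_non_na(grade_id4)   # 2: "a" and "b"
length(unique(grade_id4))        # 3: NA is counted as its own value
```

The second call shows why NA has to be dropped explicitly: `unique()` keeps NA as a distinct value.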
Answer 0 (score: 11)
New answer:
The uniqueN function from data.table has an na.rm argument, which we can use as follows:
library(data.table)
setDT(mydata)[, n := uniqueN(grade, na.rm = TRUE), by = id]
which gives:
> mydata
    id sex grade n
 1:  1   1     a 5
 2:  1   1     b 5
 3:  1   1     c 5
 4:  1   1     d 5
 5:  1   1     e 5
 6:  2   0     x 2
 7:  2   0     y 2
 8:  2   0     y 2
 9:  2   0     x 2
10:  3   0     q 1
11:  3   0     q 1
12:  3   0     q 1
13:  3   0     q 1
14:  4   1     a 2
15:  4   1     a 2
16:  4   1     a 2
17:  4   1    NA 2
18:  4   1     b 2
Old answer:
With data.table, you can do this as follows:
library(data.table)
setDT(mydata)[, n := uniqueN(grade[!is.na(grade)]), by = id]
Or:
setDT(mydata)[, n := uniqueN(na.omit(grade)), by = id]
Answer 1 (score: 9)
You can use the data.table package:
library(data.table)
setDT(mydata)
#I have removed NA's, up to you how to count them
mydata[,N_u:=length(unique(grade[!is.na(grade)])),by=id]
Very short, readable, and fast. It can also be done in base R:
#lapply(split(grade,id),...: splits data into subsets by id
#unlist: creates one vector out of multiple vectors
#rep: makes sure each ID is repeated enough times
mydata$N <- unlist(lapply(split(mydata$grade,mydata$id),function(x){
rep(length(unique(x[!is.na(x)])),length(x))
}
))
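One caveat about the base R version above: `split()` groups the values by id, so `unlist()` returns the counts in group order. That lines up with the rows only when the data are already sorted by id, as they are in the question. A sketch with a small made-up data frame `df` (not from the question) that uses base R's `unsplit()` to put the counts back in the original row order:

```r
# Made-up data with ids that are NOT sorted
df <- data.frame(id    = c(2, 1, 2, 1, 1),
                 grade = c("x", "a", "y", "b", "c"))

# Per-group counts, repeated once per row of the group
per_group <- lapply(split(df$grade, df$id), function(x) {
  rep(length(unique(x[!is.na(x)])), length(x))
})

# unlist() gives the counts in group order (id 1 first, then id 2)
n_group_order <- unname(unlist(per_group))

# unsplit() reverses split(), restoring the original row order
df$N <- unsplit(per_group, df$id)
df
```

Here `n_group_order` would be misaligned with the rows of `df`, while `df$N` holds 2 for every id 2 row and 3 for every id 1 row.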
Since there was some discussion about which approach is faster, let's do some benchmarking.
With the given dataset:
> test1
Unit: milliseconds
expr min lq mean median uq max neval cld
length_unique 3.043186 3.161732 3.422327 3.286436 3.477854 10.627030 100 b
uniqueN 2.481761 2.615190 2.763192 2.738354 2.872809 3.985393 100 a
With a larger dataset (10,000 observations, 1,000 ids):
> test2
Unit: milliseconds
expr min lq mean median uq max neval cld
length_unique 11.84123 24.47122 37.09234 30.34923 47.55632 97.63648 100 a
uniqueN 25.83680 50.70009 73.78757 62.33655 97.33934 210.97743 100 b
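The code behind these timings is not shown in the answer. The output format matches what the microbenchmark package prints, so a comparison along these lines could be set up as in the sketch below. The data frame `big`, the seed, and the way the ids and grades are generated are all assumptions made for illustration, not the original benchmark.

```r
library(data.table)
library(microbenchmark)

set.seed(1)      # arbitrary seed, only for reproducibility
n_obs <- 10000   # number of observations, as stated above
n_id  <- 1000    # number of ids, as stated above

# Assumed layout: 10 rows per id, grades drawn from letters plus NA
big <- data.table(
  id    = rep(seq_len(n_id), each = n_obs / n_id),
  grade = sample(c(letters, NA), n_obs, replace = TRUE)
)

# Both expressions return one count per id without modifying `big`
microbenchmark(
  length_unique = big[, length(unique(grade[!is.na(grade)])), by = id],
  uniqueN       = big[, uniqueN(grade, na.rm = TRUE), by = id],
  times = 100
)
```

Timings depend on the machine, the package versions, and the shape of the data, so the numbers from such a run should not be expected to match the tables above.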
Answer 2 (score: 7)
A dplyr option using dplyr::n_distinct and its na.rm argument:
library(dplyr)
mydata %>% group_by(id) %>% mutate(N = n_distinct(grade, na.rm = TRUE))
#Source: local data frame [18 x 4]
#Groups: id [4]
#
# id sex grade N
# (dbl) (dbl) (fctr) (int)
#1 1 1 a 5
#2 1 1 b 5
#3 1 1 c 5
#4 1 1 d 5
#5 1 1 e 5
#6 2 0 x 2
#7 2 0 y 2
#8 2 0 y 2
#9 2 0 x 2
#10 3 0 q 1
#11 3 0 q 1
#12 3 0 q 1
#13 3 0 q 1
#14 4 1 a 2
#15 4 1 a 2
#16 4 1 a 2
#17 4 1 NA 2
#18 4 1 b 2
Answer 3 (score: 5)
It looks like data.table has received several votes, but you can also use the base R function ave():
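# ave() returns a vector of the same type as its first argument (character here),
# so the counts are converted back to integers
mydata$N <- as.integer(ave(as.character(mydata$grade), mydata$id,
FUN = function(x) length(unique(x[!is.na(x)]))))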
Answer 4 (score: 4)
Using tapply and a lookup table:
mydata <- data.frame(id=c(1,1,1,1,1,2,2,2,2,3,3,3,3,4,4,4,4,4),
sex=c(1,1,1,1,1,0,0,0,0,0,0,0,0,1,1,1,1,1),
grade=c("a","b","c","d","e", "x","y","y","x", "q",
"q","q","q", "a", "a", "a", NA, "b"))
uniqN <- tapply(mydata$grade, mydata$id, function(x) sum(!is.na(unique(x))))
mydata$N <- uniqN[mydata$id]
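Note that the `tapply()` result is named by the group labels. Indexing it with the numeric ids works here only because the ids are 1, 2, 3, 4 and so coincide with the positions. A sketch with made-up data (not from the question) showing a lookup by name, which does not depend on that coincidence:

```r
# Made-up data whose ids are not 1, 2, 3, ...
df <- data.frame(id    = c(10, 10, 20, 20, 20),
                 grade = c("a", "b", "x", NA, "x"))

# One count per id; the result is named "10" and "20"
uniq_n <- tapply(df$grade, df$id, function(x) sum(!is.na(unique(x))))

# Positional lookup asks for elements 10 and 20, which do not exist
by_position <- uniq_n[df$id]

# Lookup by name finds the right group; as.vector() drops the names
df$N <- as.vector(uniq_n[as.character(df$id)])
df
```

With these ids, `by_position` contains only NA, while `df$N` holds 2 for the id 10 rows and 1 for the id 20 rows.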
Answer 5 (score: 0)
Here is a dplyr approach. For tidiness, I keep the summary table separate.
library(dplyr)
summary =
mydata %>%
distinct(id, grade) %>%
filter(!is.na(grade)) %>%
count(id)
mydata %>%
left_join(summary, by = "id")