I have the following data frame, which has 59 columns:
circleid name birthday 56 more...
1 1 1
2 2 10
2 5 68
2 1 10
1 1 1
The result I want:
circleid distinct_name distinct_birthday 56 more...
1 1 1
2 3 2
quiz <- read.csv("https://raw.githubusercontent.com/pranavn91/PhD/master/Expt/circles-removed-na.csv", header = T)
What I have so far:
ddply(quiz,~circleid,summarise,number_of_distinct_name=length(unique(name)))
This works for one column. How do I get the same for the full data frame? My attempt:
columns <- colnames(quiz)
for (i in 1:58)
{
final <- ddply(quiz,~circleid,summarise,number_of_distinct_name=length(unique(columns[i])))
}
Answer 0 (score: 1)
With data.table you can run:
library(data.table)
quiz <- fread("https://raw.githubusercontent.com/pranavn91/PhD/master/Expt/circles-removed-na.csv", header = T)
unique_vals <- quiz[, lapply(.SD, uniqueN), by = circleid]
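To see what this call returns without downloading the CSV, here is a minimal sketch on the five sample rows from the question. The name `quiz_small` is made up for illustration, and only three of the 59 columns are included.

```r
library(data.table)

# Toy data copied from the question (hypothetical name: quiz_small)
quiz_small <- data.table(
  circleid = c(1, 2, 2, 2, 1),
  name     = c(1, 2, 5, 1, 1),
  birthday = c(1, 10, 68, 10, 1)
)

# .SD holds every column except the grouping column;
# uniqueN(x) counts the distinct values in x
unique_vals <- quiz_small[, lapply(.SD, uniqueN), by = circleid]
print(unique_vals)
```

With `by`, groups appear in order of first occurrence; `keyby = circleid` would return them sorted by `circleid` instead.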
Answer 1 (score: 1)
With the dplyr package this is simple. My original answer used length(unique(.)), but @akrun pointed me to n_distinct(.) in the comments.
library(dplyr)
quiz %>%
group_by(circleid) %>%
summarise_all(n_distinct)
## A tibble: 2 x 3
#  circleid  name birthday
#     <int> <int>    <int>
#1        1     1        1
#2        2     3        2
Data.
quiz <- read.table(text = "
circleid name birthday
1 1 1
2 2 10
2 5 68
2 1 10
1 1 1
", header = TRUE)
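In dplyr 1.0.0 and later, the scoped verbs such as `summarise_all()` are superseded by `across()`. As a sketch of the same computation in that style, assuming dplyr >= 1.0.0 is installed (the names `quiz_small` and `result` are illustrative only):

```r
library(dplyr)

# Same toy data as above, rebuilt so this block runs on its own
quiz_small <- data.frame(
  circleid = c(1, 2, 2, 2, 1),
  name     = c(1, 2, 5, 1, 1),
  birthday = c(1, 10, 68, 10, 1)
)

# Inside across(), everything() selects all non-grouping columns
result <- quiz_small %>%
  group_by(circleid) %>%
  summarise(across(everything(), n_distinct))
print(result)
```

`summarise_all(n_distinct)` still works; the `across()` form is just the spelling the current dplyr documentation recommends.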
Answer 2 (score: 1)
You can use dplyr:
result<-quiz%>%
group_by(circleid)%>%
summarise_all(n_distinct)
A microbenchmark comparison of data.table and dplyr (quiz must be a data.table here, e.g. read with fread):
microbenchmark(x1=quiz[, lapply(.SD, function(x) length(unique(x))), by = circleid],
x2=quiz%>%
group_by(circleid)%>%
summarise_all(n_distinct),times=100)
Unit: milliseconds
expr min lq mean median uq max neval cld
x1 150.06392 155.02227 158.75775 156.49328 158.38887 224.22590 100 b
x2 41.07139 41.90953 42.95186 42.54135 43.97387 49.91495 100 a
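Note that the data.table expression timed above uses `length(unique(x))` rather than the `uniqueN` call from Answer 0, so the comparison may not reflect that answer's speed. A sketch of a three-way benchmark on synthetic data follows; the data sizes, the seed, and the names `df`, `dt`, and `bench` are assumptions for illustration, and timings will differ by machine and package version.

```r
library(data.table)
library(dplyr)
library(microbenchmark)

# Synthetic stand-in for the real file (assumed sizes, not the original data)
set.seed(1)
n <- 1e4
df <- data.frame(
  circleid = sample(1:50, n, replace = TRUE),
  name     = sample(1:500, n, replace = TRUE),
  birthday = sample(1:365, n, replace = TRUE)
)
dt <- as.data.table(df)

bench <- microbenchmark(
  dt_unique  = dt[, lapply(.SD, function(x) length(unique(x))), by = circleid],
  dt_uniqueN = dt[, lapply(.SD, uniqueN), by = circleid],
  dplyr_nd   = df %>% group_by(circleid) %>% summarise_all(n_distinct),
  times = 10
)
print(bench)
```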