I have this example: df.Journal.Conferences
venue author0 author1 author2 ... author19
A John Mary
B Peter Jacob Isabella
C Lia
B Jacob Lara John
C Mary
B Isabella
I want to know how many unique authors each venue has.
Expected result:
A 2
B 5
C 2
Edit: here is a link to my data: GoogleDrive Excel sheet.
Answer 0: (score: 0)
Because your data is hard to reproduce, I generated a similar "data set"; the approach below should carry over to yours.
set.seed(1984)
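df <- data.frame(id = sample(1:5, 10, replace = TRUE),
                 v1 = sample(letters[1:5], 10, replace = TRUE),
                 v2 = sample(letters[1:5], 10, replace = TRUE),
                 v3 = sample(letters[1:5], 10, replace = TRUE),
                 v4 = sample(letters[1:5], 10, replace = TRUE),
                 stringsAsFactors = FALSE)
# One row per distinct id; n will hold the number of unique values for that id.
z <- data.frame(id = unique(df$id), n = NA)
for (i in z$id) {
  # Pool every non-id column for this id and count the distinct values.
  z$n[z$id == i] <- length(unique(unlist(df[df$id == i, -1])))
}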
z
# id n
# 1 4 4
# 2 3 4
# 3 2 4
# 4 5 4
# 5 1 3
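The loop above can also be written without explicit iteration. The following is a minimal base R sketch of the same idea, not part of the original answer: it rebuilds the same kind of toy data, splits the value columns by `id`, and counts distinct values per group. The names `groups`, `n` and `z2` are made up for illustration, and no output is shown because the sampled values depend on the R version's random number generator.

```r
# Toy data in the same shape as above (one id column, four value columns).
set.seed(1984)
df <- data.frame(id = sample(1:5, 10, replace = TRUE),
                 v1 = sample(letters[1:5], 10, replace = TRUE),
                 v2 = sample(letters[1:5], 10, replace = TRUE),
                 v3 = sample(letters[1:5], 10, replace = TRUE),
                 v4 = sample(letters[1:5], 10, replace = TRUE),
                 stringsAsFactors = FALSE)

# Split the value columns (everything except id) into one data frame per id.
groups <- split(df[-1], df$id)

# For each group, flatten all cells into one vector and count distinct values.
n <- sapply(groups, function(g) length(unique(unlist(g))))

# Collect the result; rows come out sorted by id rather than by first appearance.
z2 <- data.frame(id = as.integer(names(n)), n = unname(n))
z2
```

Unlike the loop, this orders the rows by `id`, so the counts match the loop's `z` only after sorting.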
Answer 1: (score: 0)
Using dplyr and tidyr, reshape the data from wide to long, then group by venue and count the distinct names.
library(dplyr)
library(tidyr)
gather(df1, key = author, value = name, -venue) %>%
select(venue, name) %>%
group_by(venue) %>%
summarise(n = n_distinct(name, na.rm = TRUE))
# # A tibble: 3 × 2
# venue n
# <chr> <int>
# 1 A 2
# 2 B 5
# 3 C 2
# Data (define df1 before running the pipeline above)
df1 <- read.table(text ="
venue,author0,author1,author2
A,John,Mary,NA
B,Peter,Jacob,Isabella
C,Lia,NA,NA
B,Jacob,Lara,John
C,Mary,NA,NA
B,Isabella,NA,NA
", header = TRUE, sep = ",", stringsAsFactors = FALSE)
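In current versions of tidyr, `gather()` is superseded by `pivot_longer()`. As a sketch of the same wide-to-long approach with the newer function (the variable name `res` is made up here, and the sample data is re-declared so the snippet runs on its own):

```r
library(dplyr)
library(tidyr)

# Same sample data as in this answer; the literal "NA" cells are read as missing.
df1 <- read.csv(text = "venue,author0,author1,author2
A,John,Mary,NA
B,Peter,Jacob,Isabella
C,Lia,NA,NA
B,Jacob,Lara,John
C,Mary,NA,NA
B,Isabella,NA,NA", stringsAsFactors = FALSE)

res <- df1 %>%
  # One row per (venue, author slot); drop the empty slots right away.
  pivot_longer(-venue, names_to = "author", values_to = "name",
               values_drop_na = TRUE) %>%
  group_by(venue) %>%
  # Count distinct author names within each venue.
  summarise(n = n_distinct(name))

res
# venue A has 2 distinct authors, B has 5, C has 2
```

Dropping the missing values during the reshape means `n_distinct()` does not need its `na.rm` argument.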
Edit: save the Excel sheet as CSV and read it in with read.csv; the code above then returns the following output:
df1 <- read.csv("Journal_Conferences_Authors.csv", na.strings = "#N/A")
# output
# # A tibble: 427 × 2
# venue n
# <fctr> <int>
# 1 AAAI 4
# 2 ACC 4
# 3 ACIS-ICIS 5
# 4 ACM SIGSOFT Software Engineering Notes 1
# 5 ACM Southeast Regional Conference 5
# 6 ACM TIST 3
# 7 ACM Trans. Comput.-Hum. Interact. 3
# 8 ACML 2
# 9 ADMA 2
# 10 Advanced Visual Interfaces 3
# # ... with 417 more rows
Answer 2: (score: 0)
Tested with @zx8754's data, this code gives what you want (assuming the empty cells in your data frame are NA):
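One way this could look in base R is sketched below. It is an illustration under the same assumption (empty cells are `NA`) and is not necessarily the code this answer originally contained; the name `counts` is made up.

```r
# Sample data from the previous answer; "NA" cells are read as missing values.
df1 <- read.csv(text = "venue,author0,author1,author2
A,John,Mary,NA
B,Peter,Jacob,Isabella
C,Lia,NA,NA
B,Jacob,Lara,John
C,Mary,NA,NA
B,Isabella,NA,NA", stringsAsFactors = FALSE)

# Split the author columns by venue, then count distinct non-missing names.
counts <- sapply(split(df1[-1], df1$venue), function(g) {
  v <- unlist(g)               # flatten all author cells for this venue
  length(unique(v[!is.na(v)])) # ignore NA before counting
})

counts
# named vector: A = 2, B = 5, C = 2
```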