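I'm working on an entity disambiguation project. I have a data frame of authors who share the same name, with the following columns: author.ID and coauthor.names.
I need to find the number of collaborations between the author identified by each author ID and every coauthor he or she has worked with.
Here is a sample of the data frame: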
author.ID coauthor.names
1 J Smith, A Greer
1 J Adams, J Smith
2 D Richardson, J Smith
I would like the output to be:
author.ID coauthor.name collaboration.times
1 J Smith 2
1 J Adams 1
1 A Greer 1
2 D Richardson 1
2 J Smith 1
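I tried combining all of the coauthors (comma-separated) for a given author ID into one big string, on which I would then use str_count from the stringr package, but I don't know whether I'm approaching the problem the right way.
Is there a more efficient or more elegant way to solve this?
Thanks.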
Answer 0 (score: 3)
Assuming you are dealing with data like this:
mydf <- structure(list(author.ID = c(1L, 1L, 2L), coauthor.names = c("J Smith, A Greer",
"J Adams, J Smith", "D Richardson, J Smith")), .Names = c("author.ID",
"coauthor.names"), row.names = c(NA, 3L), class = "data.frame")
mydf
## author.ID coauthor.names
## 1 1 J Smith, A Greer
## 2 1 J Adams, J Smith
## 3 2 D Richardson, J Smith
...you can try cSplit from my "splitstackshape" package to reshape the data into long form (one coauthor per row), and then aggregate with .N from "data.table":
library(splitstackshape)
cSplit(mydf, "coauthor.names", ",", "long")[
  , list(collaboration.times = .N), .(author.ID, coauthor.names)][]
# author.ID coauthor.names collaboration.times
# 1: 1 J Smith 2
# 2: 1 A Greer 1
# 3: 1 J Adams 1
# 4: 2 D Richardson 1
# 5: 2 J Smith 1
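For readers who prefer not to add a package, the same split-then-count idea can be sketched in base R. This is only an illustration under the assumption that names are separated by commas and may carry surrounding whitespace; the names parts, long and counts are made up for the example and are not part of the original answer.

```r
# Same sample data as above, rebuilt so this block runs on its own.
mydf <- data.frame(
  author.ID = c(1L, 1L, 2L),
  coauthor.names = c("J Smith, A Greer", "J Adams, J Smith", "D Richardson, J Smith"),
  stringsAsFactors = FALSE
)

# Split each row on commas and trim the whitespace around every name.
parts <- lapply(strsplit(mydf$coauthor.names, ",", fixed = TRUE), trimws)

# Long form: one row per (author.ID, coauthor) pair.
long <- data.frame(
  author.ID = rep(mydf$author.ID, lengths(parts)),
  coauthor.name = unlist(parts),
  stringsAsFactors = FALSE
)

# aggregate() needs a column to summarise, so count by summing a column of ones.
long$collaboration.times <- 1L
counts <- aggregate(collaboration.times ~ author.ID + coauthor.name,
                    data = long, FUN = sum)

# Most frequent coauthors first within each author.
counts <- counts[order(counts$author.ID, -counts$collaboration.times), ]
print(counts)
```

On large data the data.table route above is likely to be faster; the base R version mainly makes the two steps (reshape to long, then count) explicit.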
Alternatively, assuming you are dealing with data like this, where coauthor.names is already a list column:
mydf2 <- structure(list(author.ID = c(1L, 1L, 2L), coauthor.names = structure(list(
c("J Smith", "A Greer"), c("J Adams", "J Smith"), c("D Richardson",
"J Smith")), class = "AsIs")), .Names = c("author.ID", "coauthor.names"
), row.names = c(NA, 3L), class = "data.frame")
mydf2
## author.ID coauthor.names
## 1 1 J Smith,....
## 2 1 J Adams,....
## 3 2 D Richar....
...you can start with listCol_l (again from "splitstackshape") to flatten the list column into long form, and then count in the same way.
listCol_l(mydf2, "coauthor.names")[
, list(collaboration.times = .N), .(author.ID, coauthor.names_ul)]
# author.ID coauthor.names_ul collaboration.times
# 1: 1 J Smith 2
# 2: 1 A Greer 1
# 3: 1 J Adams 1
# 4: 2 D Richardson 1
# 5: 2 J Smith 1
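The list-column case can also be sketched with "data.table" alone, by unlisting within each author.ID and then counting with .N. This is a hedged alternative rather than what listCol_l does internally; the table is rebuilt here as a plain data.table with an ordinary list column (not the AsIs column shown above), and the names dt, long and counts are illustrative.

```r
library(data.table)

# Sample data with coauthor.names stored as a list column (assumed layout).
dt <- data.table(
  author.ID = c(1L, 1L, 2L),
  coauthor.names = list(
    c("J Smith", "A Greer"),
    c("J Adams", "J Smith"),
    c("D Richardson", "J Smith")
  )
)

# Flatten: within each author.ID, unlist all coauthor vectors into one column.
long <- dt[, .(coauthor.name = unlist(coauthor.names)), by = author.ID]

# Count how often each (author.ID, coauthor.name) pair occurs.
counts <- long[, .(collaboration.times = .N), by = .(author.ID, coauthor.name)]
print(counts)
```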
A "tidyverse" equivalent might look something like this:
library(tidyverse)
# For a single character string as "coauthor.names"
mydf %>%
  mutate(coauthor.names = lapply(strsplit(coauthor.names, ","), trimws)) %>%
  unnest(coauthor.names) %>%  # recent tidyr versions expect the column to be named
  group_by(author.ID, coauthor.names) %>%
  summarise(collaboration.times = n())
# If "coauthor.names" is already a `list`.
mydf2 %>%
  unnest(coauthor.names) %>%  # recent tidyr versions expect the column to be named
  group_by(author.ID, coauthor.names) %>%
  summarise(collaboration.times = n())
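For the comma-separated string case, a shorter variant is possible with tidyr's separate_rows and dplyr's count. This is a sketch under the assumption that a comma followed by optional whitespace is the only delimiter; it is offered as an alternative spelling of the pipeline above, not as the answer's original method.

```r
library(dplyr)
library(tidyr)

# Same sample data as above, rebuilt so this block runs on its own.
mydf <- data.frame(
  author.ID = c(1L, 1L, 2L),
  coauthor.names = c("J Smith, A Greer", "J Adams, J Smith", "D Richardson, J Smith"),
  stringsAsFactors = FALSE
)

counts <- mydf %>%
  # One row per coauthor; the regex also swallows whitespace after the comma.
  separate_rows(coauthor.names, sep = ",\\s*") %>%
  # count() groups, tallies and ungroups in one step.
  count(author.ID, coauthor.names, name = "collaboration.times")

print(counts)
```

Because count() returns an ungrouped result, this also avoids the grouping message that summarise() prints in newer dplyr versions.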