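I have a dataset with 600,000 scientific papers from 2012 and 600,000 from 2014. I use article pairs (2014-2012) as my unit of analysis, for citation analysis and similar work.
I have a list of all article pairs with a citation link (from 2014 to 2012). What I want is, for every time a 2014 doc cites a 2012 doc (cit = 1), another 2012 doc that is not cited by that 2014 doc but comes from the same journal as the cited one.
Toy example: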
Citing <- data.frame(T2012=c("DOI1", "DOI2", "DOI3"),
S2014=c("DOIa", "DOIb", "DOIc"),
journal2012=c("Nature", "Science", "JoE"),
cit=c(1,1,1))
Docs2012 <- data.frame(T2012=c("DOI1", "DOI2", "DOI3", "DOI4", "DOI5", "DOI6",
"DOI7", "DOI8", "DOI9", "DOI10", "DOI11", "DOI12",
"DOI13"),
Journal=c("Nature", "Science", "JoE", "Nature", "Nature",
"JoE", "Science", "JoE", "Nature", "Science",
"Science", "JoE", "Science"))
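Now, for each cit = 1 row, I want to add 3 rows in which S2014 and journal2012 stay the same, cit = 0, and T2012 is a random DOI from the same journal as the cit = 1 case above. I have tried complicated loops to draw T2012, but given the size of my dataset they take days. This is the result I want: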
Citing_withcontrol <- data.frame(T2012=c("DOI1", "DOI5", "DOI4", "DOI9", "DOI2",
"DOI13", "DOI7", "DOI11", "DOI3", "DOI8",
"DOI6", "DOI12"),
S2014=c("DOIa", "DOIa", "DOIa", "DOIa",
"DOIb", "DOIb", "DOIb", "DOIb",
"DOIc", "DOIc", "DOIc", "DOIc"),
journal2012=c("Nature", "Nature", "Nature",
"Nature", "Science", "Science",
"Science", "Science", "JoE", "JoE",
"JoE", "JoE"),
cit=c(1,0,0,0,1,0,0,0,1,0,0,0))
Any help is much appreciated.
Answer 0 (score: 1)
Using dplyr (with zoo for na.locf):
library(dplyr)
merge(Docs2012 ,Citing, by.x = c('T2012', 'Journal'), by.y = c('T2012', 'journal2012'), all = TRUE)%>%
arrange(Journal, S2014) %>%
group_by(Journal) %>%
mutate(S2014 = zoo::na.locf(S2014), cit = replace(cit, is.na(cit), 0)) %>%
sample_n(4) %>%
arrange(S2014, Journal, desc(cit)) %>%
ungroup()
# A tibble: 12 × 4
# T2012 S2014 Journal cit
# <fctr> <fctr> <fctr> <dbl>
#1 DOI1 DOIa Nature 1
#2 DOI4 DOIa Nature 0
#3 DOI5 DOIa Nature 0
#4 DOI9 DOIa Nature 0
#5 DOI2 DOIb Science 1
#6 DOI10 DOIb Science 0
#7 DOI7 DOIb Science 0
#8 DOI11 DOIb Science 0
#9 DOI3 DOIc JoE 1
#10 DOI12 DOIc JoE 0
#11 DOI6 DOIc JoE 0
#12 DOI8 DOIc JoE 0
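Explanation
merge the two data frames on columns T2012 and Journal/journal2012.
arrange by Journal and S2014, then group_by Journal.
Fill the S2014 variable with the latest non-NA value (using zoo::na.locf), and replace all NA with 0 in the cit variable.
Use sample_n to sample as many rows per group as you need (4 in your case).
arrange and ungroup to get the desired output.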
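
One thing to watch with the pipeline above: sample_n(4) draws 4 rows at random from each journal group, so in a group with more than 4 rows (Science has 5 here) the cit = 1 row can be left out, and the na.locf fill assumes one citing document per journal. The following is a minimal base-R sketch of an alternative that always keeps the cited row and draws the controls separately. The helper name add_controls and its n_controls argument are hypothetical, made up for this illustration, and the approach has not been benchmarked on 600,000 rows.

```r
# Toy data from the question; stringsAsFactors = FALSE keeps DOIs as character.
Citing <- data.frame(T2012 = c("DOI1", "DOI2", "DOI3"),
                     S2014 = c("DOIa", "DOIb", "DOIc"),
                     journal2012 = c("Nature", "Science", "JoE"),
                     cit = c(1, 1, 1), stringsAsFactors = FALSE)
Docs2012 <- data.frame(T2012 = paste0("DOI", 1:13),
                       Journal = c("Nature", "Science", "JoE", "Nature", "Nature",
                                   "JoE", "Science", "JoE", "Nature", "Science",
                                   "Science", "JoE", "Science"),
                       stringsAsFactors = FALSE)

# Hypothetical helper (not from the original answer): for each cited pair,
# keep the cit = 1 row and add up to n_controls uncited DOIs from the same journal.
add_controls <- function(citing, docs, n_controls = 3) {
  # Coerce to character so name-based lookups work even if columns are factors.
  t2012   <- as.character(citing$T2012)
  s2014   <- as.character(citing$S2014)
  journal <- as.character(citing$journal2012)

  # Build both lookup tables once, outside the loop.
  pool     <- split(as.character(docs$T2012), as.character(docs$Journal))
  cited_by <- split(t2012, s2014)

  rows <- lapply(seq_len(nrow(citing)), function(i) {
    # Candidates: same journal, minus everything this 2014 doc cites.
    candidates <- setdiff(pool[[journal[i]]], cited_by[[s2014[i]]])
    k <- min(n_controls, length(candidates))
    # sample.int avoids the sample() surprise when only one candidate is left.
    controls <- candidates[sample.int(length(candidates), k)]
    data.frame(T2012 = c(t2012[i], controls),
               S2014 = s2014[i],
               journal2012 = journal[i],
               cit = c(1, rep(0, k)),
               stringsAsFactors = FALSE)
  })
  do.call(rbind, rows)
}

set.seed(1)  # only to make the random draw reproducible
Citing_withcontrol <- add_controls(Citing, Docs2012)
```

If a journal has fewer than n_controls uncited documents, this sketch returns as many controls as exist instead of failing. For the full dataset, binding many small data frames with rbind may itself become slow, so collecting the pieces as vectors or moving to a grouped join could be worth trying.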