Based on column values

Time: 2017-01-05 07:51:30

Tags: r conditional-statements

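I have a dataset with 600,000 scientific papers from 2012 and 600,000 from 2014. I use article pairs (2014-2012) as my unit of analysis, for citation analysis and similar work.

I have a list of all article pairs that have a citation link (from 2014 to 2012). What I want is, for each time a 2014 doc cites a 2012 doc (cit = 1), another 2012 doc that is not cited by that 2014 doc but comes from the same journal as the cited one.

Toy example: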

Citing <- data.frame(T2012=c("DOI1", "DOI2", "DOI3"), 
                     S2014=c("DOIa", "DOIb", "DOIc"), 
                     journal2012=c("Nature", "Science", "JoE"), 
                     cit=c(1,1,1))


Docs2012 <- data.frame(T2012=c("DOI1", "DOI2", "DOI3", "DOI4", "DOI5", "DOI6", 
                               "DOI7", "DOI8", "DOI9", "DOI10", "DOI11", "DOI12", 
                               "DOI13"), 
                      Journal=c("Nature", "Science", "JoE", "Nature", "Nature", 
                                "JoE", "Science", "JoE", "Nature", "Science", 
                                "Science", "JoE", "Science"))

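...Now I want to add 3 rows for each cit = 1 row, where S2014 and journal2012 stay the same, cit = 0, and T2012 is a random DOI from the same journal as the cit = 1 case above it. I have tried complicated loops to draw T2012, but given the size of my dataset they take days. This is the result I am after: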

Citing_withcontrol <- data.frame(T2012=c("DOI1", "DOI5", "DOI4", "DOI9", "DOI2",
                                         "DOI13", "DOI7", "DOI11", "DOI3", "DOI8", 
                                         "DOI6", "DOI12"),
                                 S2014=c("DOIa", "DOIa", "DOIa", "DOIa", 
                                         "DOIb", "DOIb", "DOIb", "DOIb", 
                                         "DOIc", "DOIc", "DOIc", "DOIc"), 
                                 journal2012=c("Nature", "Nature", "Nature", 
                                               "Nature", "Science", "Science", 
                                               "Science", "Science", "JoE", "JoE", 
                                               "JoE", "JoE"),
                                 cit=c(1,0,0,0,1,0,0,0,1,0,0,0))

Any help is much appreciated.

1 Answer:

Answer 0: (score: 1)

One idea using dplyr:

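library(dplyr)   # zoo must also be installed for zoo::na.locf

# Full outer join: every 2012 paper, with S2014 and cit filled in where it was cited
merge(Docs2012, Citing, by.x = c('T2012', 'Journal'), by.y = c('T2012', 'journal2012'), all = TRUE) %>% 
   arrange(Journal, S2014) %>%   # cited rows first within each journal (NAs sort last)
   group_by(Journal) %>% 
   # carry the citing DOI down to the uncited rows; uncited rows get cit = 0
   mutate(S2014 = zoo::na.locf(S2014), cit = replace(cit, is.na(cit), 0)) %>% 
   sample_n(4) %>%               # draw 4 rows per journal
   arrange(S2014, Journal, desc(cit)) %>%
   ungroup()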

# A tibble: 12 × 4
#    T2012  S2014 Journal   cit
#   <fctr> <fctr>  <fctr> <dbl>
#1    DOI1   DOIa  Nature     1
#2    DOI4   DOIa  Nature     0
#3    DOI5   DOIa  Nature     0
#4    DOI9   DOIa  Nature     0
#5    DOI2   DOIb Science     1
#6   DOI10   DOIb Science     0
#7    DOI7   DOIb Science     0
#8   DOI11   DOIb Science     0
#9    DOI3   DOIc     JoE     1
#10  DOI12   DOIc     JoE     0
#11   DOI6   DOIc     JoE     0
#12   DOI8   DOIc     JoE     0

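Explanation

  • We first merge the two data frames on T2012 and Journal/journal2012.
  • We arrange the resulting data frame by Journal and S2014, then group_by Journal.
  • We fill the S2014 variable with the latest non-NA value (using zoo::na.locf) and replace all NAs in the cit variable with 0.
  • We use sample_n to draw the sample (4 rows per group in your case).
  • We arrange and ungroup to get the desired output.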
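
One caveat about the pipeline above: sample_n(4) draws 4 rows per journal without looking at cit, so in a journal with more than four 2012 papers (Science has five in the toy data) the cited row itself can be dropped, and na.locf assumes a single citing paper per journal. As a hedged alternative, here is a minimal base R sketch that samples the controls separately for each citing row and always keeps the cited row. The names add_controls, pool and n_controls are made up for illustration, stringsAsFactors = FALSE is assumed so that DOIs and journals are plain character vectors, and the loop shows the logic only; it is not tuned for 600,000 rows.

```r
# Toy data from the question (stringsAsFactors = FALSE keeps DOIs as character).
Citing <- data.frame(T2012 = c("DOI1", "DOI2", "DOI3"),
                     S2014 = c("DOIa", "DOIb", "DOIc"),
                     journal2012 = c("Nature", "Science", "JoE"),
                     cit = c(1, 1, 1),
                     stringsAsFactors = FALSE)

Docs2012 <- data.frame(T2012 = paste0("DOI", 1:13),
                       Journal = c("Nature", "Science", "JoE", "Nature", "Nature",
                                   "JoE", "Science", "JoE", "Nature", "Science",
                                   "Science", "JoE", "Science"),
                       stringsAsFactors = FALSE)

# Split the 2012 DOIs by journal once, so each row only looks up its own journal.
pool <- split(Docs2012$T2012, Docs2012$Journal)

# add_controls() and n_controls are hypothetical names, not part of the answer.
add_controls <- function(citing, pool, n_controls = 3) {
  rows <- lapply(seq_len(nrow(citing)), function(i) {
    # Candidates: same journal, minus every 2012 paper this 2014 paper cites.
    cited_by_same <- citing$T2012[citing$S2014 == citing$S2014[i]]
    candidates <- setdiff(pool[[citing$journal2012[i]]], cited_by_same)
    k <- min(n_controls, length(candidates))
    # Sampling indices avoids sample()'s special case for length-1 input.
    controls <- candidates[sample.int(length(candidates), k)]
    data.frame(T2012 = c(citing$T2012[i], controls),
               S2014 = citing$S2014[i],
               journal2012 = citing$journal2012[i],
               cit = c(1, rep(0, k)),
               stringsAsFactors = FALSE)
  })
  do.call(rbind, rows)
}

set.seed(1)  # only to make the random draw reproducible
Citing_withcontrol <- add_controls(Citing, pool)
```

Each cited row comes first in its block of four, followed by up to three uncited papers from the same journal. If a journal has fewer than three eligible papers, the sketch returns as many controls as exist rather than failing.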