Loop (or apply?) over all combinations of two factor levels

Time: 2018-02-21 16:34:51

Tags: r combinatorics tidyverse

The first 9 rows of my long data frame look like this:

  ptid<-c(rep(3, 3), rep(4, 3), rep(5, 3))
  reviewer<-c("d", "b", "f", "a", "e", "c", "a", "f", "b")
  outcome<-c(rep("Yes", 2), rep("No", 4), rep("Yes", 3))
  dta <- data.frame(ptid, reviewer, outcome, stringsAsFactors=FALSE)

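Each value of ptid is repeated 3 times within dta. The reviewer variable takes 6 values (a to f), the outcome variable is binary, and there is no missing data.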

I made an empty matrix that looks like this:

mat<-matrix(, nrow=6, ncol=6, dimnames = list(letters[1:6], letters[1:6]))

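I want to fill the lower half of the matrix with how often each pair of reviewers agrees. I can do this for position [a,b] in the matrix with the following code: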

combo<-dta[which(dta$reviewer=="a" | dta$reviewer=="b"), c("ptid", "reviewer", "outcome")]
wide<-combo %>% spread(reviewer, outcome)
wide<-na.omit(wide)
ab_agree<-table(wide$a==wide$b)[2]
ab<-paste(signif((ab_agree/dim(wide)[1])*100, 3), "%", sep="")
mat["a", "b"]<-ab

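I would greatly appreciate help writing a function that fills each column of the matrix as efficiently as possible. I initially started writing a for loop, but I have a feeling there is a better way with apply.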

1 answer:

Answer 0: (score: 1)

I did not use apply or a for loop, but functions from dplyr and tidyr (since you are already using those packages).

library(dplyr)
library(tidyr)

First, I get all combinations of reviewer pairs:

reviewer_combos <- tibble(reviewer1 = letters[1:6], reviewer2 = letters[1:6]) %>% 
  complete(reviewer1, reviewer2) %>% 
  filter(reviewer1 != reviewer2)

reviewer_combos
# A tibble: 30 x 2
#    reviewer1 reviewer2
#    <chr>     <chr>    
#  1 a         b        
#  2 a         c        
#  3 a         d        
#  4 a         e        
#  5 a         f        
#  6 b         a        
#  7 b         c        
#  8 b         d        
#  9 b         e        
# 10 b         f        
# ... with 20 more rows
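
Since only the lower triangle is needed in the end, the 30 ordered pairs could be cut to 15 unordered ones. A minimal base R sketch, assuming the same six reviewer labels; `pair_mat` and `unordered_pairs` are names made up for illustration:

```r
reviewers <- letters[1:6]

# combn() returns one unordered pair per column; transpose to one pair per row
pair_mat <- t(combn(reviewers, 2))

unordered_pairs <- data.frame(
  reviewer1 = pair_mat[, 2],  # later letter: row index in the lower triangle
  reviewer2 = pair_mat[, 1],  # earlier letter: column index in the lower triangle
  stringsAsFactors = FALSE
)

nrow(unordered_pairs)
```

With unordered pairs the spread table is no longer square (rows b to f, columns a to e), so the ordered version above remains the simpler route when a full 6 x 6 matrix is wanted.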

Then join the answers of reviewer1 and reviewer2 onto these pairs with two left_join calls, and determine whether they agree:

reviewer_combos <- reviewer_combos %>% 
  left_join(dta, by = c("reviewer1" = "reviewer")) %>% 
  left_join(dta, by = c("reviewer2" = "reviewer"), suffix = c("", "2")) %>% 
  mutate(agree = as.integer(outcome == outcome2))

reviewer_combos
# A tibble: 66 x 7
#    reviewer1 reviewer2  ptid outcome ptid2 outcome2 agree
#    <chr>     <chr>     <dbl> <chr>   <dbl> <chr>    <int>
#  1 a         b          4.00 No       3.00 Yes          0
#  2 a         b          4.00 No       5.00 Yes          0
#  3 a         b          5.00 Yes      3.00 Yes          1
#  4 a         b          5.00 Yes      5.00 Yes          1
#  5 a         c          4.00 No       4.00 No           1
#  6 a         c          5.00 Yes      4.00 No           0
#  7 a         d          4.00 No       3.00 Yes          0
#  8 a         d          5.00 Yes      3.00 Yes          1
#  9 a         e          4.00 No       4.00 No           1
# 10 a         e          5.00 Yes      4.00 No           0
# ... with 56 more rows
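
One thing to watch in the joined table: ptid and ptid2 differ in many rows (the first row compares reviewer a on patient 4 with reviewer b on patient 3), because the joins match on reviewer name only. If agreement should be counted only on patients that both reviewers scored, which is what the spread/na.omit code in the question does, a base R sketch could look like the following. The names `same_pt` and `same_patient_agree` are hypothetical, and this is an alternative under that assumption, not the method used in this answer:

```r
dta <- data.frame(
  ptid = c(3, 3, 3, 4, 4, 4, 5, 5, 5),
  reviewer = c("d", "b", "f", "a", "e", "c", "a", "f", "b"),
  outcome = c("Yes", "Yes", "No", "No", "No", "No", "Yes", "Yes", "Yes"),
  stringsAsFactors = FALSE
)

# Self-join on ptid so every row pairs two reviews of the same patient
same_pt <- merge(dta, dta, by = "ptid", suffixes = c("1", "2"))

# Drop rows where a reviewer is paired with themselves
same_pt <- same_pt[same_pt$reviewer1 != same_pt$reviewer2, ]
same_pt$agree <- same_pt$outcome1 == same_pt$outcome2

# Proportion of shared patients on which each pair agrees;
# pairs with no shared patient come out as NA
same_patient_agree <- with(
  same_pt,
  tapply(agree, list(reviewer1, reviewer2), mean)
)

# Keep only the lower triangle
same_patient_agree[upper.tri(same_patient_agree, diag = TRUE)] <- NA
```

In the sample data, b and f share patients 3 and 5 and agree on one of them, so that cell is 0.5; b and c never review the same patient, so their cell stays NA.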

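Finally, use group_by and summarize to compute the share of agreement for each reviewer pair (a proportion between 0 and 1), and use spread to display it in the desired format: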

reviewer_percentage <- reviewer_combos %>% 
  group_by(reviewer1, reviewer2) %>% 
  summarize(percentage_agree = sum(agree)/n()) %>% 
  spread(reviewer2, percentage_agree) 

reviewer_percentage
# A tibble: 6 x 7
# Groups:   reviewer1 [6]
# reviewer1      a      b      c      d      e      f
# * <chr>      <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
# 1 a         NA      0.500  0.500  0.500  0.500  0.500
# 2 b          0.500 NA      0      1.00   0      0.500
# 3 c          0.500  0     NA      0      1.00   0.500
# 4 d          0.500  1.00   0     NA      0      0.500
# 5 e          0.500  0      1.00   0     NA      0.500
# 6 f          0.500  0.500  0.500  0.500  0.500 NA  

If you need it as a matrix with the upper triangle set to NA, you can do this:

reviewer_percentage_mat <- reviewer_percentage %>% 
  ungroup() %>% 
  select(-reviewer1) %>% 
  as.matrix()

rownames(reviewer_percentage_mat) <- reviewer_percentage$reviewer1
reviewer_percentage_mat[upper.tri(reviewer_percentage_mat)] <-  NA

reviewer_percentage_mat
#    a   b   c   d   e  f
# a  NA  NA  NA  NA  NA NA
# b 0.5  NA  NA  NA  NA NA
# c 0.5 0.0  NA  NA  NA NA
# d 0.5 1.0 0.0  NA  NA NA
# e 0.5 0.0 1.0 0.0  NA NA
# f 0.5 0.5 0.5 0.5 0.5 NA
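
The question stores each cell as a percentage string built with paste and signif. To get the same look from a numeric matrix of proportions, a small helper along these lines might do; `format_percent` is a hypothetical name, and the 3 x 3 matrix below is just the a to c corner of the output above:

```r
# Hypothetical helper: turn proportions into strings such as "50%", keeping NA cells
format_percent <- function(m, digits = 3) {
  out <- matrix(NA_character_, nrow(m), ncol(m), dimnames = dimnames(m))
  keep <- !is.na(m)
  out[keep] <- paste0(signif(m[keep] * 100, digits), "%")
  out
}

# Small example: the a to c corner of the lower-triangle matrix
m <- matrix(NA_real_, 3, 3, dimnames = list(letters[1:3], letters[1:3]))
m["b", "a"] <- 0.5
m["c", "a"] <- 0.5
m["c", "b"] <- 0

pct <- format_percent(m)
```

The result is a character matrix, so it suits display only; keep the numeric matrix for any further calculation.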

Data

dta <- structure(list(ptid = c(3, 3, 3, 4, 4, 4, 5, 5, 5), 
                      reviewer = c("d", "b", "f", "a", "e", "c", "a", "f", "b"), 
                      outcome = c("Yes", "Yes", "No", "No", "No", "No", "Yes", "Yes", "Yes")), 
                 .Names = c("ptid", "reviewer", "outcome"), 
                 row.names = c(NA, -9L), 
                 class = "data.frame")