Dplyr对分组中的值进行计算,将每个项目与该组中的所有其他项目进行比较

时间:2019-11-15 13:57:58

标签: r dataframe dplyr lapply

我想弄清楚分组中的值是否与分组中的其他值足够不同。具体来说,我想算出同一位学生的比赛结束时间是否与同一天另一堂课的开始时间相匹配。使用菱形,这是等效的代码:

library(ggplot2)
diamonds %>% group_by(color, cut) %>% 
  mutate(clash = sum(
           lapply(
             diamonds %>% 
               filter(color == color, cut == cut, carat != carat) %$% carat,
             function(x) ifelse(x < carat - 0.01 && x > carat + 0.01, 1, 0)))) %>%
  arrange(color, cut, clash)

计划是如果冲突大于1,那么我知道另一颗钻石的克拉大小与该组中的钻石非常接近。这给了我以下错误:

Error in sum(sapply(diamonds %>% filter(color == color, cut == cut, carat !=  : 
  invalid 'type' (list) of argument

这使对钻石的第二次通话显得晦涩

1 个答案:

答案 0 :(得分:2)

您可以使用pmap代替lapply,它更适合tidyverse内:

library(tidyverse)

myfun <- function(.color, .cut, .carat){
 diamonds %>%
    filter(color == .color, cut == .cut, !between(carat, .carat - 0.01, .carat + 0.01)) %>%
    nrow()
}

diamonds %>% 
  mutate(clash = pmap_int(list(color, cut, carat), myfun)) %>%
  arrange(color, cut, clash)

# A tibble: 53,940 x 11
   carat cut   color clarity depth table price     x     y     z clash
   <dbl> <ord> <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl> <int>
 1  1.01 Fair  D     SI2      64.6    56  3003  6.31  6.24  4.05   124
 2  1.01 Fair  D     SI2      64.7    57  3871  6.31  6.27  4.07   124
 3  1.01 Fair  D     SI1      66.3    55  4118  6.22  6.17  4.11   124
 4  1.01 Fair  D     SI2      65.3    55  4205  6.33  6.19  4.09   124
 5  1.01 Fair  D     SI1      65.9    60  4276  6.32  6.18  4.12   124
 6  1.01 Fair  D     SI2      64.6    62  4538  6.26  6.21  4.03   124
 7  1.01 Fair  D     SI1      63.5    58  4751  6.35  6.25  4      124
 8  1.01 Fair  D     SI1      64.6    60  4751  6.12  6.08  3.94   124
 9  1.01 Fair  D     SI1      66.9    54  4751  6.25  6.21  4.17   124
10  1.01 Fair  D     SI1      66.2    56  5122  6.05  6.1   4.02   124

请注意,此解决方案有效,但效率不高。您可以轻松地修改以下代码以按组操作:

diamonds2 <- diamonds %>%
  count(color, carat, cut)

myfun2 <- function(.color, .cut, .carat){
  diamonds2 %>%
    filter(color == .color, cut == .cut, !between(carat, .carat - 0.01, .carat + 0.01)) %>%
    pull(n) %>% sum
}

diamonds2 %>% 
  mutate(clash = pmap_int(list(color, cut, carat), myfun2)) %>%
  left_join(diamonds, ., by = c("color", "carat", "cut")) %>%
  arrange(color, cut, clash)

结果相同,但是第二个版本(使用myfun2)的速度更快。

编辑

要查看我们还使用clarity进行过滤的示例,请参见以下示例:

diamonds3 <- diamonds %>%
  count(color, carat, cut, clarity)


myfun3 <- function(.color, .cut, .carat, .clarity){
  diamonds3 %>%
    filter(color == .color, cut == .cut, clarity == .clarity, 
           !between(carat, .carat - 0.01, .carat + 0.01)) %>%
    pull(n) %>% sum
}

 myfun3(.color = "D", .cut == "Fair", .clarity = "I1", .carat = 1.5)   
[1] 3