dplyr | group_by vs anti_join | the most efficient way

Time: 2017-12-07 12:06:57

Tags: r dplyr

I have a large dataset, but here I have created a sample dataset with the same data-wrangling problem.

Data

brand=c('MS', 'Google', 'Apple', 'MS', 'FB', 'Apple', 'Oracle')
product=c('Window', 'Search', 'Iphone', 'Window', 'Network', 'Iphone', 'DB')
isExist=c('Yes', 'Yes', NA, 'No', NA, 'Yes', NA)
df= data.frame(brand, product, isExist)

The data looks like this

   brand product isExist
1     MS  Window     Yes
2 Google  Search     Yes
3  Apple  Iphone    <NA>
4     MS  Window      No
5     FB Network    <NA>
6  Apple  Iphone     Yes
7 Oracle      DB    <NA>

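Now I want the rows, keyed on brand and product (a composite key), that have NA in isExist and for which no other row with the same composite key has a value. That is, it should return FB and Oracle but not Apple, because one Apple row (row 6) has a value in isExist.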

I achieved this using anti_join; here is the code

library(dplyr)
testWithData <- df %>% filter(!is.na(isExist))
testWithoutData <- df %>% filter(is.na(isExist))
final <- unique(anti_join(testWithoutData, testWithData, by = c('brand', 'product')))

Output

   brand product isExist
1     FB Network    <NA>
2 Oracle      DB    <NA>

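This solution works, but it takes too much time, and I know it is not the most efficient way. I feel that group_by and filter could do some magic, but I am not sure how to write the query. Can anyone help me with this?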

1 Answer:

Answer 0: (score: 4)

brand=c('MS', 'Google', 'Apple', 'MS', 'FB', 'Apple', 'Oracle')
product=c('Window', 'Search', 'Iphone', 'Window', 'Network', 'Iphone', 'DB')
isExist=c('Yes', 'Yes', NA, 'No', NA, 'Yes', NA)
df= data.frame(brand, product, isExist)

library(dplyr)

df %>%
  group_by(brand, product) %>%            # for each brand/product pair (the composite key from the question)
  filter(sum(!is.na(isExist)) == 0) %>%   # count the non-NA values in the group and keep groups where the count is 0
  ungroup()

# # A tibble: 2 x 3
#      brand product isExist
#     <fctr>  <fctr>  <fctr>
#   1     FB Network    <NA>
#   2 Oracle      DB    <NA>
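
As a possible variant, the same grouped filter can be written with all(is.na(...)), which states the condition directly. The sketch below assumes the composite key from the question (brand and product) and adds distinct() to mirror the unique() call in the anti_join version; stringsAsFactors = FALSE is an assumption made here so the columns stay character vectors.

```r
library(dplyr)

# Sample data from the question, kept as character columns
df <- data.frame(
  brand   = c('MS', 'Google', 'Apple', 'MS', 'FB', 'Apple', 'Oracle'),
  product = c('Window', 'Search', 'Iphone', 'Window', 'Network', 'Iphone', 'DB'),
  isExist = c('Yes', 'Yes', NA, 'No', NA, 'Yes', NA),
  stringsAsFactors = FALSE
)

result <- df %>%
  group_by(brand, product) %>%     # one group per composite key
  filter(all(is.na(isExist))) %>%  # keep groups in which every isExist is NA
  ungroup() %>%
  distinct()                       # drop duplicate rows, like unique() in the question

result
```

On the sample data this keeps the FB and Oracle rows only. Whether it is faster than anti_join on a large dataset is not established here; it would need to be timed on the real data.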

You can understand the process above by running it step by step (the first 2 lines, then the first 3 lines, and so on).

df %>%
  arrange(brand, product) %>%                 # sort by the key so each group's rows sit together and are easier to read
  group_by(brand, product) %>%                # group by the composite key; this creates 5 sub-datasets in the background (see "Groups: brand, product [5]")
  mutate(Counter = sum(!is.na(isExist))) %>%  # count the non-NA values per group and add the count as a column, keeping all rows (like counting and joining back to the original data in one step)
  filter(Counter == 0) %>%                    # keep only rows with Counter == 0 (groups that contain only NA values)
  ungroup()                                   # drop the grouping
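
The question describes a composite key of brand and product, so the choice of grouping columns matters. The sketch below uses a small hypothetical data frame (df2, with an invented Apple/Ipad row that is not in the original data) to show how grouping on brand alone can differ from grouping on both columns, and compares the result with the anti_join approach from the question.

```r
library(dplyr)

# Hypothetical data: Apple/Ipad has only NA, but Apple/Iphone has a value
df2 <- data.frame(
  brand   = c('Apple', 'Apple', 'Apple', 'FB'),
  product = c('Iphone', 'Iphone', 'Ipad', 'Network'),
  isExist = c(NA, 'Yes', NA, NA),
  stringsAsFactors = FALSE
)

# Grouping on brand alone: every Apple row is dropped, including Apple/Ipad
by_brand <- df2 %>%
  group_by(brand) %>%
  filter(all(is.na(isExist))) %>%
  ungroup()

# Grouping on the composite key: Apple/Ipad is kept
by_key <- df2 %>%
  group_by(brand, product) %>%
  filter(all(is.na(isExist))) %>%
  ungroup()

# The anti_join approach from the question, for comparison
via_anti_join <- anti_join(
  filter(df2, is.na(isExist)),
  filter(df2, !is.na(isExist)),
  by = c('brand', 'product')
) %>%
  distinct()
```

Here by_brand keeps only FB/Network, while by_key and via_anti_join both keep Apple/Ipad and FB/Network. On the sample data in the question the two groupings happen to agree, because each brand has a single product.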