I have a large dataset, but here I've created a small sample with the same data-wrangling problem.
Data
brand=c('MS', 'Google', 'Apple', 'MS', 'FB', 'Apple', 'Oracle')
product=c('Window', 'Search', 'Iphone', 'Window', 'Network', 'Iphone', 'DB')
isExist=c('Yes', 'Yes', NA, 'No', NA, 'Yes', NA)
df= data.frame(brand, product, isExist)
The data looks like this:
brand product isExist
1 MS Window Yes
2 Google Search Yes
3 Apple Iphone <NA>
4 MS Window No
5 FB Network <NA>
6 Apple Iphone Yes
7 Oracle DB <NA>
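Now I want the rows, keyed on brand and product (a composite key), that have an NA entry in isExist and for which no other row with the same composite key has a value. That is, it should return FB and Oracle but not Apple, because one Apple row (row 6) has a value in isExist.
I implemented this with anti_join; here is the code: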
library(dplyr)
testWithData <- df %>% filter(!is.na(isExist))
testWithoutData <- df %>% filter(is.na(isExist))
final <- unique(anti_join(testWithoutData, testWithData, by = c('brand', 'product')))
Output
brand product isExist
1 FB Network <NA>
2 Oracle DB <NA>
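This solution works, but it takes too long, and I know it isn't the most efficient approach. I suspect group_by and filter could do the trick, but I'm not sure how to write the query. Can someone help me with this?
Answer 0 (score: 4)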
brand=c('MS', 'Google', 'Apple', 'MS', 'FB', 'Apple', 'Oracle')
product=c('Window', 'Search', 'Iphone', 'Window', 'Network', 'Iphone', 'DB')
isExist=c('Yes', 'Yes', NA, 'No', NA, 'Yes', NA)
df= data.frame(brand, product, isExist)
library(dplyr)
df %>%
group_by(brand) %>% # for each brand
filter(sum(!is.na(isExist)) == 0) %>% # get sum of values that are not NA and keep rows where the sum is 0
ungroup()
# # A tibble: 2 x 3
# brand product isExist
# <fctr> <fctr> <fctr>
# 1 FB Network <NA>
# 2 Oracle DB <NA>
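The code above groups by brand only, which happens to match the composite key on this sample. A minimal sketch of the same idea grouped on both brand and product, as the question asks, might look like the following. The name only_na is just an illustrative choice, all(is.na(...)) is used here as an equivalent of the sum(...) == 0 test, and distinct() stands in for the unique() call from the question.

```r
library(dplyr)

df <- data.frame(
  brand   = c("MS", "Google", "Apple", "MS", "FB", "Apple", "Oracle"),
  product = c("Window", "Search", "Iphone", "Window", "Network", "Iphone", "DB"),
  isExist = c("Yes", "Yes", NA, "No", NA, "Yes", NA)
)

only_na <- df %>%
  group_by(brand, product) %>%     # composite key from the question
  filter(all(is.na(isExist))) %>%  # keep groups in which every isExist is NA
  ungroup() %>%
  distinct()                       # drop duplicate rows, like unique() in the question

only_na
```

On the sample data this keeps the FB and Oracle rows and drops Apple, because one Apple/Iphone row has a value.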
You can understand the process above by running it step by step (the first 2 lines, then the first 3 lines, and so on).
df %>%
arrange(brand) %>% # order brands to have a better visualisation
group_by(brand) %>% # group by brand and create (in the background) 5 sub-datasets, one per brand (see the Groups: brand [5])
mutate(Counter = sum(!is.na(isExist))) %>% # count how many times you have non NA values, based on a brand, and add it as a column while keeping all rows (this is like counting and joining back to the original dataset at the same time)
filter(Counter == 0) %>% # keep only rows with Counter = 0 (those are the ones with only NA values)
ungroup() # forget the grouping
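To look at the counter on its own, one option is to collapse each group to a single row with summarise instead of mutate. This is only a sketch for inspection, not part of the filtering pipeline; the name counts is an illustrative choice.

```r
library(dplyr)

df <- data.frame(
  brand   = c("MS", "Google", "Apple", "MS", "FB", "Apple", "Oracle"),
  product = c("Window", "Search", "Iphone", "Window", "Network", "Iphone", "DB"),
  isExist = c("Yes", "Yes", NA, "No", NA, "Yes", NA)
)

counts <- df %>%
  group_by(brand) %>%                           # one group per brand
  summarise(Counter = sum(!is.na(isExist)))     # number of non-NA values in each group

counts
# Apple 1, FB 0, Google 1, MS 2, Oracle 0
```

The brands with Counter equal to 0 (FB and Oracle) are exactly the ones the filter step keeps.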