I know there are many ways to remove duplicates, but my problem seems to be a bit different.
I have a data.frame similar to this one:
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
x <- data.frame(id = c(1, 1, 1, 1, 2, 3, 3),
                date = as.Date(c("2016-04-24", "2016-04-24", "2016-04-24",
                                 "2016-04-24", "2016-04-24", "2016-04-28",
                                 "2016-04-28")),
                code = c("a", "b", "b", "a", "a", "a", "a"))
x
#> id date code
#> 1 1 2016-04-24 a
#> 2 1 2016-04-24 b
#> 3 1 2016-04-24 b
#> 4 1 2016-04-24 a
#> 5 2 2016-04-24 a
#> 6 3 2016-04-28 a
#> 7 3 2016-04-28 a
I want to filter out all duplicates where code is "a", but not those where code is "b". The expected output should look like this:
x[c(1:3, 5:6), ]
#> id date code
#> 1 1 2016-04-24 a
#> 2 1 2016-04-24 b
#> 3 1 2016-04-24 b
#> 5 2 2016-04-24 a
#> 6 3 2016-04-28 a
I asked a similar question here: Ignore value conditionally within group_by in dplyr, and it is the basis for my attempts below. None of them work, though, and it is driving me crazy.
x %>% group_by(id, date) %>%
  filter(!(code == "a" & duplicated(code) == "a"))
#> # A tibble: 7 x 3
#> # Groups: id, date [3]
#> id date code
#> <dbl> <date> <fct>
#> 1 1. 2016-04-24 a
#> 2 1. 2016-04-24 b
#> 3 1. 2016-04-24 b
#> 4 1. 2016-04-24 a
#> 5 2. 2016-04-24 a
#> 6 3. 2016-04-28 a
#> 7 3. 2016-04-28 a
x %>% group_by(id, date) %>%
  filter(!(duplicated(code) == "a" & "a" %in% code))
#> # A tibble: 7 x 3
#> # Groups: id, date [3]
#> id date code
#> <dbl> <date> <fct>
#> 1 1. 2016-04-24 a
#> 2 1. 2016-04-24 b
#> 3 1. 2016-04-24 b
#> 4 1. 2016-04-24 a
#> 5 2. 2016-04-24 a
#> 6 3. 2016-04-28 a
#> 7 3. 2016-04-28 a
Created on 2018-08-17 by the reprex package (v0.2.0).
I suspect the problem is that the duplicated() call is not returning TRUE or FALSE, but I am not sure.
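Answer 0 (score: 2)
After grouping by 'id' and 'date', create a logical vector that is TRUE where 'code' is 'a' and apply duplicated to it; keep the rows where that vector is not a duplicate, or (|) where 'code' is not 'a':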
x %>%
group_by(id, date) %>%
filter(!duplicated(code == "a") | code != 'a')
# A tibble: 5 x 3
# Groups: id, date [3]
# id date code
# <dbl> <date> <fct>
#1 1 2016-04-24 a
#2 1 2016-04-24 b
#3 1 2016-04-24 b
#4 2 2016-04-24 a
#5 3 2016-04-28 a
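For readers wondering why the attempts in the question return all seven rows: duplicated() does return a logical vector, but comparing that vector with the string "a" coerces it to "TRUE"/"FALSE", so the comparison is never TRUE and nothing gets filtered out. Below is a minimal base R sketch on the codes of the first group (id 1); the variable names dup, attempt and keep are only illustrative and do not appear in the question.

```r
# Codes of the first group (id == 1, date == 2016-04-24)
code <- c("a", "b", "b", "a")

# duplicated() returns a logical vector
dup <- duplicated(code)
dup
#> [1] FALSE FALSE  TRUE  TRUE

# Comparing a logical with "a" coerces it to "TRUE"/"FALSE",
# which never equals "a"
attempt <- dup == "a"
attempt
#> [1] FALSE FALSE FALSE FALSE

# The condition used in this answer: keep the first "a" and every non-"a" row
keep <- !duplicated(code == "a") | code != "a"
keep
#> [1]  TRUE  TRUE  TRUE FALSE
```

Inside filter() the same keep vector is computed once per (id, date) group, which is why only the second "a" of each group is dropped.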
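Answer 1 (score: 2)
Another approach, using slice. Group by id, date and code. If the group contains any a (a group is either all a or all something else), take only the first row; otherwise return the whole group: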
library(dplyr)
x %>%
group_by(id, date, code) %>%
slice(if(any(code == "a")) 1 else 1:n())
Result:
# A tibble: 5 x 3
# Groups: id, date, code [4]
id date code
<dbl> <date> <fct>
1 1 2016-04-24 a
2 1 2016-04-24 b
3 1 2016-04-24 b
4 2 2016-04-24 a
5 3 2016-04-28 a
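The same "first row of each a group, all rows otherwise" logic can be sketched without dplyr, in case it helps to see what slice is doing per group. This is an illustrative base R analogue rather than part of the answer; it uses ave() to number the rows within each (id, date, code) group, and pos and res are names chosen here.

```r
x <- data.frame(id = c(1, 1, 1, 1, 2, 3, 3),
                date = as.Date(c("2016-04-24", "2016-04-24", "2016-04-24",
                                 "2016-04-24", "2016-04-24", "2016-04-28",
                                 "2016-04-28")),
                code = c("a", "b", "b", "a", "a", "a", "a"))

# Position of each row within its (id, date, code) group
pos <- ave(seq_len(nrow(x)), x$id, x$date, x$code, FUN = seq_along)

# Keep every non-"a" row, but only the first row of each "a" group
res <- x[x$code != "a" | pos == 1, ]
rownames(res)
#> [1] "1" "2" "3" "5" "6"
```

The retained row names match the expected output x[c(1:3, 5:6), ] from the question.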
Answer 2 (score: 2)
With data.table, you can do the following:
library(data.table)
setDT(x)
x[ code != "a" | !duplicated(x, by=c("id", "date", "code")) ]
id date code
1: 1 2016-04-24 a
2: 1 2016-04-24 b
3: 1 2016-04-24 b
4: 2 2016-04-24 a
5: 3 2016-04-28 a
This is similar to @akrun's answer, but no grouping is needed because duplicated.data.table has a by= argument. In base R (thanks @Moody_Mudskipper), it can be translated as follows:
# x is a plain data.frame here (i.e. before setDT(x))
x[x$code != "a" | !duplicated(x[c("id", "date", "code")]), ]
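If more than one code should be deduplicated, the base R expression can be wrapped in a small function. The helper below, drop_dups_for, and its arguments dedupe and by are hypothetical names introduced for illustration only; it is a sketch that assumes a plain data.frame with a code column, not something taken from the answer.

```r
x <- data.frame(id = c(1, 1, 1, 1, 2, 3, 3),
                date = as.Date(c("2016-04-24", "2016-04-24", "2016-04-24",
                                 "2016-04-24", "2016-04-24", "2016-04-28",
                                 "2016-04-28")),
                code = c("a", "b", "b", "a", "a", "a", "a"))

# Hypothetical helper (not from the answer): drop repeated rows only for
# the codes listed in `dedupe`, judging repetition on the columns in `by`.
drop_dups_for <- function(df, dedupe = "a", by = c("id", "date", "code")) {
  is_dup <- duplicated(df[by])
  df[!(df$code %in% dedupe) | !is_dup, ]
}

res_a  <- drop_dups_for(x)                       # deduplicate "a" only
res_ab <- drop_dups_for(x, dedupe = c("a", "b")) # deduplicate "a" and "b"
```

With the default arguments this keeps rows 1, 2, 3, 5 and 6, the same rows as the expression above.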
Answer 3 (score: 0)
Here is an example that does not use duplicated:
data.frame(x %>%
             filter(code == "a") %>%
             group_by(id, date) %>%
             summarise(code = first(code))) %>%
  rbind(data.frame(x %>% filter(code == "b")))
Answer 4 (score: 0)
Another approach in base R:
x$y <- cumsum(x$code=="b") * (x$code == "b")
unique(x)[-4]
# id date code
# 1 1 2016-04-24 a
# 2 1 2016-04-24 b
# 3 1 2016-04-24 b
# 5 2 2016-04-24 a
# 6 3 2016-04-28 a
(That said, I would probably prefer what I suggested in my comment under Frank's answer.)
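To make the helper-column trick above easier to follow, here is a short sketch that prints the intermediate values. It builds the same column as a separate vector y instead of adding it to x; is_b and res are names used only in this illustration.

```r
x <- data.frame(id = c(1, 1, 1, 1, 2, 3, 3),
                date = as.Date(c("2016-04-24", "2016-04-24", "2016-04-24",
                                 "2016-04-24", "2016-04-24", "2016-04-28",
                                 "2016-04-28")),
                code = c("a", "b", "b", "a", "a", "a", "a"))

is_b <- x$code == "b"

# Running counter that is unique for each "b" row and 0 everywhere else
y <- cumsum(is_b) * is_b
y
#> [1] 0 1 2 0 0 0 0

# Every "b" row is now distinct, so unique() can only collapse the other
# rows; [-4] then drops the helper column again
res <- unique(cbind(x, y))[-4]
rownames(res)
#> [1] "1" "2" "3" "5" "6"
```

Note that, as written, the trick protects only rows whose code is "b"; rows with any other code get y equal to 0 and are deduplicated in the same way as "a".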
With the tidyverse, I would do it like this:
library(tidyverse)
x %>% split(.$code) %>% map_at("a",distinct) %>% bind_rows
# id date code
# 1 1 2016-04-24 a
# 2 2 2016-04-24 a
# 3 3 2016-04-28 a
# 4 1 2016-04-24 b
# 5 1 2016-04-24 b