df <- data.frame(id = c(1, 1, 1, 2, 2),
                 gender = c("Female", "Female", "Male", "Female", "Male"),
                 variant = c("a", "b", "c", "d", "e"))
> df
id gender variant
1 1 Female a
2 1 Female b
3 1 Male c
4 2 Female d
5 2 Male e
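I want to remove duplicate rows from the data.frame based on the gender column. I know there is a similar question (here), but the difference is that I want to remove duplicate rows within each subset of the data set, where each subset is defined by a unique id.
My desired result is: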
id gender variant
1 1 Female a
3 1 Male c
4 2 Female d
5 2 Male e
I have tried the following and it works, but I wonder whether there is a cleaner, more efficient way to do it?
out = list()
for(i in 1:2){
  df2 <- subset(df, id == i)
  out[[i]] <- df2[!duplicated(df2$gender), ]
}
do.call(rbind.data.frame, out)
Answer 0 (score: 4)
df[!duplicated(df[c("id","gender")]),]
# id gender variant
# 1 1 Female a
# 3 1 Male c
# 4 2 Female d
# 5 2 Male e
Another way to do this, using subset, is as follows:
subset(df, !duplicated(subset(df, select=c(id, gender))))
# id gender variant
# 1 1 Female a
# 3 1 Male c
# 4 2 Female d
# 5 2 Male e
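Both one-liners rely on duplicated() applied to a data frame, which flags a row as a duplicate when the same combination of values already appeared in an earlier row. A minimal sketch of the intermediate logical vector, assuming the df from the question (the names dup and res are introduced here only for illustration):

```r
df <- data.frame(id = c(1, 1, 1, 2, 2),
                 gender = c("Female", "Female", "Male", "Female", "Male"),
                 variant = c("a", "b", "c", "d", "e"))

# TRUE marks rows whose (id, gender) pair already occurred in an earlier row
dup <- duplicated(df[c("id", "gender")])
dup
# [1] FALSE  TRUE FALSE FALSE FALSE

# Negating the vector keeps only the first row of each (id, gender) pair
res <- df[!dup, ]
```

Because the check runs over the whole data frame at once, no explicit loop over id values is needed.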
Answer 1 (score: 1)
Here is a dplyr-based solution, in case you are interested (edited to include Gregor's suggestion):
library(dplyr)
group_by(df, id, gender) %>% slice(1)
#> # A tibble: 4 x 3
#> # Groups: id, gender [4]
#> id gender variant
#> <dbl> <fctr> <fctr>
#> 1 1 Female a
#> 2 1 Male c
#> 3 2 Female d
#> 4 2 Male e
Depending on which values of variant should be removed, it may also be worth using the arrange function.
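As a rough sketch of that idea: slice(1) keeps whichever row comes first in each group, so sorting beforehand decides which variant survives. The example below assumes, purely for illustration, that the alphabetically last variant should be kept; the name res is hypothetical.

```r
library(dplyr)

df <- data.frame(id = c(1, 1, 1, 2, 2),
                 gender = c("Female", "Female", "Male", "Female", "Male"),
                 variant = c("a", "b", "c", "d", "e"))

# Sort so that the row to keep comes first within each (id, gender) group;
# desc(variant) puts the alphabetically last variant on top (an assumption
# made for this example, not a requirement from the question)
res <- df %>%
  arrange(id, gender, desc(variant)) %>%
  group_by(id, gender) %>%
  slice(1) %>%
  ungroup()
```

With the sample data, this keeps variant b instead of a for the id 1 / Female group; the other groups have only one row each and are unaffected.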