Remove duplicate rows from a data frame in R based on a specific condition

Time: 2015-09-23 13:58:35

Tags: r

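I am working on a project where I need to sort data based on how people voted. I cannot find a function that lets me remove duplicated rows based on certain conditions being met.

I am looking for a function that removes rows based on one column that contains duplicated values and another column that satisfies a specific condition.

For example, in the table below I want to remove every voter who voted in three different elections. Paul needs to be removed from this data frame.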

df <- data.frame(Name=c("Paul","Paul","Mary","Bill","Jane","Paul","Mary","John",
                        "Bill","John"),
                 ElectionDay=c("November 2010","November 2014",
                               "November 2010","November 2010","November 2014","November 2006",
                               "November 2014","November 2010","November 2014","November 2014"))

df
#    Name   ElectionDay
# 1  Paul November 2010
# 2  Paul November 2014
# 3  Mary November 2010
# 4  Bill November 2010
# 5  Jane November 2014
# 6  Paul November 2006
# 7  Mary November 2014
# 8  John November 2010
# 9  Bill November 2014
# 10 John November 2014

Here is an example of the result I am looking for:

   Name   ElectionDay
1  Mary November 2010
2  Bill November 2010
3  Jane November 2014
4  Mary November 2014
5  John November 2010
6  Bill November 2014
7  John November 2014

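3 answers:

Answer 0 (score: 6)

We can use data.table. We convert the 'data.frame' to a 'data.table' (setDT(df)) and, grouped by 'Name', get the number of unique 'ElectionDay' values (uniqueN(ElectionDay)). If that number is less than 3, we return the Subset of Data.table (.SD) for that group.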

library(data.table)#v1.9.6+
setDT(df)[, if(uniqueN(ElectionDay) < 3) .SD, by = Name]
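Note that setDT() converts df to a data.table by reference, so df itself is changed. If the original data.frame should stay as it is, a minimal sketch (assuming data.table 1.9.6 or later; the names dt, res and max_elections are only illustrative and not part of the answer) could work on a copy instead:

```r
library(data.table)

df <- data.frame(
  Name = c("Paul", "Paul", "Mary", "Bill", "Jane",
           "Paul", "Mary", "John", "Bill", "John"),
  ElectionDay = c("November 2010", "November 2014", "November 2010",
                  "November 2010", "November 2014", "November 2006",
                  "November 2014", "November 2010", "November 2014",
                  "November 2014")
)

# hypothetical threshold: voters with this many distinct elections or more are dropped
max_elections <- 3

# as.data.table() returns a copy, so df stays a plain data.frame
dt <- as.data.table(df)

# keep only the groups (voters) with fewer distinct election days than the threshold
res <- dt[, if (uniqueN(ElectionDay) < max_elections) .SD, by = Name]
res
```

The rows of res come back grouped by Name rather than in the original row order.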

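A similar base R option uses ave. We get the number of unique 'ElectionDay' elements grouped by 'Name' and check whether it is less than 3 to get a logical index. The index can then be used to subset the rows of the dataset.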

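# ave() returns a character vector here, so convert the counts to numbers before comparing
df[as.numeric(with(df, ave(as.character(ElectionDay), Name,
                FUN=function(x) length(unique(x))))) < 3,]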
#   Name   ElectionDay
#3  Mary November 2010
#4  Bill November 2010
#5  Jane November 2014
#7  Mary November 2014
#8  John November 2010
#9  Bill November 2014
#10 John November 2014
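When ave() is given a character vector, its result is also a character vector, so the per-voter counts come back as text, and text is compared alphabetically (for instance, "10" < 3 is TRUE in R). One way to avoid that, sketched here under the assumption that integer codes for the election days are acceptable (day_codes, n_elections and kept are illustrative names), is to pass ave() an integer vector:

```r
df <- data.frame(
  Name = c("Paul", "Paul", "Mary", "Bill", "Jane",
           "Paul", "Mary", "John", "Bill", "John"),
  ElectionDay = c("November 2010", "November 2014", "November 2010",
                  "November 2010", "November 2014", "November 2006",
                  "November 2014", "November 2010", "November 2014",
                  "November 2014")
)

# integer codes for the election days, so ave() returns integers rather than strings
day_codes <- as.integer(factor(df$ElectionDay))

# number of distinct elections per voter, repeated on every row of that voter
n_elections <- ave(day_codes, df$Name, FUN = function(x) length(unique(x)))

# keep the rows of voters with fewer than 3 distinct elections
kept <- df[n_elections < 3, ]
kept
```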

Answer 1 (score: 4)

The names that appear in more than 2 rows are computed with

names(which(table(df$Name) > 2))
#[1] "Paul"

So what you need is

df[!(df$Name %in% names(which(table(df$Name) > 2))), ]
#   Name   ElectionDay
#3  Mary November 2010
#4  Bill November 2010
#5  Jane November 2014
#7  Mary November 2014
#8  John November 2010
#9  Bill November 2014
#10 John November 2014
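table(df$Name) counts rows rather than distinct elections. The two agree here because every voter appears once per election. If the data could contain repeated Name/ElectionDay pairs, a possible sketch is to deduplicate those pairs before counting. The data frame df2 below is made up for illustration, as are the names counts, too_many and kept2:

```r
# made-up data: Mary has one election recorded twice
df2 <- data.frame(
  Name = c("Paul", "Paul", "Paul", "Mary", "Mary", "Mary"),
  ElectionDay = c("November 2006", "November 2010", "November 2014",
                  "November 2010", "November 2010", "November 2014")
)

# count distinct (Name, ElectionDay) pairs per voter instead of raw rows
counts <- table(unique(df2[, c("Name", "ElectionDay")])$Name)
too_many <- names(which(counts > 2))

# drop every row belonging to a voter with more than 2 distinct elections
kept2 <- df2[!(df2$Name %in% too_many), ]
kept2
```

With this input only Paul is removed. Mary has three rows but only two distinct elections, so her rows are kept.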

Answer 2 (score: 1)

Or you can use dplyr: count the number of elections each person voted in, then remove the rows where that count is 3:

library(dplyr)
df %>% 
  group_by(Name) %>% 
  mutate(NumberElections = length(unique(ElectionDay))) %>% 
  ungroup() %>% 
  filter(NumberElections != 3)
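The mutate/filter pair leaves the helper column NumberElections in the result, and != 3 only drops voters with exactly three elections. A more compact sketch, assuming dplyr is installed and assuming that voters with three or more elections should be dropped (an interpretation of the question, not something it states), filters within each group using n_distinct():

```r
library(dplyr)

df <- data.frame(
  Name = c("Paul", "Paul", "Mary", "Bill", "Jane",
           "Paul", "Mary", "John", "Bill", "John"),
  ElectionDay = c("November 2010", "November 2014", "November 2010",
                  "November 2010", "November 2014", "November 2006",
                  "November 2014", "November 2010", "November 2014",
                  "November 2014")
)

res <- df %>%
  group_by(Name) %>%
  # keep voters with fewer than 3 distinct elections
  filter(n_distinct(ElectionDay) < 3) %>%
  ungroup()
res
```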