Question

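I have a large data frame with many columns and about 200k rows. The rows are sorted by a group variable, and each group can have one or more entries. The other columns should have the same value within each group, but in some cases they don't. It looks like this: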

group   name    age    color    city
1       Anton   50     orange   NY
1       Anton   21     red      NY
1       Anton   21     red      NJ
2       Martin  78     black    LA
2       Martin  78     blue     LA
3       Maria   29     red      NC
3       Maria   29     pink     LV
4       Jake    33     blue     NJ

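If age or city is not identical across all rows of a group (an indication of an observation error), I want to remove all entries of that group. Otherwise, I want to keep all of its entries.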

My desired output is:

group   name    age    color    city
2       Martin  78     black    LA
2       Martin  78     blue     LA
4       Jake    33     blue     NJ

The closest I've got is:

dup <- df[ duplicated(df[,c("group","name","color")]) | duplicated(df[,c("group","name","color")],fromLast=TRUE)    ,"group"]
df_nodup <- df[!(df$group %in% dup),]

However, this is far from doing everything I need.

P.S.: I asked the same question for Python/pandas. I'd also like a solution for R.

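Edit: Frank's answer helps in understanding the principle of the solution, and his second suggestion works, but it is slow (about 15 minutes on my df). user20650's answer is harder to understand but runs much faster (about 10 seconds).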

Answer 1

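This is similar to Frank's approach: you can count the number (length) of unique combinations of age and city within each group, using ave. You can then subset the data to the groups where that count is exactly 1.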

# your data

df <- read.table(text="group   name    age    color    city
1       Anton   50     orange   NY
1       Anton   21     red      NY
1       Anton   21     red      NJ
2       Martin  78     black    LA
2       Martin  78     blue     LA
3       Maria   29     red      NC
3       Maria   29     pink     LV
4       Jake    33     blue     NJ", header=TRUE)

# calculate and subset

df[with(df, ave(paste(age, city), group, FUN=function(x) length(unique(x))))==1,]

#   group   name age color city
# 4     2 Martin  78 black   LA
# 5     2 Martin  78  blue   LA
# 8     4   Jake  33  blue   NJ
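To make the one-liner above easier to follow, here is a minimal sketch that splits it into steps and exposes the intermediate per-row count. The names key, n_combos and result are made up for illustration, and ave is called on row indices rather than on the pasted strings (a small variation so the count stays numeric); the data is rebuilt so the snippet runs on its own.

```r
# Same example data as in the answer
df <- read.table(text = "group name age color city
1 Anton 50 orange NY
1 Anton 21 red NY
1 Anton 21 red NJ
2 Martin 78 black LA
2 Martin 78 blue LA
3 Maria 29 red NC
3 Maria 29 pink LV
4 Jake 33 blue NJ", header = TRUE)

# One key per row, combining the columns that must agree within a group
key <- paste(df$age, df$city)

# For every row, the number of distinct keys found in that row's group.
# ave() applies FUN per group and recycles the single result over the group.
n_combos <- ave(seq_along(key), df$group,
                FUN = function(i) length(unique(key[i])))

# Keep only rows whose group has exactly one age/city combination
result <- df[n_combos == 1, ]
print(n_combos)
print(result)
```

Printing n_combos before subsetting is a quick way to check which groups are inconsistent and how many distinct combinations each one has.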

Answer 2

Here's one way:

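# Cross-tabulate by name, age and city; a cell is non-NA where that combination occurs
temp <- tapply(df$group, list(df$name, df$age, df$city), unique)
temp[!is.na(temp)] <- 1
# Keep the names that occur with exactly one age/city combination
keepers <- names(which(apply(temp, 1, sum, na.rm=TRUE)==1))

df[df$name %in% keepers, ]
#  group   name age color city
#4     2 Martin  78 black   LA
#5     2 Martin  78  blue   LA
#8     4   Jake  33  blue   NJ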

An alternative, slightly simpler way:

temp2 <- unique(df[,c('name','age','city')])
keepers2 <- names(which(tapply(temp2$name, temp2$name, length)==1))

df[df$name %in% keepers2, ]
#  group   name age color city
#4     2 Martin  78 black   LA
#5     2 Martin  78  blue   LA
#8     4   Jake  33  blue   NJ
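Both variants above pick the rows to keep by name, which works for this data because each name belongs to exactly one group. If the same name could appear in more than one group, a variation keyed on group may be safer. The following is only a sketch under that assumption; temp3, counts, keepers3 and result are names made up for illustration, and the data is rebuilt so the snippet runs standalone.

```r
# Same example data as in the question
df <- read.table(text = "group name age color city
1 Anton 50 orange NY
1 Anton 21 red NY
1 Anton 21 red NJ
2 Martin 78 black LA
2 Martin 78 blue LA
3 Maria 29 red NC
3 Maria 29 pink LV
4 Jake 33 blue NJ", header = TRUE)

# Distinct (group, age, city) combinations
temp3 <- unique(df[, c("group", "age", "city")])

# A group that appears exactly once has a single age/city combination
counts <- table(temp3$group)
keepers3 <- names(counts)[counts == 1]

# %in% compares the integer group column against the character labels
result <- df[df$group %in% keepers3, ]
print(result)
```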

Answer 3

Here's an approach using dplyr:

df <- read.table(text = "
  group   name    age    color    city
  1       Anton   50     orange   NY
  1       Anton   21     red      NY
  1       Anton   21     red      NJ
  2       Martin  78     black    LA
  2       Martin  78     blue     LA
  3       Maria   29     red      NC
  3       Maria   29     pink     LV
  4       Jake    33     blue     NJ 
", header = TRUE)

library(dplyr)
df %>% 
  group_by(group) %>%
  filter(n_distinct(age) == 1 && n_distinct(city) == 1)

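I think it's easy to see what's going on: you group the rows, then filter to keep a group only when it has a single distinct age and a single distinct city.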
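To see what the filter condition evaluates per group, the counts can be computed explicitly with summarise. This is a sketch for inspection only, assuming dplyr is installed; n_age, n_city, counts and kept are names made up for illustration.

```r
library(dplyr)

# Same example data as in the question
df <- read.table(text = "group name age color city
1 Anton 50 orange NY
1 Anton 21 red NY
1 Anton 21 red NJ
2 Martin 78 black LA
2 Martin 78 blue LA
3 Maria 29 red NC
3 Maria 29 pink LV
4 Jake 33 blue NJ", header = TRUE)

# One row per group with the number of distinct ages and cities
counts <- df %>%
  group_by(group) %>%
  summarise(n_age = n_distinct(age), n_city = n_distinct(city))

# Groups for which the filter condition is TRUE
kept <- counts$group[counts$n_age == 1 & counts$n_city == 1]
print(counts)
print(kept)
```

A table like counts also shows which column caused a group to be dropped, which can help when tracking down the observation errors.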

Remove observations from DF if duplicated in specific columns, while other columns must differ
