Model<-c("A","A","A","A","A","B","B","B","B","B","C","C","C","C")
Price<-c(12,14,15,13,16,36,32,24,14,15,14,11,24,31)
region<-c("W","E","E","W","W","E","E","E","E","W","W","W","E","W")
dt<-data.frame(Model,Price,region)
Model Price region
1 A 12 W
2 A 14 E
3 A 15 E
4 A 13 W
5 A 16 W
6 B 36 E
7 B 32 E
8 B 24 E
9 B 14 E
10 B 15 W
11 C 14 W
12 C 11 W
13 C 24 E
14 C 31 W
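I want to remove a row if its region (W or E) occurs only once within that model type. All rows for model A are kept. Row 10 is removed because model B has only one W, and row 13 is removed because model C has only one E.
How can I do this in R? I have about 20,000 observations across thousands of model types. I may need to write a loop.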
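To make the rule concrete, here is a minimal sketch (not part of the original question) that cross-tabulates Model against region with base R's table(). Cells equal to 1 mark the (Model, region) pairs whose rows should be dropped. The name `counts` is only illustrative, and the sample data is rebuilt so the snippet runs on its own.

```r
# Rebuild the sample data from the question
Model  <- c("A","A","A","A","A","B","B","B","B","B","C","C","C","C")
Price  <- c(12,14,15,13,16,36,32,24,14,15,14,11,24,31)
region <- c("W","E","E","W","W","E","E","E","E","W","W","W","E","W")
dt <- data.frame(Model, Price, region)

# Count the rows for every (Model, region) pair
counts <- table(dt$Model, dt$region)
print(counts)

# Pairs that occur exactly once: their rows are the ones to remove
which(counts == 1, arr.ind = TRUE)
```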
Answer 0 (score: 4)
Model<-c("A","A","A","A","A","B","B","B","B","B","C","C","C","C")
Price<-c(12,14,15,13,16,36,32,24,14,15,14,11,24,31)
region<-c("W","E","E","W","W","E","E","E","E","W","W","W","E","W")
dt<-data.frame(Model,Price,region)
These rows will be removed:
dt[!(duplicated(dt[, -2]) | duplicated(dt[, -2], fromLast = TRUE)), ]
# Model Price region
# 10 B 15 W
# 13 C 24 E
These rows will be kept:
dt[duplicated(dt[, -2]) | duplicated(dt[, -2], fromLast = TRUE), ]
# Model Price region
# 1 A 12 W
# 2 A 14 E
# 3 A 15 E
# 4 A 13 W
# 5 A 16 W
# 6 B 36 E
# 7 B 32 E
# 8 B 24 E
# 9 B 14 E
# 11 C 14 W
# 12 C 11 W
# 14 C 31 W
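Why the two calls are needed: duplicated() flags only the second and later occurrences of a value, so on its own it misses the first one. Scanning again with fromLast = TRUE flags everything except the last occurrence, and OR-ing the two marks every value that occurs at least twice. A minimal sketch on a made-up vector `x` (not from the original answer):

```r
x <- c("a", "b", "a", "c", "a")

fwd <- duplicated(x)                    # FALSE FALSE  TRUE FALSE  TRUE
bwd <- duplicated(x, fromLast = TRUE)   #  TRUE FALSE  TRUE FALSE FALSE
repeated <- fwd | bwd                   #  TRUE FALSE  TRUE FALSE  TRUE

x[repeated]    # values that occur more than once
x[!repeated]   # values that occur exactly once
```

Applied to dt[, -2], the same logic treats each (Model, region) pair as the value being checked.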
For 20k observations and roughly 5,000 model types:
set.seed(1)
n <- 20000
dd <- data.frame(Model = sample(1:5000, n, TRUE),
Price = rpois(n, 15),
region = sample(c('E','W'), n, TRUE))
dim(dd[duplicated(dd[, -2]) | duplicated(dd[, -2], fromLast = TRUE), ])
# [1] 17289 3
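If you want more control over the threshold, you can use something like the following, which is almost as fast (although I only tried it with 200k observations and 10k models). Change the 1 to another number.
# seq_along() is used as the dummy vector so a character region column is not coerced to NA (R >= 4.0)
dim(dd[ave(seq_along(dd$region), dd[, -2], FUN = length) > 1, ])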
# [1] 17289 3
dt[ave(seq_along(dt$region), dt[, -2], FUN = length) > 1, ]
# Model Price region
# 1 A 12 W
# 2 A 14 E
# 3 A 15 E
# 4 A 13 W
# 5 A 16 W
# 6 B 36 E
# 7 B 32 E
# 8 B 24 E
# 9 B 14 E
# 11 C 14 W
# 12 C 11 W
# 14 C 31 W
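The threshold idea can be wrapped in a small helper. This is only a sketch: `keep_min_count` and its arguments `min_n` and `by` are hypothetical names, not part of the answer. It passes seq_len(nrow(df)) to ave() as a dummy vector, so it does not matter whether region is stored as a factor or as character.

```r
# Hypothetical helper: keep rows whose (Model, region) group has at least min_n rows
keep_min_count <- function(df, min_n = 2, by = c("Model", "region")) {
  # Size of the group each row belongs to
  group_size <- ave(seq_len(nrow(df)), df[by], FUN = length)
  df[group_size >= min_n, , drop = FALSE]
}

# Sample data from the question
dt <- data.frame(
  Model  = c("A","A","A","A","A","B","B","B","B","B","C","C","C","C"),
  Price  = c(12,14,15,13,16,36,32,24,14,15,14,11,24,31),
  region = c("W","E","E","W","W","E","E","E","E","W","W","W","E","W")
)

res2 <- keep_min_count(dt)             # drops rows 10 and 13
res3 <- keep_min_count(dt, min_n = 3)  # stricter: also drops the two A/E rows
```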
Answer 1 (score: 1)
You can create a counter variable and filter on it. Using the dplyr package:
library(dplyr)
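dt <- dt %>%
  group_by(Model) %>% filter(n_distinct(region) > 1) %>%  # keep models that appear in more than one region
  group_by(Model, region) %>% filter(n() > 1)             # drop (Model, region) pairs that occur only once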
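One thing to be aware of: the first filter also drops any model that appears in only one region, which the question does not explicitly ask for. The sample data has no such model, so the difference does not show there. The sketch below mimics both filters in base R (so it needs no extra package) on made-up data `ex`, in which model D appears only in region E. All names here are illustrative assumptions.

```r
# Hypothetical data: model D is present only in region E
ex <- data.frame(Model  = c("A", "A", "A", "D", "D"),
                 region = c("E", "E", "W", "E", "E"))

idx <- seq_len(nrow(ex))
# Number of distinct regions per model (mirrors n_distinct(region))
n_regions <- ave(idx, ex$Model, FUN = function(i) length(unique(ex$region[i])))
# Number of rows per (Model, region) pair (mirrors n())
pair_size <- ave(idx, ex$Model, ex$region, FUN = length)

two_step <- ex[n_regions > 1 & pair_size > 1, ]  # both filters: D is dropped entirely
one_step <- ex[pair_size > 1, ]                  # second filter only: D is kept
```

If single-region models should be kept, the second filter alone appears to be enough.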