I have the following dataset (example):
idnumber=c(12,12,13,14,14,15,16,17,18,18)
reg = c('FR','FR','DE','US','US','TZ','MK','GR','ES','ES')
code1=c('F56','G76','G56','T78','G78','G76','G64','T65','G79','G56')
code2=c('G56','I89','J83','S46','D78','G56','H89','G56','W34','T89')
df = data.frame(idnumber,reg,code1,code2)
which gives:
idnumber reg code1 code2
1 12 FR F56 G56
2 12 FR G76 I89
3 13 DE G56 J83
4 14 US T78 S46
5 14 US G78 D78
6 15 TZ G76 G56
7 16 MK G64 H89
8 17 GR T65 G56
9 18 ES G79 W34
10 18 ES G56 T89
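I would like to subset df to keep the rows where the value G56 appears in the code1 or code2 column, and also keep the other rows that share the same idnumber as a row matching G56, for example:
idnumber reg code1 code2
1 12 FR F56 G56
2 12 FR G76 I89
3 13 DE G56 J83
6 15 TZ G76 G56
8 17 GR T65 G56
9 18 ES G79 W34
10 18 ES G56 T89
I have millions of observations, with about 30 code columns.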
I hope the question is clear; any suggestions are welcome!
Cheers
Answer 0 (score: 2)
Here is one approach:
library(data.table)
setDT(df)
df[,.SD[any(code1 == 'G56' | code2 == 'G56')] ,.(idnumber)]
idnumber reg code1 code2
1: 12 FR F56 G56
2: 12 FR G76 I89
3: 13 DE G56 J83
4: 15 TZ G76 G56
5: 17 GR T65 G56
6: 18 ES G79 W34
7: 18 ES G56 T89
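Since the question mentions millions of rows and about 30 code columns, here is a minimal data.table sketch that avoids listing each column by hand. It assumes every code column has a name starting with "code" (an assumption, adjust the pattern to your data), and it swaps the grouped .SD filter for an %in% lookup on idnumber, which may be faster on large tables. The names dt, code_cols, hit and res are illustrative only.

```r
library(data.table)

# Example data from the question
idnumber <- c(12,12,13,14,14,15,16,17,18,18)
reg <- c('FR','FR','DE','US','US','TZ','MK','GR','ES','ES')
code1 <- c('F56','G76','G56','T78','G78','G76','G64','T65','G79','G56')
code2 <- c('G56','I89','J83','S46','D78','G56','H89','G56','W34','T89')
dt <- data.table(idnumber, reg, code1, code2)

# Assumption: all code columns share the prefix "code"
code_cols <- grep("^code", names(dt), value = TRUE)

# One logical per row: TRUE if any code column equals "G56"
hit <- dt[, Reduce(`|`, lapply(.SD, `==`, "G56")), .SDcols = code_cols]

# Keep every row whose idnumber has at least one matching row
res <- dt[idnumber %in% idnumber[hit]]
print(res)
```

If the code columns can contain NA, the comparison yields NA for those cells, so it may be worth handling missing values before filtering.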
Answer 1 (score: 1)
1. Base R
subset(df, idnumber %in% idnumber[code1=="G56" | code2=="G56"])
2. dplyr
library(dplyr)
df %>% filter(idnumber %in% idnumber[code1=="G56" | code2=="G56"])
Output
# idnumber reg code1 code2
# 1 12 FR F56 G56
# 2 12 FR G76 I89
# 3 13 DE G56 J83
# 4 15 TZ G76 G56
# 5 17 GR T65 G56
# 6 18 ES G79 W34
# 7 18 ES G56 T89
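The two one-liners above name code1 and code2 explicitly. With about 30 code columns, the same %in% idea can be written once over all of them. This is a base R sketch under the assumption that every code column name starts with "code"; code_cols, hit and res are illustrative names, not part of the original answer.

```r
# Example data from the question
idnumber <- c(12,12,13,14,14,15,16,17,18,18)
reg <- c('FR','FR','DE','US','US','TZ','MK','GR','ES','ES')
code1 <- c('F56','G76','G56','T78','G78','G76','G64','T65','G79','G56')
code2 <- c('G56','I89','J83','S46','D78','G56','H89','G56','W34','T89')
df <- data.frame(idnumber, reg, code1, code2)

# Assumption: all code columns share the prefix "code"
code_cols <- grep("^code", names(df), value = TRUE)

# TRUE for rows where at least one code column equals "G56"
hit <- rowSums(df[code_cols] == "G56", na.rm = TRUE) > 0

# Keep every row whose idnumber appears among the matching rows
res <- df[df$idnumber %in% df$idnumber[hit], ]
print(res)
```

The original row names (1, 2, 3, 6, 8, 9, 10) are preserved, which matches the output shown in the question.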
Answer 2 (score: 1)
Another base R solution:
subset(df,`class<-`(ave(cbind(as.character(code1),as.character(code2)),
idnumber,
FUN = function(v) ifelse("G56"%in%v,TRUE,FALSE)),"logical")[,1])
which gives
idnumber reg code1 code2
1 12 FR F56 G56
2 12 FR G76 I89
3 13 DE G56 J83
6 15 TZ G76 G56
8 17 GR T65 G56
9 18 ES G79 W34
10 18 ES G56 T89
Answer 3 (score: 0)
library(dplyr)
df %>%
  semi_join(df %>%
              filter(code1=="G56" | code2=="G56"),
            by="idnumber")
idnumber reg code1 code2
1 12 FR F56 G56
2 12 FR G76 I89
3 13 DE G56 J83
4 15 TZ G76 G56
5 17 GR T65 G56
6 18 ES G79 W34
7 18 ES G56 T89
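Edit: with 30 code columns, this may be simpler:
library(tidyr)  # provides pivot_longer() and pivot_wider()
df %>%
  semi_join(df %>%
              # reshape all code columns into name/value pairs
              pivot_longer(cols=-c(idnumber, reg)) %>%
              filter(value=="G56") %>%
              pivot_wider(id_cols=c(idnumber, reg)),
            by="idnumber")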
A third option:
df %>%
  semi_join(df %>%
              filter_at(vars(starts_with("code")), any_vars(. == "G56")),
            by="idnumber")
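In current dplyr, filter_at() and any_vars() are marked as superseded. A sketch of the same semi_join idea using if_any() follows, assuming a dplyr version that provides if_any() and that all code columns start with "code"; res is an illustrative name.

```r
library(dplyr)

# Example data from the question
idnumber <- c(12,12,13,14,14,15,16,17,18,18)
reg <- c('FR','FR','DE','US','US','TZ','MK','GR','ES','ES')
code1 <- c('F56','G76','G56','T78','G78','G76','G64','T65','G79','G56')
code2 <- c('G56','I89','J83','S46','D78','G56','H89','G56','W34','T89')
df <- data.frame(idnumber, reg, code1, code2)

res <- df %>%
  semi_join(
    # rows where any column starting with "code" equals "G56"
    df %>% filter(if_any(starts_with("code"), ~ .x == "G56")),
    by = "idnumber"
  )
print(res)
```

semi_join() keeps the rows of df in their original order and never duplicates them, even when an idnumber has several matching rows.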
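Edit: the OP now wants to keep records only if "G56" appears at least twice in the "code" columns for the same idnumber (see the comments below):
df %>%
  semi_join(df %>%
              # n = number of "G56" values in each row
              mutate(n=rowSums(.[grep("code", names(.))] =="G56")) %>%
              group_by(idnumber) %>%
              # keep ids whose rows contain more than one "G56" in total
              filter(sum(n)>1),
            by="idnumber")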
idnumber reg code1 code2 code3
1 12 FR F56 G56 M56
2 12 FR G76 I89 G56
3 18 ES G79 W34 W33
4 18 ES G56 G56 T89
Data used for this example:
idnumber=c(12,12,13,14,14,15,16,17,18,18)
reg = c('FR','FR','DE','US','US','TZ','MK','GR','ES','ES')
code1=c('F56','G76','G56','T78','G78','G76','G64','T65','G79','G56')
code2=c('G56','I89','J83','S46','D78','G56','H89','G56','W34','G56')
code3=c('M56','G56','J83','S46','D78','G46','H89','J56','W33','T89')
df = data.frame(idnumber,reg,code1,code2,code3)
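For comparison, the "at least twice per idnumber" rule can also be sketched in base R with rowSums() and ave(), without any packages. This assumes, as before, that code columns are identified by the "code" prefix; n, per_id and res are illustrative names.

```r
# Three-column example data from this answer
idnumber <- c(12,12,13,14,14,15,16,17,18,18)
reg <- c('FR','FR','DE','US','US','TZ','MK','GR','ES','ES')
code1 <- c('F56','G76','G56','T78','G78','G76','G64','T65','G79','G56')
code2 <- c('G56','I89','J83','S46','D78','G56','H89','G56','W34','G56')
code3 <- c('M56','G56','J83','S46','D78','G46','H89','J56','W33','T89')
df <- data.frame(idnumber, reg, code1, code2, code3)

# Assumption: all code columns share the prefix "code"
code_cols <- grep("^code", names(df), value = TRUE)

# Number of "G56" values in each row
n <- unname(rowSums(df[code_cols] == "G56", na.rm = TRUE))

# Total number of "G56" values per idnumber, repeated on each row of that id
per_id <- ave(n, df$idnumber, FUN = sum)

# Keep the ids where "G56" occurs at least twice overall
res <- df[per_id > 1, ]
print(res)
```

Only idnumber 12 (one G56 in each of its two rows) and idnumber 18 (two G56 values in one row) are kept, which agrees with the output shown above.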