Identify subsets that contain only repeats of an expression

Time: 2018-03-16 05:08:50

Tags: r

I have a dataset like this:

 df<-data.frame(x=c("A","A","A","A", "B","B","B","B","B", 
                   "C","C","C","C","C","D","D","D","D","D"),  
                y= as.factor(c(rep("Eoissp2",4),rep("Eoissp1",5),"Eoissp1","Eoisp4","Automerissp1","Automerissp2","Acharias",rep("Eoissp2",3),rep("Eoissp1",2))))

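For each subset of x, I want to identify whether the corresponding levels of y consist entirely of repeats containing the expression Eois. So A, B and D would be returned in a vector, because every level of y within A, B and D contains the expression Eois, whereas group C is made up of a variety of distinct levels (e.g. Eois, Automeris and Acharias). For this example, the output would be: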

   output<- c("A", "B", "D")
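
Read literally, the rule above amounts to checking, within each group of x, that every value of y contains the text Eois. A minimal base R sketch of that reading follows; the names `all_eois` and `output` are illustrative, and treating "contains Eois" as a case-sensitive literal match is an assumption.

```r
# Data copied from the question.
df <- data.frame(
  x = c(rep("A", 4), rep("B", 5), rep("C", 5), rep("D", 5)),
  y = as.factor(c(rep("Eoissp2", 4), rep("Eoissp1", 5),
                  "Eoissp1", "Eoisp4", "Automerissp1", "Automerissp2", "Acharias",
                  rep("Eoissp2", 3), rep("Eoissp1", 2)))
)

# For each group of x, test whether every y value contains the literal text "Eois"
# (assumption: a case-sensitive fixed-string match is what "contains Eois" means).
all_eois <- vapply(
  split(as.character(df$y), df$x),
  function(v) all(grepl("Eois", v, fixed = TRUE)),
  logical(1)
)

# Keep the names of the groups that passed the test.
output <- names(all_eois)[all_eois]
print(output)
# [1] "A" "B" "D"
```

Group C fails because Automerissp1, Automerissp2 and Acharias do not contain Eois.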

2 Answers:

Answer 0: (score: 0)

Using the new df:

> df %>% filter(str_detect(y,"Eois")) %>% group_by(x) %>% distinct(y) %>% 
    count() %>% filter(n==1) %>% select(x)
# A tibble: 2 x 1
# Groups:   x [2]
  x    
  <fct>
1 A    
2 B   
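
The result above lists A and B but not D. One way to see why is to count the distinct Eois-matching values of y per group. The base R sketch below mirrors the filter, distinct and count steps without any packages; the names `eois_rows`, `n_distinct_levels` and `single_level` are illustrative, not part of the answer.

```r
# Data copied from the question (the "new df").
df <- data.frame(
  x = c(rep("A", 4), rep("B", 5), rep("C", 5), rep("D", 5)),
  y = as.factor(c(rep("Eoissp2", 4), rep("Eoissp1", 5),
                  "Eoissp1", "Eoisp4", "Automerissp1", "Automerissp2", "Acharias",
                  rep("Eoissp2", 3), rep("Eoissp1", 2)))
)

# Keep only the rows whose y contains "Eois".
eois_rows <- df[grepl("Eois", df$y, fixed = TRUE), ]

# Count the distinct matching y values within each group of x.
n_distinct_levels <- vapply(
  split(as.character(eois_rows$y), eois_rows$x),
  function(v) length(unique(v)),
  integer(1)
)
print(n_distinct_levels)
# A B C D 
# 1 1 2 2 

# Groups with exactly one distinct matching value.
single_level <- names(n_distinct_levels)[n_distinct_levels == 1]
print(single_level)
# [1] "A" "B"
```

D is excluded under this rule because Eoissp2 and Eoissp1 count as two distinct values.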

(The answer below uses the original df posted by the question author.)

Using the pipe operator from magrittr and functions from dplyr:

> df %>% group_by(x) %>% distinct(y)
# A tibble: 7 x 2
# Groups:   x [3]
  x     y      
  <fct> <fct>  
1 A     plant1a
2 B     plant1b
3 C     plant1a
4 C     plant2a
5 C     plant3a
6 C     plant4a 
7 C     plant5a

You can then summarise the results like this:

> results <- df %>% group_by(x) %>% distinct(y) %>% 
    count() %>% filter(n==1) %>% select(x)
> results
# A tibble: 2 x 1
# Groups:   x [2]
  x    
  <fct>
1 A    
2 B   

If you know the original data frame always arrives ordered by x, you can drop the group_by part.

Answer 1: (score: 0)

A dplyr-based solution could be:

library(dplyr)
df %>% group_by(x) %>%
  filter(grepl("Eoiss", y)) %>%
  mutate(y = sub("\\d+", "", y)) %>%
  filter(n() >1 & length(unique(y)) == 1) %>%
  select(x) %>% unique(.)

# A tibble: 3 x 1
# Groups: x [3]
#  x     
#  <fctr>
#1 A     
#2 B     
#3 D
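
The same steps can be traced without dplyr. The following is a rough base R sketch of the logic in the pipeline above, not a replacement for it; the names `keep`, `stem`, `groups`, `ok` and `result` are illustrative.

```r
# Data copied from the question.
df <- data.frame(
  x = c(rep("A", 4), rep("B", 5), rep("C", 5), rep("D", 5)),
  y = as.factor(c(rep("Eoissp2", 4), rep("Eoissp1", 5),
                  "Eoissp1", "Eoisp4", "Automerissp1", "Automerissp2", "Acharias",
                  rep("Eoissp2", 3), rep("Eoissp1", 2)))
)

# Step 1: keep rows whose y matches "Eoiss" (note the double s, as in the answer).
keep <- grepl("Eoiss", df$y)

# Step 2: drop the first run of digits, so "Eoissp1" and "Eoissp2" both become "Eoissp".
stem <- sub("\\d+", "", as.character(df$y[keep]))

# Step 3: per group of x, require more than one row and a single distinct stem.
groups <- split(stem, df$x[keep])
ok <- vapply(
  groups,
  function(v) length(v) > 1 && length(unique(v)) == 1,
  logical(1)
)

result <- names(ok)[ok]
print(result)
# [1] "A" "B" "D"
```

With this data, C drops out only because a single one of its rows matches Eoiss (Eoisp4 has one s), so the more-than-one-row condition fails. A group with two identical matching rows plus unrelated species would pass this check, so it may be worth confirming that this is the intended behaviour.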

Data

df<-data.frame(x=c("A","A","A","A", "B","B","B","B","B", 
                   "C","C","C","C","C","D","D","D","D","D"),  
               y= as.factor(c(rep("Eoissp2",4),
      rep("Eoissp1",5),"Eoissp1","Eoisp4","Automerissp1","Automerissp2",
      "Acharias",rep("Eoissp2",3),rep("Eoissp1",2))))