Match a value within a subset and return a value

Time: 2018-04-02 07:12:43

Tags: r if-statement dplyr match

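I would like to return a ProjectID when certain conditions are met. For example, in the data below, in every cluster that contains both project C and project A, the ProjectID of C should be returned in the row of project A. The example below shows the expected result, with the ProjectID returned in the Dependent column. I tried to solve this by splitting the clusters into groups with dplyr's group_by function, but I am not sure how to look up a project within each group to check whether the condition is met (here, that project C is in the cluster) and return its ProjectID. Any suggestions on how to solve this are much appreciated.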

Cluster Project ProjectID Dependent
Aaa        A         1       3
Aaa        B         2
Aaa        C         3
Bbb        A         4
Bbb        B         5
Ccc        A         6       8
Ccc        B         7
Ccc        C         8
Ccc        D         9

5 answers:

Answer 0 (score: 4)

I think this will give the expected output:

library(tidyverse)
df1 %>%
  group_by(Cluster) %>%
  mutate(Dependent = ifelse(Project == "A", ProjectID[Project=="C"], NA))
#output
# A tibble: 9 x 4
# Groups: Cluster [3]
  Cluster Project ProjectID Dependent
  <fct>   <fct>       <int>     <int>
1 Aaa     A               1         3
2 Aaa     B               2        NA
3 Aaa     C               3        NA
4 Bbb     A               4        NA
5 Bbb     B               5        NA
6 Ccc     A               6         8
7 Ccc     B               7        NA
8 Ccc     C               8        NA
9 Ccc     D               9        NA

Within each cluster, if the project is A, return the ProjectID of project C; otherwise return NA.
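The rule above can also be written so that the looked-up value is always a single element, which may be safer when a cluster has no C or more than one C. This is only a sketch, not the answer's own code: it assumes that taking the first C per cluster is acceptable, and it swaps in `match()` and dplyr's `if_else()` for the `ifelse()` call above. The name `res` is introduced here for illustration.

```r
library(dplyr)

# Example data from the question
df1 <- data.frame(
  Cluster   = c("Aaa", "Aaa", "Aaa", "Bbb", "Bbb", "Ccc", "Ccc", "Ccc", "Ccc"),
  Project   = c("A", "B", "C", "A", "B", "A", "B", "C", "D"),
  ProjectID = 1:9
)

res <- df1 %>%
  group_by(Cluster) %>%
  # match("C", Project) is the position of the first C in the group,
  # or NA when the group has no C, so the lookup always has length 1
  mutate(Dependent = if_else(Project == "A",
                             ProjectID[match("C", Project)],
                             NA_integer_)) %>%
  ungroup()

res
```

Because `if_else()` checks types, the fallback is written as `NA_integer_` to match the integer ProjectID column.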

Data:

df1 <- read.table(text="Cluster Project ProjectID 
Aaa        A         1       
Aaa        B         2
Aaa        C         3
Bbb        A         4
Bbb        B         5
Ccc        A         6       
Ccc        B         7
Ccc        C         8
Ccc        D         9", header = TRUE)

Benchmark of the answers provided, on the small data set:

library(microbenchmark)

microbenchmark(missuse = df1 %>%
                 group_by(Cluster) %>%
                 mutate(Dependent = ifelse(Project == "A", ProjectID[Project=="C"], NA)),
               Rui_Barradas = lapply(split(df1, df1$Cluster), function(DF){
                 DF$Dependent <- NA
                 if(any(DF$Project == "A") && any(DF$Project == "C"))
                   DF$Dependent[DF$Project == "A"] <- DF$ProjectID[DF$Project == "C"]
                 DF
               }),
               MKR = left_join(df1,filter(df1, Project=="C"),  by="Cluster") %>%
                   mutate(Dependent = ifelse(Project.x == "A",  ProjectID.y, NA)) %>%
                   select(Cluster, Project = Project.x, ProjectID = ProjectID.x, Dependent)
               )

Unit: milliseconds
         expr      min       lq      mean    median        uq      max neval cld
      missuse 3.525404 3.566450  4.220243  3.604535  3.785439 40.69046   100  b 
 Rui_Barradas 1.390526 1.423534  1.685952  1.495683  1.511552 16.30843   100 a  
          MKR 9.770077 9.959867 10.605632 10.215248 10.592078 21.14565   100   c

Benchmark on a larger data set (90k rows):

df1 <- df1[rep(1:nrow(df1), times = 10000),]

microbenchmark(missuse = df1 %>%
                 group_by(Cluster) %>%
                 mutate(Dependent = ifelse(Project == "A", ProjectID[Project=="C"], NA)),
               Rui_Barradas = lapply(split(df1, df1$Cluster), function(DF){
                 DF$Dependent <- NA
                 if(any(DF$Project == "A") && any(DF$Project == "C"))
                   DF$Dependent[DF$Project == "A"] <- DF$ProjectID[DF$Project == "C"]
                 DF
               }), times = 20)

Unit: milliseconds
         expr      min       lq     mean   median       uq      max neval cld
      missuse 25.05783 25.53072 29.95501 25.83243 28.49352 55.34345    20  a 
 Rui_Barradas 35.42203 36.85572 47.61315 39.87882 56.25432 95.80752    20   b

And with 900k rows:

df1 <- df1[rep(1:nrow(df1), times = 100000),] #original df1

Unit: milliseconds
         expr      min       lq      mean    median       uq      max neval cld
      missuse 466.6968 721.9709  945.8628 1062.6262 1101.914 1255.214    20   a
 Rui_Barradas 718.8869 768.0912 1077.7594  934.1785 1308.145 1854.415    20   a

I left MKR's answer out of the last two benchmarks because it crashed my session.

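Disclaimer: I ran the benchmarks on a potato PC. I will re-test later on another, newer PC and update the answer if the results differ (in terms of relative performance).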

Update: I feel the data I generated was a bit deceptive. Here is another attempt:

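library(data.table)  # provides rleid()

# assumes df1 is the original 9-row data again
df1 <- df1[rep(1:nrow(df1), times = 10000),]

# give every run of identical Cluster values its own label,
# so that each repeated block becomes a separate group
df1 %>%
  mutate(rle = rleid(Cluster)) %>%
  mutate(Cluster = paste(Cluster, rle, sep = "_")) %>%
  select(-rle) -> df1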

MKR2 <- function(df1){
setDT(df1)
df1[Project == "A"][df1[Project == "C"], on="Cluster", nomatch=0][
  df1, on=.(Cluster, Project)][
    ,.(Cluster, Project, ProjectID = i.ProjectID.1, Dependent = i.ProjectID)]
}

So this is data with many groups.
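To see why plain row repetition does not add groups while relabelling does, here is a small base R sketch. It repeats the example data only 3 times, and it uses `rle()` as a stand-in for `rleid()`; the names `big`, `run_id`, `n_before` and `n_after` are illustrative.

```r
# Example data from the question
df1 <- data.frame(
  Cluster   = c("Aaa", "Aaa", "Aaa", "Bbb", "Bbb", "Ccc", "Ccc", "Ccc", "Ccc"),
  Project   = c("A", "B", "C", "A", "B", "A", "B", "C", "D"),
  ProjectID = 1:9
)

# Repeating the rows makes the data longer but keeps the same 3 clusters
big <- df1[rep(seq_len(nrow(df1)), times = 3), ]
n_before <- length(unique(big$Cluster))

# Number each run of identical consecutive Cluster values
runs   <- rle(as.character(big$Cluster))
run_id <- rep(seq_along(runs$lengths), runs$lengths)

# Appending the run number turns every repeated block into its own cluster
big$Cluster <- paste(big$Cluster, run_id, sep = "_")
n_after <- length(unique(big$Cluster))

c(before = n_before, after = n_after)
```

With 3 repetitions the cluster count goes from 3 to 9, so the grouped approaches have to process many small groups instead of a few large ones.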

Here I had to leave out Rui Barradas's solution because it took too long:

microbenchmark(missuse = df1 %>%
                 group_by(Cluster) %>%
                 mutate(Dependent = ifelse(Project == "A", ProjectID[Project=="C"], NA)),
               MKR = left_join(df1,filter(df1, Project=="C"),  by="Cluster") %>%
                   mutate(Dependent = ifelse(Project.x == "A",  ProjectID.y, NA)) %>%
                   select(Cluster, Project = Project.x, ProjectID = ProjectID.x, Dependent),
               MKR2(df1),
               times = 10
               )

Unit: milliseconds
      expr        min        lq      mean    median        uq        max neval cld
   missuse 7445.97748 7815.2364 9609.4009 8350.0508 9565.2411 19965.5040    10   b
       MKR   55.61109   59.9900  123.2263   80.4056  191.7361   250.5065    10  a 
 MKR2(df1)  100.97692  216.4811  994.8457  277.3159 1452.0668  4011.1804    10  a 

Interesting stuff.

Answer 1 (score: 3)

With base R only, you could do the following.

sp <- lapply(split(dat, dat$Cluster), function(DF){
    DF$Dependent <- NA
    if(any(DF$Project == "A") && any(DF$Project == "C"))
        DF$Dependent[DF$Project == "A"] <- DF$ProjectID[DF$Project == "C"]
    DF
})

result <- do.call(rbind, sp)
row.names(result) <- NULL
result
#  Cluster Project ProjectID Dependent
#1     Aaa       A         1         3
#2     Aaa       B         2        NA
#3     Aaa       C         3        NA
#4     Bbb       A         4        NA
#5     Bbb       B         5        NA
#6     Ccc       A         6         8
#7     Ccc       B         7        NA
#8     Ccc       C         8        NA
#9     Ccc       D         9        NA
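A vectorised base R variant that avoids `split()` is sketched below. It is not part of the answer above: it assumes each cluster has at most one project C, and the lookup vector `c_id` is a name introduced here.

```r
# Example data from the question
dat <- data.frame(
  Cluster   = c("Aaa", "Aaa", "Aaa", "Bbb", "Bbb", "Ccc", "Ccc", "Ccc", "Ccc"),
  Project   = c("A", "B", "C", "A", "B", "A", "B", "C", "D"),
  ProjectID = 1:9
)

# Named lookup vector: cluster name -> ProjectID of that cluster's project C
is_c <- dat$Project == "C"
c_id <- setNames(dat$ProjectID[is_c], as.character(dat$Cluster[is_c]))

# Look up every row's cluster; clusters without a C yield NA
looked_up <- unname(c_id[as.character(dat$Cluster)])

# Keep the looked-up ID only on the rows of project A
dat$Dependent <- ifelse(dat$Project == "A", looked_up, NA)
dat
```

The lookup is done once for the whole column, so no per-cluster data frames need to be built and bound back together.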

Data.

dat <-
structure(list(Cluster = structure(c(1L, 1L, 1L, 2L, 2L, 3L, 
3L, 3L, 3L), .Label = c("Aaa", "Bbb", "Ccc"), class = "factor"), 
    Project = structure(c(1L, 2L, 3L, 1L, 2L, 1L, 2L, 3L, 4L), .Label = c("A", 
    "B", "C", "D"), class = "factor"), ProjectID = 1:9), .Names = c("Cluster", 
"Project", "ProjectID"), class = "data.frame", row.names = c(NA, 
-9L))

Answer 2 (score: 3)

A few good answers have already been provided, but I think another approach could be a left_join of the data with itself (a self-join):

library(dplyr)
left_join(df1,filter(df1, Project=="C"),  by="Cluster") %>%
  mutate(Dependent = ifelse(Project.x == "A",  ProjectID.y, NA)) %>%
  select(Cluster, Project = Project.x, ProjectID = ProjectID.x, Dependent)
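A variant of the same self-join that renames the looked-up column before joining, so no `.x`/`.y` suffixes have to be cleaned up afterwards, might look like the sketch below. The names `c_ids`, `C_ID` and `res` are illustrative, and the sketch assumes at most one project C per cluster (otherwise the join would duplicate rows).

```r
library(dplyr)

# Example data from the question
df1 <- data.frame(
  Cluster   = c("Aaa", "Aaa", "Aaa", "Bbb", "Bbb", "Ccc", "Ccc", "Ccc", "Ccc"),
  Project   = c("A", "B", "C", "A", "B", "A", "B", "C", "D"),
  ProjectID = 1:9
)

# One row per cluster that has a project C, holding that project's ID
c_ids <- df1 %>%
  filter(Project == "C") %>%
  select(Cluster, C_ID = ProjectID)

res <- df1 %>%
  left_join(c_ids, by = "Cluster") %>%   # clusters without a C get NA
  mutate(Dependent = if_else(Project == "A", C_ID, NA_integer_)) %>%
  select(-C_ID)

res
```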

A data.table approach:

library(data.table)
setDT(df1)
df1[Project == "A"][df1[Project == "C"], on="Cluster", nomatch=0][
df1, on=.(Cluster, Project)][
,.(Cluster, Project, ProjectID = i.ProjectID.1, Dependent = i.ProjectID)]


#       Cluster Project ProjectID Dependent
# 1     Aaa       A         1         3
# 2     Aaa       B         2        NA
# 3     Aaa       C         3        NA
# 4     Bbb       A         4        NA
# 5     Bbb       B         5        NA
# 6     Ccc       A         6         8
# 7     Ccc       B         7        NA
# 8     Ccc       C         8        NA
# 9     Ccc       D         9        NA

Hopefully these approaches are worth considering in terms of performance.

Answer 3 (score: 1)

A data.table way to solve this:

library(data.table)
setDT(df1)  # convert df1 to a data.table by reference

df1[Project == 'A'][df1[Project == 'C'],
                    Dependent := i.ProjectID,
                    on = 'Cluster'][df1, on = c('Cluster', 'Project', 'ProjectID')]

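First, subset df1 to the rows where Project == 'A'.

Then join that subset on 'Cluster' with the subset of df1 where Project == 'C', assigning C's ProjectID to Dependent.

Finally, join the result back to df1 on Cluster, Project and ProjectID.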

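I am not sure whether this will work on large data sets, since it involves several self-joins, especially when the data contain duplicates: data.table will not let you join if you have duplicated key values.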
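One way to cut down on the chained self-joins is a single update join, sketched below. This is an alternative, not the code from this answer: it assumes at most one project C per cluster, and the names `dt` and `c_ids` are introduced for illustration.

```r
library(data.table)

# Example data from the question
dt <- data.table(
  Cluster   = c("Aaa", "Aaa", "Aaa", "Bbb", "Bbb", "Ccc", "Ccc", "Ccc", "Ccc"),
  Project   = c("A", "B", "C", "A", "B", "A", "B", "C", "D"),
  ProjectID = 1:9
)

# Lookup table: for each cluster with a C, the row it should update
# (the project A row of the same cluster) and the value to write there
c_ids <- dt[Project == "C", .(Cluster, Project = "A", Dependent = ProjectID)]

# Update join: add Dependent by reference on matching rows only;
# rows without a match are left as NA
dt[c_ids, on = .(Cluster, Project), Dependent := i.Dependent]

dt
```

Since `:=` modifies `dt` in place, no intermediate copies of the full table are created.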

   Cluster Project ProjectID Dependent
1:     Aaa       A         1         3
2:     Aaa       B         2        NA
3:     Aaa       C         3        NA
4:     Bbb       A         4        NA
5:     Bbb       B         5        NA
6:     Ccc       A         6         8
7:     Ccc       B         7        NA
8:     Ccc       C         8        NA
9:     Ccc       D         9        NA

Hope this helps.

Answer 4 (score: 1)

Another dplyr way, which seems to be a bit faster than @missuse's:

df1 %>%
  group_by(Cluster) %>%
  mutate(Dependent= `[<-`(rep_len(NA_integer_,n()),
                          Project=="A",
                          value=ProjectID[match("C",Project)]))

# # A tibble: 9 x 4
# # Groups:   Cluster [3]
# Cluster Project ProjectID Dependent
# <fctr>  <fctr>     <int>     <int>
# 1     Aaa       A         1         3
# 2     Aaa       B         2        NA
# 3     Aaa       C         3        NA
# 4     Bbb       A         4        NA
# 5     Bbb       B         5        NA
# 6     Ccc       A         6         8
# 7     Ccc       B         7        NA
# 8     Ccc       C         8        NA
# 9     Ccc       D         9        NA
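The two building blocks of this answer can be tried in isolation. The sketch below uses made-up values (`is_a`, `7L`) rather than the question's data: calling the replacement function `[<-` directly returns the modified vector, and `match()` returns NA when the value is absent, which is what makes clusters without a C come out as NA.

```r
# Logical index marking the positions to fill (hypothetical values)
is_a <- c(TRUE, FALSE, FALSE, TRUE)

# `[<-`(x, i, value = v) returns x with x[i] replaced by v,
# here starting from an all-NA integer vector of length 4
y <- `[<-`(rep_len(NA_integer_, 4), is_a, value = 7L)
y

# match() gives NA when "C" is not found, so the lookup yields NA
no_c <- c(10L, 20L)[match("C", c("A", "B"))]
no_c
```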