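I want to return a ProjectID when certain conditions are met. For example, in the data below, for every cluster that contains both project C and project A, the ProjectID of C should be returned in the row of project A. The example below shows the expected result, with the ProjectID returned in the Dependent column. I tried to solve this by splitting the data into one group per cluster with the group_by function from dplyr, but I am not sure how to look through the projects within each group to check whether the condition is met (in this case, that project C is in the cluster) and return the ProjectID. Any suggestions on how to solve this would be much appreciated.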
Cluster Project ProjectID Dependent
Aaa A 1 3
Aaa B 2
Aaa C 3
Bbb A 4
Bbb B 5
Ccc A 6 8
Ccc B 7
Ccc C 8
Ccc D 9
Answer 0 (score: 4)
I think this gives the expected output:
library(tidyverse)
df1 %>%
group_by(Cluster) %>%
mutate(Dependent = ifelse(Project == "A", ProjectID[Project=="C"], NA))
#output
# A tibble: 9 x 4
# Groups: Cluster [3]
Cluster Project ProjectID Dependent
<fct> <fct> <int> <int>
1 Aaa A 1 3
2 Aaa B 2 NA
3 Aaa C 3 NA
4 Bbb A 4 NA
5 Bbb B 5 NA
6 Ccc A 6 8
7 Ccc B 7 NA
8 Ccc C 8 NA
9 Ccc D 9 NA
Within each cluster, if the project is A, return the ProjectID of project C; otherwise return NA.
Data:
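A minimal base R sketch of what that expression does inside a single group, which may help explain why cluster Bbb (no project C) ends up as NA. The helper name `dependent_for_group` is hypothetical and only wraps the same `ifelse()` call; the key assumption being illustrated is that `ifelse()` pads a zero-length `yes` argument with NA.

```r
# Hypothetical helper: the expression used inside mutate(), applied to one group.
dependent_for_group <- function(Project, ProjectID) {
  ifelse(Project == "A", ProjectID[Project == "C"], NA)
}

# Cluster with both A and C: the row of A receives the ProjectID of C.
with_c <- dependent_for_group(c("A", "B", "C"), 1:3)

# Cluster without C: ProjectID[Project == "C"] is integer(0),
# which ifelse() pads with NA, so every row becomes NA.
without_c <- dependent_for_group(c("A", "B"), 4:5)

print(with_c)
print(without_c)
```

Note that if a cluster held more than one project C, `ProjectID[Project == "C"]` would have length greater than one and be recycled by position, so this form is best suited to data with at most one C per cluster.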
df1 <- read.table(text="Cluster Project ProjectID
Aaa A 1
Aaa B 2
Aaa C 3
Bbb A 4
Bbb B 5
Ccc A 6
Ccc B 7
Ccc C 8
Ccc D 9", header = TRUE)
Benchmark of the provided answers on the small data set:
library(microbenchmark)
microbenchmark(missuse = df1 %>%
group_by(Cluster) %>%
mutate(Dependent = ifelse(Project == "A", ProjectID[Project=="C"], NA)),
Rui_Barradas = lapply(split(df1, df1$Cluster), function(DF){
DF$Dependent <- NA
if(any(DF$Project == "A") && any(DF$Project == "C"))
DF$Dependent[DF$Project == "A"] <- DF$ProjectID[DF$Project == "C"]
DF
}),
MKR = left_join(df1,filter(df1, Project=="C"), by="Cluster") %>%
mutate(Dependent = ifelse(Project.x == "A", ProjectID.y, NA)) %>%
select(Cluster, Project = Project.x, ProjectID = ProjectID.x, Dependent)
)
Unit: milliseconds
expr min lq mean median uq max neval cld
missuse 3.525404 3.566450 4.220243 3.604535 3.785439 40.69046 100 b
Rui_Barradas 1.390526 1.423534 1.685952 1.495683 1.511552 16.30843 100 a
MKR 9.770077 9.959867 10.605632 10.215248 10.592078 21.14565 100 c
Benchmark on a larger data set (90k rows):
df1 <- df1[rep(1:nrow(df1), times = 10000),]
microbenchmark(missuse = df1 %>%
group_by(Cluster) %>%
mutate(Dependent = ifelse(Project == "A", ProjectID[Project=="C"], NA)),
Rui_Barradas = lapply(split(df1, df1$Cluster), function(DF){
DF$Dependent <- NA
if(any(DF$Project == "A") && any(DF$Project == "C"))
DF$Dependent[DF$Project == "A"] <- DF$ProjectID[DF$Project == "C"]
DF
}), times = 20)
Unit: milliseconds
expr min lq mean median uq max neval cld
missuse 25.05783 25.53072 29.95501 25.83243 28.49352 55.34345 20 a
Rui_Barradas 35.42203 36.85572 47.61315 39.87882 56.25432 95.80752 20 b
And with 900k rows:
df1 <- df1[rep(1:nrow(df1), times = 100000),] #original df1
Unit: milliseconds
expr min lq mean median uq max neval cld
missuse 466.6968 721.9709 945.8628 1062.6262 1101.914 1255.214 20 a
Rui_Barradas 718.8869 768.0912 1077.7594 934.1785 1308.145 1854.415 20 a
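I left MKR's answer out of the last two benchmarks because it crashed my session.
Disclaimer: I ran the benchmarks on a potato PC. I will retest later on another, newer PC and update the answer if the results differ (in terms of relative performance).
Update: I feel the data I generated was a bit deceptive. Here is another attempt: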
library(data.table) # rleid() and setDT() used below come from data.table
df1 <- df1[rep(1:nrow(df1), times = 10000),]
df1 %>%
mutate(rle = rleid(Cluster)) %>%
mutate(Cluster = paste(Cluster, rle, sep = "_")) %>%
select(-rle) -> df1
MKR2 <- function(df1){
setDT(df1)
df1[Project == "A"][df1[Project == "C"], on="Cluster", nomatch=0][
df1, on=.(Cluster, Project)][
,.(Cluster, Project, ProjectID = i.ProjectID.1, Dependent = i.ProjectID)]
}
So this is data with a lot of small groups.
Here I had to leave out Rui Barradas's solution because it took too long:
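To see why the relabelling produces many small groups, here is a small base R sketch. `run_id` is a hypothetical stand-in for `data.table::rleid()` built on `rle()`; it numbers consecutive runs of equal values, so every repetition of a cluster gets its own label once the run number is pasted on.

```r
# Hypothetical base R stand-in for data.table::rleid():
# number consecutive runs of equal values 1, 2, 3, ...
run_id <- function(x) {
  r <- rle(as.character(x))
  rep(seq_along(r$lengths), times = r$lengths)
}

# Two repetitions of a tiny data set with clusters Aaa, Aaa, Bbb.
cluster <- rep(c("Aaa", "Aaa", "Bbb"), times = 2)

# Appending the run number makes each repetition a separate cluster.
new_cluster <- paste(cluster, run_id(cluster), sep = "_")
print(new_cluster)
```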
microbenchmark(missuse = df1 %>%
group_by(Cluster) %>%
mutate(Dependent = ifelse(Project == "A", ProjectID[Project=="C"], NA)),
MKR = left_join(df1,filter(df1, Project=="C"), by="Cluster") %>%
mutate(Dependent = ifelse(Project.x == "A", ProjectID.y, NA)) %>%
select(Cluster, Project = Project.x, ProjectID = ProjectID.x, Dependent),
MKR2(df1),
times = 10
)
Unit: milliseconds
expr min lq mean median uq max neval cld
missuse 7445.97748 7815.2364 9609.4009 8350.0508 9565.2411 19965.5040 10 b
MKR 55.61109 59.9900 123.2263 80.4056 191.7361 250.5065 10 a
MKR2(df1) 100.97692 216.4811 994.8457 277.3159 1452.0668 4011.1804 10 a
Interesting stuff.
Answer 1 (score: 3)
With base R only, you can do the following.
sp <- lapply(split(dat, dat$Cluster), function(DF){
DF$Dependent <- NA
if(any(DF$Project == "A") && any(DF$Project == "C"))
DF$Dependent[DF$Project == "A"] <- DF$ProjectID[DF$Project == "C"]
DF
})
result <- do.call(rbind, sp)
row.names(result) <- NULL
result
# Cluster Project ProjectID Dependent
#1 Aaa A 1 3
#2 Aaa B 2 NA
#3 Aaa C 3 NA
#4 Bbb A 4 NA
#5 Bbb B 5 NA
#6 Ccc A 6 8
#7 Ccc B 7 NA
#8 Ccc C 8 NA
#9 Ccc D 9 NA
Data.
dat <-
structure(list(Cluster = structure(c(1L, 1L, 1L, 2L, 2L, 3L,
3L, 3L, 3L), .Label = c("Aaa", "Bbb", "Ccc"), class = "factor"),
Project = structure(c(1L, 2L, 3L, 1L, 2L, 1L, 2L, 3L, 4L), .Label = c("A",
"B", "C", "D"), class = "factor"), ProjectID = 1:9), .Names = c("Cluster",
"Project", "ProjectID"), class = "data.frame", row.names = c(NA,
-9L))
Answer 2 (score: 3)
A few good answers have already been provided, but I think another approach is a left_join of the data with itself (a self-join):
library(dplyr)
left_join(df1,filter(df1, Project=="C"), by="Cluster") %>%
mutate(Dependent = ifelse(Project.x == "A", ProjectID.y, NA)) %>%
select(Cluster, Project = Project.x, ProjectID = ProjectID.x, Dependent)
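The same self-join idea can be sketched in base R with `merge()`, for readers who want to follow the steps without dplyr. This is only an illustration under the assumption of at most one project C per cluster; the names `c_ids`, `C_ID` and `joined` are introduced here and are not part of the answer.

```r
df1 <- data.frame(
  Cluster   = c("Aaa", "Aaa", "Aaa", "Bbb", "Bbb", "Ccc", "Ccc", "Ccc", "Ccc"),
  Project   = c("A", "B", "C", "A", "B", "A", "B", "C", "D"),
  ProjectID = 1:9
)

# Lookup table: one row per cluster that contains project C.
c_ids <- df1[df1$Project == "C", c("Cluster", "ProjectID")]
names(c_ids)[2] <- "C_ID"

# Left join on Cluster: clusters without a C row get NA in C_ID.
joined <- merge(df1, c_ids, by = "Cluster", all.x = TRUE)
joined <- joined[order(joined$ProjectID), ]  # merge() may reorder rows

# Keep the ID of C only on the rows of project A.
joined$Dependent <- ifelse(joined$Project == "A", joined$C_ID, NA)
joined$C_ID <- NULL
print(joined)
```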
A data.table approach:
library(data.table)
setDT(df1)
df1[Project == "A"][df1[Project == "C"], on="Cluster", nomatch=0][
df1, on=.(Cluster, Project)][
,.(Cluster, Project, ProjectID = i.ProjectID.1, Dependent = i.ProjectID)]
# Cluster Project ProjectID Dependent
# 1 Aaa A 1 3
# 2 Aaa B 2 NA
# 3 Aaa C 3 NA
# 4 Bbb A 4 NA
# 5 Bbb B 5 NA
# 6 Ccc A 6 8
# 7 Ccc B 7 NA
# 8 Ccc C 8 NA
# 9 Ccc D 9 NA
It would be worth comparing the performance of these approaches.
Answer 3 (score: 1)
A data.table approach to this problem:
df1[Project == 'A'][df1[Project == 'C'],
Dependent := i.ProjectID,
on = 'Cluster'][df1, on =
c('Cluster', 'Project', 'ProjectID')]
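This subsets df1 where Project == 'A', joins that on 'Cluster' with the subset where Project == 'C', and then joins the result back to df1 on Cluster, Project and ProjectID.
I am not sure whether this will work on large data sets, because it involves several self-joins, especially when the data contains duplicates: data.table will not let you join if you have duplicate key values.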
Cluster Project ProjectID Dependent
1: Aaa A 1 3
2: Aaa B 2 NA
3: Aaa C 3 NA
4: Bbb A 4 NA
5: Bbb B 5 NA
6: Ccc A 6 8
7: Ccc B 7 NA
8: Ccc C 8 NA
9: Ccc D 9 NA
Hope this helps.
Answer 4 (score: 1)
Another dplyr way, which seems a bit faster than @missuse's approach:
df1 %>%
group_by(Cluster) %>%
mutate(Dependent= `[<-`(rep_len(NA_integer_,n()),
Project=="A",
value=ProjectID[match("C",Project)]))
# # A tibble: 9 x 4
# # Groups: Cluster [3]
# Cluster Project ProjectID Dependent
# <fctr> <fctr> <int> <int>
# 1 Aaa A 1 3
# 2 Aaa B 2 NA
# 3 Aaa C 3 NA
# 4 Bbb A 4 NA
# 5 Bbb B 5 NA
# 6 Ccc A 6 8
# 7 Ccc B 7 NA
# 8 Ccc C 8 NA
# 9 Ccc D 9 NA
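The call to `` `[<-` `` is the functional form of a subset assignment. A short base R sketch of what it does for one group (here the values of cluster Ccc), and of how `match()` handles a group without project C; the variable names `dependent` and `no_c` are only for illustration.

```r
Project   <- c("A", "B", "C", "D")
ProjectID <- 6:9

# Equivalent to: x <- rep_len(NA_integer_, 4); x[Project == "A"] <- value; x
dependent <- `[<-`(rep_len(NA_integer_, length(Project)),
                   Project == "A",
                   value = ProjectID[match("C", Project)])
print(dependent)

# match() returns NA when "C" is absent, and indexing with NA gives NA,
# so a cluster without C assigns NA to the row of A.
no_c <- ProjectID[match("C", c("A", "B"))]
print(no_c)
```

Because `match()` always returns exactly one position (the first match, or NA), the assigned value has length one even if a cluster contains several C rows.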