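I have a data frame in the following format, and I want to subset it so that, for each project, I keep only the activities that come before the first funding activity:
project <- c('A', 'A', 'A', 'B', 'B', 'B', 'B', 'C', 'C')
activity <- c('kickoff', 'funding', 'delivery', 'kickoff', 'kickoff', 'funding', 'kickoff', 'kickoff', 'delivery')
df <- data.frame(project, activity)
I expect the output to contain only these rows:
project activity
      A  kickoff
      B  kickoff
      B  kickoff
      C  kickoff
      C delivery
Any suggestions?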
Answer 0 (score: 2)
Using dplyr:
library(dplyr)
df %>%
  group_by(project) %>%
  dplyr::filter(cummin(activity != "funding") == 1)
which gives:
# project activity
# <fctr> <fctr>
# 1 A kickoff
# 2 B kickoff
# 3 B kickoff
# 4 C kickoff
# 5 C delivery
Using base R:
do.call(rbind, lapply(split(df, df$project), function(x) {
  x[cummin(x$activity != "funding") == 1, ]
}))
which gives:
# project activity
# A kickoff
# B kickoff
# B kickoff
# C kickoff
# C delivery
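To see why the cummin() filter works in both versions, it may help to look at a single project in isolation. The sketch below is not part of the original answer. It uses a hypothetical vector act holding project B's activities and shows how the running minimum switches off every row from the first "funding" onwards:

```r
# Hypothetical example vector: the activities of project B from the question
act <- c("kickoff", "kickoff", "funding", "kickoff")

# TRUE for every row that is not "funding"
not_funding <- act != "funding"

# cummin() coerces the logicals to 0/1 and returns the running minimum,
# so it drops to 0 at the first "funding" and stays 0 for all later rows
keep <- cummin(not_funding) == 1

act[keep]
```

Applied per project (via group_by() or split()), the same mask keeps only the rows before the first funding activity. A project with no funding row keeps all of its rows.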
I hope this helps.
Answer 1 (score: 2)
For the sake of completeness, here is also a data.table solution:
library(data.table)
setDT(df)[!df[, .I[.I >= first(.I[activity == 'funding'])], by = project]$V1]
   project activity
1:       A  kickoff
2:       B  kickoff
3:       B  kickoff
4:       C  kickoff
5:       C delivery
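Within each project group, we look up the row index of the first occurrence of "funding" in the activity column, together with the indices of all subsequent rows: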
df[, .I[.I >= first(.I[activity == 'funding'])], by = project]
   project V1
1:       A  2
2:       A  3
3:       B  6
4:       B  7
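In data.table, .I is a special symbol that holds the row positions in df. The second subsetting, .I[.I >= first(.I[activity == 'funding'])], is required because which(.I >= first(.I[activity == 'funding'])) would only return the row positions within the group, not the row positions within df.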
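The difference between the two forms can be checked side by side. The following is a minimal sketch, assuming the sample data from the question. The names dt, within_group and within_table are introduced here for illustration only:

```r
library(data.table)

# Sample data from the question, built directly as a data.table
dt <- data.table(
  project  = c('A', 'A', 'A', 'B', 'B', 'B', 'B', 'C', 'C'),
  activity = c('kickoff', 'funding', 'delivery', 'kickoff', 'kickoff',
               'funding', 'kickoff', 'kickoff', 'delivery')
)

# which() returns positions counted from the start of each group
within_group <- dt[, which(.I >= first(.I[activity == 'funding'])),
                   by = project]$V1

# subsetting .I returns positions counted from the start of the whole table
within_table <- dt[, .I[.I >= first(.I[activity == 'funding'])],
                   by = project]$V1

within_group  # 2 3 3 4
within_table  # 2 3 6 7
```

Only the second vector can be used to index the full table, which is why the answer subsets .I rather than calling which().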
Now we have identified the rows that should not be shown. So we get the final result by excluding these row numbers:
df[!df[, .I[.I >= first(.I[activity == 'funding'])], by = project]$V1]
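If date information is available (and I bet there is a date column when dealing with projects and activities), we can do an anti non-equi join on the date column, as suggested by @Frank:
# create sample data with date column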
project<- c('A', 'A', 'A', 'B', 'B', 'B','B', 'C', 'C')
activity<- c('kickoff','funding', 'delivery', 'kickoff','kickoff','funding','kickoff', 'kickoff','delivery')
date <- (as.Date ("2017-10-02") + c(1,4,7,2,5,8,11,3,6))
df <- data.frame(project,activity, date, stringsAsFactors = FALSE)
df <- df[order(df$date), ]
  project activity       date
1       A  kickoff 2017-10-03
4       B  kickoff 2017-10-04
8       C  kickoff 2017-10-05
2       A  funding 2017-10-06
5       B  kickoff 2017-10-07
9       C delivery 2017-10-08
3       A delivery 2017-10-09
6       B  funding 2017-10-10
7       B  kickoff 2017-10-13
# anti non-equi join
setDT(df)[!df[activity == 'funding', first(date), by = project], on = .(project, date >= V1)]
   project activity       date
1:       A  kickoff 2017-10-03
2:       B  kickoff 2017-10-04
3:       B  kickoff 2017-10-07
4:       C  kickoff 2017-10-05
5:       C delivery 2017-10-08
Answer 2 (score: 2)
Some other alternatives with the data.table package:
1) Reduce:
library(data.table)
setDT(df)[df[, .I[!Reduce('+', activity == 'funding', accumulate = TRUE)], project]$V1]
2) cummax:
library(data.table)
setDT(df)[df[, .I[!cummax(activity == 'funding')], project]$V1]
3) pmax:
library(data.table)
setDT(df)[!df[, pmax(.I, .I[activity == 'funding']), by = project]$V1]
Answer 3 (score: 0)
You can try cumsum to keep track of whether each row of a project comes before or after its first funding activity:
library(dplyr)
df %>%
group_by(project) %>%
mutate(before.funding = cumsum(activity == "funding") == 0) %>%
ungroup() %>%
filter(before.funding) %>%
select(-before.funding)
# A tibble: 5 x 2
project activity
<fctr> <fctr>
1 A kickoff
2 B kickoff
3 B kickoff
4 C kickoff
5 C delivery
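The same running-count idea can also be written without dplyr. The following is a rough base R sketch, not taken from the answer above. It assumes the sample data from the question, and the names n_funding and res are made up for illustration:

```r
# Sample data from the question
df <- data.frame(
  project  = c('A', 'A', 'A', 'B', 'B', 'B', 'B', 'C', 'C'),
  activity = c('kickoff', 'funding', 'delivery', 'kickoff', 'kickoff',
               'funding', 'kickoff', 'kickoff', 'delivery'),
  stringsAsFactors = FALSE
)

# Running count of "funding" rows within each project;
# the count is still 0 for every row before the first funding activity
n_funding <- ave(as.integer(df$activity == "funding"), df$project, FUN = cumsum)

# Keep only the rows where no funding has been seen yet
res <- df[n_funding == 0, ]
res
```

Here ave() plays the role of group_by() plus mutate(): it applies cumsum within each project and returns the counts in the original row order.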