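I have searched all over and cannot figure out how to solve this.
I have a dataset of subjects, and I want to subset all rows that come after an event recorded in another column. Here is an example of the dataset: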
subject <- letters[rep(seq(from = 1, to = 5), each = 10)]
value1 <- rnorm(n = length(subject), mean = 20, sd = 5)
value2 <- rnorm(n = length(subject), mean = 30, sd = 10)
tag <- rep(NA_character_, times = length(subject))  # rep() has no `n` argument; use `times`
df <- data.frame(subject, value1, value2, tag)
# add random events
df[6,4] <- "event"
df[16,4] <- "event"
df[24,4] <- "event"
df[39,4] <- "event"
df[43,4] <- "event"
head(df, 20)
subject value1 value2 tag
1 a 29.48322 28.50112 <NA>
2 a 26.83034 32.61494 <NA>
3 a 19.03148 38.66233 <NA>
4 a 19.97549 36.09613 <NA>
5 a 22.04944 26.80911 <NA>
6 a 16.67589 37.07147 event
7 a 14.25538 32.94055 <NA>
8 a 18.29705 24.17948 <NA>
9 a 14.26047 23.94956 <NA>
10 a 23.91977 39.76018 <NA>
11 b 20.64587 38.93593 <NA>
12 b 20.72713 14.29013 <NA>
13 b 17.55487 27.63619 <NA>
14 b 14.18344 40.30682 <NA>
15 b 11.47055 22.01550 <NA>
16 b 24.60832 38.49901 event
17 b 15.10552 32.08878 <NA>
18 b 23.21466 28.17392 <NA>
19 b 20.59442 34.18078 <NA>
20 b 21.19128 33.50000 <NA>
Is there a way to subset, by subject, the "event" row and all rows after the "event"?
Answer 0 (score: 3)
Depending on what you want to do after subsetting, this might work:
library(tidyverse)
df %>%
group_by(subject) %>%
mutate(event_grp = cumsum(!is.na(tag))) %>%
group_by(subject, event_grp) %>%
summarise(
avg_val1 = mean(value1),
avg_val2 = mean(value2)
)
# subject event_grp avg_val1 avg_val2
# <fct> <int> <dbl> <dbl>
# 1 a 0 22.7 38.6
# 2 a 1 20.5 30.5
# 3 b 0 21.1 25.0
# 4 b 1 21.4 21.2
# 5 c 0 19.5 35.8
# 6 c 1 18.6 23.9
# 7 d 0 18.7 31.1
# 8 d 1 19.4 42.0
# 9 e 0 18.5 25.7
# 10 e 1 20.7 30.2
For just the subset, you only need:
df %>%
group_by(subject) %>%
mutate(event_grp = cumsum(!is.na(tag))) %>%
filter(event_grp >= 1)
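For readers without the tidyverse installed, the same running-count idea can be sketched in base R. This is only an illustration under assumptions: the toy data frame below is made up to mirror the question's layout, and ave() stands in for group_by() plus mutate().

```r
# Toy data mirroring the question's layout (values are made up for illustration)
df <- data.frame(
  subject = rep(c("a", "b"), each = 5),
  value1  = 1:10,
  tag     = NA_character_
)
df$tag[c(3, 9)] <- "event"

# Running count of events within each subject; 0 means "before the first event"
event_grp <- ave(as.integer(!is.na(df$tag)), df$subject, FUN = cumsum)

# Keep the event row and everything after it, per subject
after <- df[event_grp >= 1, ]
after
```

Because the count restarts for every subject, rows of a subject with no event at all are dropped entirely.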
Answer 1 (score: 1)
Yes, here is a simple solution in base R:
indx <- unlist(lapply(which(df$tag == "event"), "+", 0:1))
df[indx, ]
# subject value1 value2 tag
#6 a 25.996706 15.65917 event
#7 a 20.336984 35.03734 <NA>
#16 b 9.825914 25.34336 event
#17 b 24.344257 30.15755 <NA>
#24 c 18.586266 33.82119 event
#25 c 25.879272 52.43784 <NA>
#39 d 24.366653 25.03767 event
#40 d 19.870183 36.61909 <NA>
#43 e 23.706029 43.46765 event
#44 e 15.091674 29.45431 <NA>
Here which() returns the row indices of all "event" rows, and lapply() adds the vector 0:1 (i.e. 0 and 1) to each of those indices, which yields each "event" row and the row after it.
There are several other ways to get the same indices:
# Alternative 1
indx <- apply(expand.grid(which(df$tag == "event"), 0:1), 1, sum)
# Alternative 2
eindx <- which(df$tag == "event")
indx <- c(eindx, eindx + 1)
These indices come out in a different order, but you can always use sort() to put them back in row order.
To solve it per subject, you can check whether the added row belongs to the same subject and exclude it if it does not:
eindx <- which(df$tag == "event")
# event rows whose following row belongs to a different subject
not_eq <- eindx[which(df$subject[eindx] != df$subject[eindx + 1])]
indx <- sort(c(eindx, setdiff(eindx, not_eq) + 1))
df[indx, ]
Or you can wrap these approaches in a function and apply it per subject with the by or split functions:
get_event <- function(f) {
  eindx <- which(f$tag == "event")
  indx <- sort(c(eindx, eindx + 1))
  f[indx, ]
}
res <- do.call(rbind, by(df, df$subject, get_event))
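The answer mentions split as an alternative to by but only the by call is shown. A minimal sketch of what a split-based version might look like follows. The helper name get_event_safe and the toy data are hypothetical, and the extra bounds check is an addition not present in the answer: it drops the "row after" when the event is the last row of a subject, which would otherwise produce a row of NAs.

```r
# Toy data (made up): the second event sits on the last row of subject "b"
df <- data.frame(
  subject = rep(c("a", "b"), each = 4),
  value1  = 1:8,
  tag     = NA_character_
)
df$tag[c(2, 8)] <- "event"

# Hypothetical variant of get_event with a guard against running past the group
get_event_safe <- function(f) {
  eindx <- which(f$tag == "event")
  indx <- sort(unique(c(eindx, eindx + 1)))
  f[indx[indx <= nrow(f)], ]  # discard indices beyond the last row of this subject
}

# split() yields one data frame per subject; lapply() applies the helper to each
res <- do.call(rbind, lapply(split(df, df$subject), get_event_safe))
res
```

Since each piece contains only one subject, the "row after" can never spill into the next subject, so no cross-subject check is needed here.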