Consider the following data frame (ordered by id and time):
df <- data.frame(id = c(rep(1,7),rep(2,5)), event = c("a","b","b","b","a","b","a","a","a","b","a","a"), time = c(1,3,6,12,24,30,32,1,2,6,17,24))
df
   id event time
1   1     a    1
2   1     b    3
3   1     b    6
4   1     b   12
5   1     a   24
6   1     b   30
7   1     a   42
8   2     a    1
9   2     a    2
10  2     b    6
11  2     a   17
12  2     a   24
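I want to count how many times a given sequence of events occurs within each "id" group. Consider the following sequence with time constraints: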
seq <- c("a", "b", "a")
time_LB <- c(0, 2, 12)
time_UB <- c(Inf, 8, 18)
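This means that event "a" can start at any time, event "b" must start no earlier than 2 and no later than 8 after event "a", and another event "a" must start no earlier than 12 and no later than 18 after event "b". Some rules for building sequences:
seq can be built from rows 1, 3 and 5.
If seq = rows 8, 10 and 11, then seq = rows 8, 10 and 12 must not be counted.
Expected result: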
df1
  id count
1  1     2
2  2     2
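To make the time constraints concrete, here is a minimal sketch that checks whether a given set of row numbers forms a valid sequence. The helper name is_valid is hypothetical, the time column uses the values shown in the printed data frame, and the sketch assumes the first pair of bounds applies to the start time itself while the later bounds apply to the gap from the previous event.

```r
df <- data.frame(id = c(rep(1, 7), rep(2, 5)),
                 event = c("a","b","b","b","a","b","a","a","a","b","a","a"),
                 time = c(1, 3, 6, 12, 24, 30, 42, 1, 2, 6, 17, 24))
seq <- c("a", "b", "a")
time_LB <- c(0, 2, 12)
time_UB <- c(Inf, 8, 18)

# Hypothetical helper: TRUE if the given rows (in order) match seq,
# belong to one id, and satisfy every time bound.
is_valid <- function(rows) {
  ev <- as.character(df$event[rows])
  tm <- df$time[rows]
  # Assumption: bound 1 constrains the start time, bounds 2..n constrain the gaps.
  gaps <- c(tm[1], diff(tm))
  all(ev == seq) &&
    length(unique(df$id[rows])) == 1 &&
    all(gaps >= time_LB & gaps <= time_UB)
}

is_valid(c(1, 3, 5))  # a@1, b@6, a@24: gaps 5 and 18, so TRUE
is_valid(c(1, 2, 5))  # a@1, b@3, a@24: last gap is 21 > 18, so FALSE
```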
There are some related questions: "R - Identify a sequence of row elements by groups in a dataframe" and "Finding rows in R dataframe where a column value follows a sequence".
Is there a way to solve this problem using "dplyr"?
Answer 0 (score: 3)
I believe this is what you are looking for. It gives you your desired output. Note that the original question contains a typo where the time column of df is defined: it has 32 where it should have 42. I say this is a typo because it does not match the output printed immediately below the definition of df. I changed the 32 to 42 in the code below.
library(dplyr)
df <- data.frame(id = c(rep(1,7),rep(2,5)), event = c("a","b","b","b","a","b","a","a","a","b","a","a"), time = c(1,3,6,12,24,30,42,1,2,6,17,24))
seq <- c("a", "b", "a")
time_LB <- c(0, 2, 12)
time_UB <- c(Inf, 8, 18)
df %>%
  full_join(df,by='id',suffix=c('1','2')) %>%
  full_join(df,by='id') %>%
  rename(event3 = event, time3 = time) %>%
  filter(event1 == seq[1] & event2 == seq[2] & event3 == seq[3]) %>%
  filter(time1 %>% between(time_LB[1],time_UB[1])) %>%
  filter((time2-time1) %>% between(time_LB[2],time_UB[2])) %>%
  filter((time3-time2) %>% between(time_LB[3],time_UB[3])) %>%
  group_by(id,time1) %>%
  slice(1) %>% # slice 1 row for each unique id and time1 (so no duplicate time1s)
  group_by(id) %>%
  count()
Here is the output:
# A tibble: 2 x 2
     id     n
  <dbl> <int>
1     1     2
2     2     2
Also, if you omit the last two parts of the dplyr pipe that do the counting (to see which sequences it matches), you get the following sequences:
Source: local data frame [4 x 7]
Groups: id, time1 [4]
     id event1 time1 event2 time2 event3 time3
  <dbl> <fctr> <dbl> <fctr> <dbl> <fctr> <dbl>
1     1      a     1      b     6      a    24
2     1      a    24      b    30      a    42
3     2      a     1      b     6      a    24
4     2      a     2      b     6      a    24
Edit in response to the comment about generalizing: yes, this can be generalized to sequences of arbitrary length, but it requires some R voodoo. Most notably, note the use of Reduce, which lets you apply a common function across a list of objects, and of foreach, which I borrow from the foreach package to do some arbitrary looping. Here is the code:
library(dplyr)
library(foreach)
df <- data.frame(id = c(rep(1,7),rep(2,5)), event = c("a","b","b","b","a","b","a","a","a","b","a","a"), time = c(1,3,6,12,24,30,42,1,2,6,17,24))
seq <- c("a", "b", "a")
time_LB <- c(0, 2, 12)
time_UB <- c(Inf, 8, 18)
# join df with itself once per position in seq
multi_full_join = function(df1,df2) {full_join(df1,df2,by='id')}
df_list = foreach(i=1:length(seq)) %do% {df}
df2 = Reduce(multi_full_join,df_list)
names(df2)[grep('event',names(df2))] = paste0('event',seq_along(seq))
names(df2)[grep('time',names(df2))] = paste0('time',seq_along(seq))
df2 = df2 %>% mutate_if(is.factor,as.character)
# keep only the rows whose events spell out seq
df2 = df2 %>%
  mutate(seq_string = Reduce(paste0,df2 %>% select(grep('event',names(df2))) %>% as.list)) %>%
  filter(seq_string == paste0(seq,collapse=''))
# gaps between consecutive times: one column per gap, so length(seq)-1 columns (not a hard-coded 2)
time_diff = df2 %>% select(grep('time',names(df2))) %>%
  t %>%
  as.data.frame() %>%
  lapply(diff) %>%
  unlist %>% matrix(ncol=length(seq)-1,byrow=TRUE) %>%
  as.data.frame
foreach(i=seq_along(time_diff),.combine=data.frame) %do%
  {
    time_diff[[i]] %>% between(time_LB[i+1],time_UB[i+1])
  } %>%
  Reduce(`&`,.) %>%
  which %>%
  slice(df2,.) %>%
  filter(time1 %>% between(time_LB[1],time_UB[1])) %>% # deal with time1 bounds, which we skipped over earlier
  group_by(id,time1) %>%
  slice(1) # slice 1 row for each unique id and time1 (so no duplicate time1s)
This outputs the following:
Source: local data frame [4 x 8]
Groups: id, time1 [4]
     id event1 time1 event2 time2 event3 time3 seq_string
  <dbl>  <chr> <dbl>  <chr> <dbl>  <chr> <dbl>      <chr>
1     1      a     1      b     6      a    24        aba
2     1      a    24      b    30      a    42        aba
3     2      a     1      b     6      a    24        aba
4     2      a     2      b     6      a    24        aba
If you only want the counts, you can group_by(id) and then count(), as in the original code snippet.
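As a rough alternative that avoids the full self-join, the same idea can be sketched in base R by extending partial matches one position at a time. This is not the answer's code: the helper name count_sequences is hypothetical, and the sketch assumes, like the slice(1) step in the dplyr version, that a match is counted once per distinct id and starting time.

```r
df <- data.frame(id = c(rep(1, 7), rep(2, 5)),
                 event = c("a","b","b","b","a","b","a","a","a","b","a","a"),
                 time = c(1, 3, 6, 12, 24, 30, 42, 1, 2, 6, 17, 24))
seq <- c("a", "b", "a")
time_LB <- c(0, 2, 12)
time_UB <- c(Inf, 8, 18)

# Hypothetical helper: per id, count the distinct starting events from which
# the whole sequence can be completed within the time bounds.
count_sequences <- function(df, seq, time_LB, time_UB) {
  # Position 1: matching events whose own time lies within the first bounds.
  first <- df[df$event == seq[1] &
              df$time >= time_LB[1] & df$time <= time_UB[1], ]
  cand <- data.frame(id = first$id, start = first$time, last = first$time)
  for (i in seq_along(seq)[-1]) {
    nxt <- df[df$event == seq[i], c("id", "time")]
    cand <- merge(cand, nxt, by = "id")    # pair partial matches with candidates
    gap <- cand$time - cand$last           # time since the previous event
    cand <- cand[gap >= time_LB[i] & gap <= time_UB[i], ]
    cand <- unique(data.frame(id = cand$id, start = cand$start, last = cand$time))
  }
  starts <- unique(cand[, c("id", "start")])
  ids <- sort(unique(df$id))
  data.frame(id = ids,
             count = as.integer(table(factor(starts$id, levels = ids))))
}

res <- count_sequences(df, seq, time_LB, time_UB)
print(res)  # count is 2 for id 1 and 2 for id 2
```

Because only partial matches that already satisfy the earlier bounds are carried forward, the intermediate tables stay smaller than a full join of df with itself once per position, and ids without any match are reported with a count of 0.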
Answer 1 (score: 2)
Maybe it is easier to represent the event sequence as a string and use a regular expression:
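# Build one string per id: position t holds the event at time t, '-' elsewhere
# (this relies on integer times and single-character event codes)
df.str = lapply(split(df, df$id), function(d) {
  z = rep('-', tail(d,1)$time); z[d$time] = as.character(d$event); z })
df.str = lapply(df.str, paste, collapse='')
# > df.str
# $`1`
# [1] "a-b--b-----b-----------a-----b-----------a"
#
# $`2`
# [1] "aa---b----------a------a"
# The lookahead counts every start position, so overlapping matches are found;
# sum(m > 0) gives 0 when nothing matches (gregexpr then returns -1)
df1 = lapply(df.str, function(s) {
  m = gregexpr('(?=a.{1,7}b.{11,17}a)', s, perl=TRUE)[[1]]
  sum(m > 0) })
data.frame(id=names(df1), count=unlist(df1))
#   id count
# 1  1     2
# 2  2     2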
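The pattern above is written by hand for this particular sequence. As a sketch of how it could be derived from seq, time_LB and time_UB, the hypothetical helper build_pattern below turns each gap bound into a quantifier. It assumes integer times, single-character event codes that are not regex metacharacters, gap lower bounds of at least 1, and finite gap upper bounds; the first pair of bounds (0 and Inf here) is ignored.

```r
# Hypothetical helper: build the lookahead pattern from the sequence and bounds.
build_pattern <- function(seq, time_LB, time_UB) {
  pattern <- seq[1]
  for (i in seq_along(seq)[-1]) {
    # A gap of d time units leaves d - 1 characters between the two events.
    gap <- sprintf(".{%d,%d}", time_LB[i] - 1, time_UB[i] - 1)
    pattern <- paste0(pattern, gap, seq[i])
  }
  paste0("(?=", pattern, ")")
}

# Count start positions; returns 0 when gregexpr reports no match (-1).
count_matches <- function(s, pattern) {
  m <- gregexpr(pattern, s, perl = TRUE)[[1]]
  sum(m > 0)
}

seq <- c("a", "b", "a")
time_LB <- c(0, 2, 12)
time_UB <- c(Inf, 8, 18)

pattern <- build_pattern(seq, time_LB, time_UB)
print(pattern)  # "(?=a.{1,7}b.{11,17}a)"
print(count_matches("aa---b----------a------a", pattern))
```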