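I have a data frame with three columns, similar to the one given below. I now want to extract information search patterns based on the values in column a.
With help from a few developers (@thelatemail and @David T), I was able to identify the patterns with the rle function; see "using rle function to identify pattern". Now I want to go one step further and add grouping information to the extracted patterns. I tried dplyr's do function (see the code below), but it does not work.
Sample data and the desired output are also provided for reference.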
##mycode that produces error - needs to be fixed
test <- data%>%
group_by(b, c)%>%
do(., data.frame(from = rle(.$a)$values), to = lead(rle(.$a)$values))
##code to create the data frame
a <- c( "a", "b", "b", "b", "a", "c", "a", "b", "d", "d", "d", "e", "f", "f", "e", "e")
b <- c(rep("experiment", times = 8), rep("control", times = 8))
c <- c(rep("A01", times = 4), rep("A02", times = 4), rep("A03", times = 4), rep("A04", times = 4))
data <- data.frame(c,b,a)
## desired output
  c   b            from  to    fromCount toCount
                   <chr> <chr>     <int>   <int>
1 A01 experimental a     b             1       3
2 A02 experimental a     c             1       1
3 A02 experimental c     a             1       1
4 A02 experimental a     b             1       1
5 A03 control      d     e             3       1
6 A04 control      f     e             2       2
Compared with the earlier post (linked here), the output is condensed because grouping is applied to column a.
Answer 0 (score: 4)
We can use rleid from data.table:
library(data.table)
library(dplyr)
data %>%
group_by(b, c, grp = rleid(a)) %>%
summarise(from = first(a), fromCount = n()) %>%
mutate(to = lead(from), toCount = lead(fromCount)) %>%
ungroup %>%
select(-grp) %>%
filter(!is.na(to)) %>%
arrange(c)
# A tibble: 6 x 6
# b c from fromCount to toCount
# <chr> <chr> <chr> <int> <chr> <int>
#1 experiment A01 a 1 b 3
#2 experiment A02 a 1 c 1
#3 experiment A02 c 1 a 1
#4 experiment A02 a 1 b 1
#5 control A03 d 3 e 1
#6 control A04 f 2 e 2
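To see why adding grp = rleid(a) to the grouping works, it may help to look at rleid on its own: it assigns one integer id per run of identical consecutive values, so a value that reappears later gets a new id. A minimal sketch on a short hypothetical vector (x is made up for illustration and is not from the question):

```r
library(data.table)

# Hypothetical input, chosen only to show how run ids are assigned
x <- c("a", "b", "b", "b", "a", "c")

# One id per run of identical consecutive values
ids <- rleid(x)
ids
# 1 2 2 2 3 4
```

Grouping by this id therefore yields one summary row per run, which is what first(a) and n() rely on in the pipeline above.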
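Or use rle: after grouping by 'b' and 'c', summarise with rle to create a list column, then extract the 'values' and 'lengths' components in summarise to create the 'from' and 'fromCount' columns, build 'to' and 'toCount' with lead of 'from' and 'fromCount', filter out the NA rows, and arrange by column 'c'.
data %>%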
group_by(b, c) %>%
summarise(rl = list(rle(a)),
from = rl[[1]]$values,
fromCount = rl[[1]]$lengths) %>%
mutate(to = lead(from),
toCount = lead(fromCount)) %>%
ungroup %>%
select(-rl) %>%
filter(!is.na(to)) %>%
arrange(c)
# A tibble: 6 x 6
# b c from fromCount to toCount
# <chr> <chr> <chr> <int> <chr> <int>
#1 experiment A01 a 1 b 3
#2 experiment A02 a 1 c 1
#3 experiment A02 c 1 a 1
#4 experiment A02 a 1 b 1
#5 control A03 d 3 e 1
#6 control A04 f 2 e 2
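We can also use map to loop over the rle list column ('rl'), extract the components, take the lead of the values and the lengths, build the columns in a tibble, use unnest_wider and unnest to flatten the list structure, and filter out the NA elements.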
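The map-based variant could look roughly like the sketch below. This is an assumption about its shape, not the answer's own code: it builds one tibble per group inside map and flattens it with unnest alone, and the names runs and res_map are hypothetical. The sample data is rebuilt with stringsAsFactors = FALSE so that rle receives a character vector on any R version.

```r
library(dplyr)
library(purrr)
library(tidyr)

# Sample data from the question
a <- c("a", "b", "b", "b", "a", "c", "a", "b", "d", "d", "d", "e", "f", "f", "e", "e")
b <- c(rep("experiment", times = 8), rep("control", times = 8))
c <- c(rep("A01", times = 4), rep("A02", times = 4), rep("A03", times = 4), rep("A04", times = 4))
data <- data.frame(c, b, a, stringsAsFactors = FALSE)

res_map <- data %>%
  group_by(b, c) %>%
  # one rle object per group, stored in a list column
  summarise(rl = list(rle(a)), .groups = "drop") %>%
  # turn each rle object into a tibble of runs and their successors
  mutate(runs = map(rl, ~ tibble(from = .x$values,
                                 fromCount = .x$lengths,
                                 to = lead(.x$values),
                                 toCount = lead(.x$lengths)))) %>%
  select(-rl) %>%
  unnest(runs) %>%
  # the last run of each group has no successor
  filter(!is.na(to)) %>%
  arrange(c)
```

Building the tibble inside map keeps lead confined to a single group's runs, so no transition is created across group boundaries.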
Answer 1 (score: 1)
Alternatively, in the tidyverse, create a function that does the rle work for a single subject's tracking vector, and make sure it behaves.
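library(tidyverse)
rleSlice <- function(Tracking) {
  # Strip the levels from the factor, they interfere
  rlTrack <- rle(as.character(Tracking))
  # Pair each run with the run that follows it
  tibble(from = rlTrack$values, to = lead(rlTrack$values),
         fromCount = rlTrack$lengths, toCount = lead(rlTrack$lengths)) %>%
    filter(!is.na(to)) %>%   # the last run has no successor
    list()                   # wrap in a list so it can be stored in a list column
}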
A quick check on a single vector:
rleSlice(c("a", "b", "b", "b", "c"))
[[1]]
# A tibble: 2 x 4
  from  to    fromCount toCount
  <chr> <chr>     <int>   <int>
1 a     b             1       3
2 b     c             3       1
Now we group and get the rle for each participant.
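The grouping step itself is not shown, so the following is only a sketch of one way it might be done, assuming the question's columns b and c identify a participant. Because rleSlice returns a one-element list, summarise stores it as a list column, which unnest then expands. The names patterns and res_slice are hypothetical.

```r
library(dplyr)
library(tidyr)

# Sample data from the question
a <- c("a", "b", "b", "b", "a", "c", "a", "b", "d", "d", "d", "e", "f", "f", "e", "e")
b <- c(rep("experiment", times = 8), rep("control", times = 8))
c <- c(rep("A01", times = 4), rep("A02", times = 4), rep("A03", times = 4), rep("A04", times = 4))
data <- data.frame(c, b, a, stringsAsFactors = FALSE)

# rleSlice as defined in this answer
rleSlice <- function(Tracking) {
  rlTrack <- rle(as.character(Tracking))
  tibble(from = rlTrack$values, to = lead(rlTrack$values),
         fromCount = rlTrack$lengths, toCount = lead(rlTrack$lengths)) %>%
    filter(!is.na(to)) %>%
    list()
}

res_slice <- data %>%
  group_by(c, b) %>%
  # one list-column entry (a tibble of transitions) per group
  summarise(patterns = rleSlice(a), .groups = "drop") %>%
  # expand each tibble into ordinary rows and columns
  unnest(patterns)
```

With the sample data this should give six rows with the columns c, b, from, to, fromCount, and toCount, in the same column order as the desired output.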