I have a data.frame like this:
df=data.frame(
grp=c("group1","s1","s2","s3","s4","s5","group2","s6","s7","s8","group2","s9","s10","group3","s11","s12","s13","s14"),
gname=c("gene1",0.00,0.05,0.01,0.01,0.01,"gene1",0.063,0.005,0.015,"gene2",0.07,0.00,"gene3",0.046,0.007,0.011,0.012),
score=c(0.989003844,NA,NA,NA,NA,NA,0.988334014,NA,NA,NA,0.983461712,NA,NA,0.982339339,NA,NA,NA,NA)
)
> df
grp gname score
1 group1 gene1 0.9890038
2 s1 0 NA
3 s2 0.05 NA
4 s3 0.01 NA
5 s4 0.01 NA
6 s5 0.01 NA
7 group2 gene1 0.9883340
8 s6 0.063 NA
9 s7 0.005 NA
10 s8 0.015 NA
11 group2 gene2 0.9834617
12 s9 0.07 NA
13 s10 0 NA
14 group3 gene3 0.9823393
15 s11 0.046 NA
16 s12 0.007 NA
17 s13 0.011 NA
18 s14 0.012 NA
Based on group and gene name, df is divided into 4 sections. (The original post showed these 4 sections in a figure.)
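I want to summarize each section of df by finding the max of df$score and the length of df$grp within that section. The expected result has one row per section.
How can I compute max and length for each section and save the results in a data.frame?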
Answer 0 (score: 2)
If you know that each group starts with a non-missing score followed by missing values, you can combine cumsum/is.na with tapply.
First, create an aggregation variable f.
f <- cumsum(!is.na(df$score))
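To see why this gives one id per section, here is a small sketch on a toy vector (the values in score_toy are made up for illustration and are not from the question). Each non-missing value bumps the running count, so every following NA row inherits the id of the most recent header row.

```r
# Toy vector (hypothetical): two header rows (non-NA), each followed by NA rows
score_toy <- c(0.9, NA, NA, 0.8, NA)

# !is.na() is TRUE only on header rows; cumsum turns that into a running section id
f_toy <- cumsum(!is.na(score_toy))
print(f_toy)
# [1] 1 1 1 2 2
```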
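Now look at the resulting lengths. In the output, the top row of numbers holds the values of the "names" attribute and the bottom row holds the lengths. These lengths include the "group*" row, so subtract 1 when building the final data frame.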
tapply(f, f, length)
#1 2 3 4
#6 4 3 5
Create the result the question asks for.
result <- cbind(df[!is.na(df$score), ], length = tapply(f, f, length) - 1)
result
# grp gname score length
#1 group1 gene1 0.9890038 5
#7 group2 gene1 0.9883340 3
#11 group2 gene2 0.9834617 2
#14 group3 gene3 0.9823393 4
If you also want consecutive row names:
row.names(result) <- NULL
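The approach above takes the score from each header row, which equals the section max here because every section has exactly one non-missing score. As a rough sketch of computing the max explicitly, the variant below assumes instead that a section starts at every row whose grp begins with "group" (an assumption based on the naming in the question's data). The names section and summary_df are illustrative only.

```r
# grp and score columns copied from the question's df
grp <- c("group1", "s1", "s2", "s3", "s4", "s5", "group2", "s6", "s7", "s8",
         "group2", "s9", "s10", "group3", "s11", "s12", "s13", "s14")
score <- c(0.989003844, NA, NA, NA, NA, NA, 0.988334014, NA, NA, NA,
           0.983461712, NA, NA, 0.982339339, NA, NA, NA, NA)

# Assumption: every row whose grp starts with "group" opens a new section
section <- cumsum(grepl("^group", grp))

summary_df <- data.frame(
  # first (header) row label of each section
  grp = grp[!duplicated(section)],
  # explicit per-section maximum, ignoring NA rows
  score = as.vector(tapply(score, section, max, na.rm = TRUE)),
  # number of sample rows, header row excluded
  length = as.vector(tapply(grp, section, length)) - 1
)
print(summary_df)
```

With this data the result matches the one above. Note that max(..., na.rm = TRUE) returns -Inf with a warning for a section that has no non-missing score at all.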
Answer 1 (score: 2)
With tidyverse:
library(dplyr)
df %>%
group_by(grp1 = cumsum(grepl("group", grp))) %>%
mutate(length = n() -1) %>%
slice(1) %>%
ungroup %>%
select(-grp1)
# A tibble: 4 x 4
# grp gname score length
# <fct> <fct> <dbl> <dbl>
#1 group1 gene1 0.989 5
#2 group2 gene1 0.988 3
#3 group2 gene2 0.983 2
#4 group3 gene3 0.982 4