I'm trying to analyse some results I obtained from an experiment. I have a large data frame formatted the following way:
Name V1 V2 V3
X1247 1 9 0
X1247.1 1 7 0
X1247.10 2 2 4
X1247.100 8 3 7
X874 3 7 1
X874.1 4 0 1
X66 8 9 1
X66.1 0 1 8
I'd like to get a data frame that merges the rows sharing the same name before the ".", summing each column, so that it would look like this:
Name V1 V2 V3
X1247 12 21 11
X874 7 7 2
X66 8 10 9
How could this be done? This may be trivial, but I'm more of a biologist than a computer scientist. Thanks
Answer 0 (score: 2)
We group by the substring of 'Name' before the first "." (extracted with str_remove from stringr) and use summarise_all to sum the remaining columns:
library(tidyverse)
df1 %>%
  group_by(Name = str_remove(Name, "\\..*")) %>%
  summarise_all(sum)
# A tibble: 3 x 4
#  Name     V1    V2    V3
#  <chr> <int> <int> <int>
#1 X1247    12    21    11
#2 X66       8    10     9
#3 X874      7     7     2
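In current dplyr releases, summarise_all() is marked as superseded in favour of across(). A minimal sketch of the same grouping approach written that way, assuming dplyr 1.0.0 or later; the sample data are rebuilt inline and the result name res is only illustrative:

```r
library(dplyr)
library(stringr)

# Sample data from the question, rebuilt inline so the snippet runs on its own
df1 <- data.frame(
  Name = c("X1247", "X1247.1", "X1247.10", "X1247.100",
           "X874", "X874.1", "X66", "X66.1"),
  V1 = c(1L, 1L, 2L, 8L, 3L, 4L, 8L, 0L),
  V2 = c(9L, 7L, 2L, 3L, 7L, 0L, 9L, 1L),
  V3 = c(0L, 0L, 4L, 7L, 1L, 1L, 1L, 8L)
)

# Strip everything from the first "." onwards, group on the result,
# then sum every non-grouping column within each group
res <- df1 %>%
  group_by(Name = str_remove(Name, "\\..*")) %>%
  summarise(across(everything(), sum), .groups = "drop")

print(res)
```

As with summarise_all(), the groups come back sorted by Name rather than in their original order.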
Or using base R, with aggregate and sub:
aggregate(.~ Name, transform(df1, Name = sub("\\..*", "", Name)), FUN = sum)
Data:
df1 <- structure(list(Name = c("X1247", "X1247.1", "X1247.10", "X1247.100",
"X874", "X874.1", "X66", "X66.1"), V1 = c(1L, 1L, 2L, 8L, 3L,
4L, 8L, 0L), V2 = c(9L, 7L, 2L, 3L, 7L, 0L, 9L, 1L), V3 = c(0L,
0L, 4L, 7L, 1L, 1L, 1L, 8L)), class = "data.frame", row.names = c(NA,
-8L))
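Another base R option, not part of the answer above, is rowsum(), which sums the rows of a numeric data frame by a grouping vector. A minimal sketch, assuming every column other than Name is numeric; the names grp, sums and res are only illustrative:

```r
# Sample data from the question, rebuilt inline so the snippet runs on its own
df1 <- data.frame(
  Name = c("X1247", "X1247.1", "X1247.10", "X1247.100",
           "X874", "X874.1", "X66", "X66.1"),
  V1 = c(1L, 1L, 2L, 8L, 3L, 4L, 8L, 0L),
  V2 = c(9L, 7L, 2L, 3L, 7L, 0L, 9L, 1L),
  V3 = c(0L, 0L, 4L, 7L, 1L, 1L, 1L, 8L)
)

# Grouping key: the part of Name before the first "."
grp <- sub("\\..*", "", df1$Name)

# Sum the numeric columns by group; reorder = FALSE keeps the groups
# in order of first appearance instead of sorting them
sums <- rowsum(df1[-1], group = grp, reorder = FALSE)

# rowsum() stores the group labels as row names; move them into a column
res <- data.frame(Name = rownames(sums), sums, row.names = NULL)

print(res)
```

With reorder = FALSE the rows come out in the order asked for in the question (X1247, X874, X66).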
Answer 1 (score: 0)
A data.table solution:
data.table::setDT(df1)[,as.list(colSums(.SD)), by=sub("\\.[^.]*", "", Name),][]
#     sub V1 V2 V3
#1: X1247 12 21 11
#2:  X874  7  7  2
#3:   X66  8 10  9
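A common data.table idiom for per-group column sums is lapply(.SD, sum). A minimal sketch of that variant, assuming the value columns are V1 to V3 as in the question; it uses as.data.table() so the original data frame is left untouched, and names the by expression so the grouping column is called Name instead of sub:

```r
library(data.table)

# Sample data from the question, rebuilt inline so the snippet runs on its own
df1 <- data.frame(
  Name = c("X1247", "X1247.1", "X1247.10", "X1247.100",
           "X874", "X874.1", "X66", "X66.1"),
  V1 = c(1L, 1L, 2L, 8L, 3L, 4L, 8L, 0L),
  V2 = c(9L, 7L, 2L, 3L, 7L, 0L, 9L, 1L),
  V3 = c(0L, 0L, 4L, 7L, 1L, 1L, 1L, 8L)
)

# Work on a copy so df1 itself stays a plain data frame
dt <- as.data.table(df1)

# Group on the part of Name before the first "." and sum the value columns
res <- dt[, lapply(.SD, sum),
          by = .(Name = sub("\\..*", "", Name)),
          .SDcols = c("V1", "V2", "V3")]

print(res)
```

Grouping with by keeps the groups in order of first appearance, so the result follows the order shown in the question.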
Data: (borrowed from akrun)
Benchmark:
library(data.table)
library(dplyr)
library(stringr)

# Stack mtcars 1000 times and build a Name column from the row names plus a ".HAHA" suffix
df1 <- do.call(rbind, rep(list(mtcars), 1000))
df1 <- cbind(Name = paste0(rownames(df1), ".HAHA"), df1)

# data.table (note: setDT() converts df1 to a data.table by reference)
f1 <- function(df1) {
  setDT(df1)[, as.list(colSums(.SD)), by = sub("\\.[^.]*", "", Name), ][]
}

# dplyr + stringr
f2 <- function(df1) {
  df1 %>%
    group_by(Name = str_remove(Name, "\\..*")) %>%
    summarise_all(sum)
}

# base R
f3 <- function(df1) {
  aggregate(. ~ Name, transform(df1, Name = sub("\\..*", "", Name)), FUN = sum)
}

microbenchmark::microbenchmark(f1(df1), f2(df1), f3(df1), times = 5)
#Unit: milliseconds
#    expr      min        lq      mean    median        uq       max neval cld
# f1(df1) 2982.691 3012.7485 3098.9819 3102.7779 3175.1407 3221.5515     5   c
# f2(df1)  292.278  295.4829  299.7829  296.5712  301.3588  313.2236     5 a
# f3(df1) 2244.550 2254.7699 2318.5398 2256.9791 2274.2181 2562.1817     5  b
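Surprisingly, @akrun's dplyr solution is about 10 times faster than the data.table one here (median of roughly 297 ms versus 3103 ms), with base R aggregate in between at roughly 2257 ms. At least, it surprised me.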