Given this data.frame:
set.seed(4)
df <- data.frame(x = rep(1:5, each = 2), y = sample(50:100, 10, T))
# x y
# 1 1 78
# 2 1 53
# 3 2 93
# 4 2 96
# 5 3 61
# 6 3 82
# 7 4 53
# 8 4 76
# 9 5 91
# 10 5 99
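I'd like to write some simple functions (i.e. feature engineering) that create features for x, and then join each of the resulting data.frames together. For example: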
library(dplyr)
count_x <- function(df) df %>% group_by(x) %>% summarise(count_x = n())
sum_y <- function(df) df %>% group_by(x) %>% summarise(sum_y = sum(y))
mean_y <- function(df) df %>% group_by(x) %>% summarise(mean_y = mean(y))
# and many more...
This can be done with plyr::join_all, but I'm wondering whether dplyr or data.table offers a better (or more performant) way?
df_with_features <- plyr::join_all(list(count_x(df), sum_y(df), mean_y(df)),
                                   by = 'x', type = 'full')
# > df_with_features
# x count_x sum_y mean_y
# 1 1 2 131 65.5
# 2 2 2 189 94.5
# 3 3 2 143 71.5
# 4 4 2 129 64.5
# 5 5 2 190 95.0
Answer 0 (score: 4)
Combining @SimonOHanlon's data.table approach with @Jaap's Reduce and merge technique gives the most performant result:
library(data.table)
setDT(df)
count_x_dt <- function(dt) dt[, list(count_x = .N), keyby = x]
sum_y_dt <- function(dt) dt[, list(sum_y = sum(y)), keyby = x]
mean_y_dt <- function(dt) dt[, list(mean_y = mean(y)), keyby = x]
Reduce(function(...) merge(..., all = TRUE, by = c("x")),
       list(count_x_dt(df), sum_y_dt(df), mean_y_dt(df)))
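The same pattern extends to any number of feature functions. Here is a minimal sketch, assuming every feature function returns a data.table with an x column. The helper name build_features and the argument feature_fns are hypothetical and not part of data.table, and the y values are hard-coded from the question's printout so the result is reproducible:

```r
library(data.table)

# Same shape as the question's data, with fixed y values
dt <- data.table(x = rep(1:5, each = 2),
                 y = c(78, 53, 93, 96, 61, 82, 53, 76, 91, 99))

count_x_dt <- function(dt) dt[, list(count_x = .N), keyby = x]
sum_y_dt   <- function(dt) dt[, list(sum_y = sum(y)), keyby = x]
mean_y_dt  <- function(dt) dt[, list(mean_y = mean(y)), keyby = x]

# Hypothetical helper: apply each feature function to dt, then fold the
# results together with a full outer merge on x
build_features <- function(dt, feature_fns) {
  pieces <- lapply(feature_fns, function(f) f(dt))
  Reduce(function(a, b) merge(a, b, by = "x", all = TRUE), pieces)
}

features <- build_features(dt, list(count_x_dt, sum_y_dt, mean_y_dt))
print(features)
```

Adding a feature then only means appending one more function to the list passed to build_features.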
Updated to include a tidyverse / purrr (purrr::reduce) approach:
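library(tidyverse)
# count_x, sum_y and mean_y are the dplyr functions from the question
list(count_x(df), sum_y(df), mean_y(df)) %>%
  reduce(left_join, by = "x")  # explicit key avoids the "Joining with `by`" message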
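If some feature tables might lack certain values of x, a full join is closer to the type = 'full' argument used with plyr::join_all in the question. A minimal sketch, assuming the dplyr feature functions from the question; the name feature_fns is hypothetical and the y values are hard-coded from the question's printout:

```r
library(dplyr)
library(purrr)

df <- data.frame(x = rep(1:5, each = 2),
                 y = c(78, 53, 93, 96, 61, 82, 53, 76, 91, 99))

count_x <- function(df) df %>% group_by(x) %>% summarise(count_x = n())
sum_y   <- function(df) df %>% group_by(x) %>% summarise(sum_y = sum(y))
mean_y  <- function(df) df %>% group_by(x) %>% summarise(mean_y = mean(y))

# Hypothetical list of feature functions to apply
feature_fns <- list(count_x, sum_y, mean_y)

df_with_features <- feature_fns %>%
  map(function(f) f(df)) %>%                           # one summary table per function
  reduce(function(a, b) full_join(a, b, by = "x"))     # fold with a full join on x

print(df_with_features)
```

With this data every table contains the same five x values, so left_join and full_join return the same rows; the two only differ when a key is missing from one of the tables.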
Answer 1 (score: 2)
In data.table parlance, this is equivalent to having sorted, keyed data.tables and using that key to join them.
e.g.
library(data.table)
setDT(df)  # df is now a data.table (converted by reference)
# keyby groups by x and also sets x as the key of each result
df_count <- df[, list(count_x = .N), keyby = x]
df_sum <- df[, list(sum_y = sum(y)), keyby = x]
# merge.data.table joins on the shared key by default
merge(df_count, df_sum)
# x count_x sum_y
#1: 1 2 129
#2: 2 2 128
#3: 3 2 154
#4: 4 2 182
#5: 5 2 151
In your example, you might write something like this:
count_x <- function(dt) dt[ , list(N = .N) , keyby=x ]
sum_y <- function(dt) dt[ , list(Sum=sum(y)),keyby=x]
# Then merge...
merge(sum_y(df),count_x(df))
# x Sum N
#1: 1 129 2
#2: 2 128 2
#3: 3 154 2
#4: 4 182 2
#5: 5 151 2
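To make the role of the key concrete, here is a minimal sketch of the same join written with data.table's X[Y] syntax instead of merge. It assumes both tables are keyed on x via keyby; the y values are hard-coded from the question's printout, so the sums differ from the output shown above:

```r
library(data.table)

dt <- data.table(x = rep(1:5, each = 2),
                 y = c(78, 53, 93, 96, 61, 82, 53, 76, 91, 99))

# keyby sorts each result by x and marks x as its key
df_count <- dt[, list(count_x = .N), keyby = x]
df_sum   <- dt[, list(sum_y = sum(y)), keyby = x]

key(df_count)  # inspect the key that keyby set

# X[Y]: look up the rows of df_count that match df_sum on the key
joined <- df_count[df_sum]
print(joined)
```

For tables without a key, the join column can be named explicitly instead, as in df_count[df_sum, on = "x"].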