Question

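I have a data frame of size 75 million x 36 (75 million rows), with the columns

col1, col1_decile, col2, col2_decile, ..., col18, col18_decile

I want summary statistics (min, max, mean, and standard deviation) for each of col1, col2, ..., col18, grouped by its corresponding decile column.

That is, summary statistics of

col1 by col1_decile, col2 by col2_decile, col3 by col3_decile, ..., col18 by col18_decile

For a reproducible example, I will use the mtcars dataset: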

library(dplyr)
data("mtcars")
mtcars %>% mutate_all(funs(decile = ntile(., 10))) -> mtcars_deciled

head(mtcars_deciled)

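The columns here are

mpg, cyl, disp, hp, drat, wt, qsec, vs, am, gear, carb, mpg_decile, cyl_decile, disp_decile, hp_decile, drat_decile, wt_decile, qsec_decile, vs_decile, am_decile, gear_decile, carb_decile

I want the final data.frame to look like

decile mpg_decile_min mpg_decile_max mpg_decile_mean mpg_decile_sd ...

and so on for all columns.

Each min, max, mean, and standard deviation is computed by the corresponding decile column.

Since this is a huge dataset with 75 million rows, I am looking for a fast solution. I tinkered with seplyr in R but did not get far.

A fast solution with data.table, dplyr, or seplyr would be much appreciated. The final data.frame should have 10 rows and 73 columns: four summary statistic columns (min, max, mean, and sd) for each of the 18 columns, plus the common decile group column.

decile mpg_decile_min mpg_decile_max mpg_decile_mean mpg_decile_sd .... carb_decile_min carb_decile_max carb_decile_mean carb_decile_sd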

Answer 1

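This is just one possibility with data.table.

The problem is that the dataset's structure mixes two kinds of variables (deciles and measurements) on the same row. You have to reorganize it to make the aggregation easier.

The following example may be slow on a large dataset (grepl, gsub, ifelse, ...?) and could probably be optimized. It also makes several copies of the whole dataset. Maybe piping each command into the next one would be better? Suggestions welcome...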


Here is the code, step by step:

library(data.table)
library(dplyr)
data("mtcars")

# Your example in data.table format
DT <- as.data.table(mtcars %>% mutate_all(funs(decile = ntile(., 10))))

# Add an ID for each row
DT[,ID := 1:nrow(DT)]

# Transform the dataset in "long" format
tmp <- melt(DT, id.vars = "ID")

# Create a variable to make the distinction between the decile values and the 
# measurements. Maybe not optimal for speed ?
tmp[, decile := ifelse(grepl("_decile$", variable), "decile", "value")]

# Remove the "_decile" suffix
tmp[, variable := gsub("_decile$", "", variable)]

# Cross table to have for each observation, the type of variable, the decile and the value
tmp <- dcast(tmp, ID + variable ~ decile)

# Now it is quite straightforward to compute your summary statistics with data.table syntax
result <- tmp[, .(min = min(value), max = max(value), mean = mean(value), sd = sd(value)), 
    keyby = .(variable, decile)]

print(result, 10)
##      variable decile   min   max     mean         sd
##   1:       am      1 0.000 0.000 0.000000 0.00000000
##   2:       am      2 0.000 0.000 0.000000 0.00000000
##   3:       am      3 0.000 0.000 0.000000 0.00000000
##   4:       am      4 0.000 0.000 0.000000 0.00000000
##   5:       am      5 0.000 0.000 0.000000 0.00000000
##   6:       am      6 0.000 1.000 0.250000 0.50000000
##   7:       am      7 1.000 1.000 1.000000 0.00000000
##   8:       am      8 1.000 1.000 1.000000 0.00000000
##   9:       am      9 1.000 1.000 1.000000 0.00000000
##  10:       am     10 1.000 1.000 1.000000 0.00000000
##  ---                                                
## 101:       wt      1 1.513 1.935 1.724500 0.19428759
## 102:       wt      2 2.140 2.320 2.220000 0.09165151
## 103:       wt      3 2.465 2.770 2.618333 0.15250683
## 104:       wt      4 2.780 3.150 2.935000 0.19215879
## 105:       wt      5 3.170 3.215 3.191667 0.02254625
## 106:       wt      6 3.435 3.440 3.438750 0.00250000
## 107:       wt      7 3.460 3.570 3.516667 0.05507571
## 108:       wt      8 3.570 3.780 3.693333 0.10969655
## 109:       wt      9 3.840 4.070 3.918333 0.13137098
## 110:       wt     10 5.250 5.424 5.339667 0.08712252
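
The answer above worries about the string operations and the full-table copies made by melt and dcast. As a rough, unbenchmarked sketch of one alternative, each measurement column can be aggregated directly by its own decile column, so the 75-million-row table is never reshaped. The names summarise_one, measure_cols, decile_cols, long, and wide are made up for this illustration, and the code assumes every decile column is named as its parent column plus "_decile".

```r
library(data.table)

# Example data: every mtcars column plus its decile (dplyr is only needed for ntile)
DT <- as.data.table(mtcars)
measure_cols <- names(DT)
decile_cols <- paste0(measure_cols, "_decile")
DT[, (decile_cols) := lapply(.SD, dplyr::ntile, 10), .SDcols = measure_cols]

# Hypothetical helper: summarise one measurement column by its own decile column
summarise_one <- function(measure_col) {
  decile_col <- paste0(measure_col, "_decile")
  out <- DT[, {
    x <- .SD[[1L]]
    list(min = min(x), max = max(x), mean = mean(x), sd = sd(x))
  }, keyby = decile_col, .SDcols = measure_col]
  setnames(out, decile_col, "decile")
  out
}

# Long result: one row per (variable, decile), same layout as the output above
long <- rbindlist(lapply(setNames(measure_cols, measure_cols), summarise_one),
                  idcol = "variable")

# Wide result: one row per decile, columns named like min_mpg, max_mpg, ...
wide <- dcast(long, decile ~ variable, value.var = c("min", "max", "mean", "sd"))
dim(wide)
```

Only the small per-column summaries are stacked and reshaped here, so the intermediate objects stay tiny compared with the input. Whether this is actually faster on the real data would need to be measured.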

Answer 2

A tidyverse solution

Using mtcars_deciled as the data. Replace mtcars_deciled with yourdata in the following solution to apply it to your case. This assumes the _decile columns sit a fixed width away from their parent columns.

library(tidyverse)
numcol <- ncol(mtcars)
ans <- map2(seq_len(numcol), names(mtcars),
            ~ mtcars_deciled[, c(.x, .x + numcol)] %>%
              group_by_at(vars(dplyr::contains("decile"))) %>%
              summarise_at(vars(.y), funs(mean, sd, min, max)))

Note that dplyr::contains is necessary to disambiguate it from purrr::contains.

Output

This produces a list of data frames.
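
In current dplyr, funs() is deprecated and group_by_at() / summarise_at() are superseded by across(). The following is a minimal sketch of the same idea with across(), which also joins the per-column tables into the single 10-row table the question asks for. The names summaries and wide are made up for this illustration, and the code looks up each decile column by name (parent name plus "_decile") rather than by position.

```r
library(dplyr)
library(purrr)

# Example data: every mtcars column plus its decile
mtcars_deciled <- mtcars %>%
  mutate(across(everything(), ~ ntile(.x, 10), .names = "{.col}_decile"))

# One summary table per measurement column, grouped by that column's decile
summaries <- map(names(mtcars), function(v) {
  mtcars_deciled %>%
    group_by(decile = .data[[paste0(v, "_decile")]]) %>%
    summarise(
      across(all_of(v),
             list(min = min, max = max, mean = mean, sd = sd),
             .names = "{.col}_decile_{.fn}"),
      .groups = "drop"
    )
})

# Join the small tables on the shared decile key to get one wide table
wide <- reduce(summaries, function(a, b) inner_join(a, b, by = "decile"))
dim(wide)
```

For mtcars this gives 10 rows and 45 columns (the decile column plus four statistics for each of the 11 variables). With the 18 columns in the question the same steps would give the 73 columns requested.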
