Question

I'm an R neophyte and have a data frame of database function runtimes that looks like this:

> head(data2)
              dbfunc runtime
1 fn_slot03_byperson  38.083
2 fn_slot03_byperson  32.396
3 fn_slot03_byperson  41.246
4 fn_slot03_byperson  92.904
5 fn_slot03_byperson 130.512
6 fn_slot03_byperson 113.853

The data covers 127 distinct functions across 1,940,170 rows.

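I would like to:

Summarise the data so that it only includes database functions with a mean runtime of more than 100 ms
Generate boxplots of the 25 slowest database functions, showing the distribution of runtimes, ordered slowest first.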

I'm particularly stumped by the summary step.

Note: I have also asked this question on stats.stackexchange.com.

Answer 1

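Here is one approach using ggplot and plyr. The steps you outlined could be combined to be a bit more efficient, but for learning purposes I'll show you the steps as you laid them out.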

#Load ggplot2 and plyr (ddply() below comes from plyr) and make some fake data
library(ggplot2)
library(plyr)
dat <- data.frame(dbfunc = rep(letters[1:10], each = 100)
                  , runtime = runif(1000, max = 300))

#Use plyr to calculate a new variable for the mean runtime by dbfunc and add as 
#a new column
dat <- ddply(dat, "dbfunc", transform, meanRunTime = mean(runtime))

#Subset only those dbfunc with mean run times greater than 100. Is this step necessary?
dat.long <- subset(dat, meanRunTime > 100)


#Reorder the levels of the dbfunc variable by mean runtime, slowest first. Note that reorder
#accepts a function like mean so if the subset step above isn't necessary, then we can simply
#use that instead.
dat.long$dbfunc <- reorder(dat.long$dbfunc, -dat.long$meanRunTime)

#Subset one more time to get the top *n* dbfunctions based on mean runtime. I chose three here...
dat.plot <- subset(dat.long, dbfunc %in% levels(dbfunc)[1:3])

#Now you have your top three dbfuncs, but a bunch of unused levels hanging out so let's drop them
dat.plot$dbfunc <- droplevels(dat.plot$dbfunc)

#Plotting time!
ggplot(dat.plot, aes(dbfunc, runtime)) + 
  geom_boxplot()

Like I said, I feel some of these steps could be combined and made more efficient, but I wanted to show you the steps as you outlined them.
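As a rough sketch of what combining the steps might look like, the version below computes the per-function means once with base R's tapply(), filters and ranks them, and then sets the factor levels directly so no separate droplevels() call is needed. The fake data, the seed, and the names `means` and `top` are illustrative assumptions, and the plot is only drawn if ggplot2 is installed.

```r
# Fake data in the same shape as above (seed chosen arbitrarily)
set.seed(1)
dat <- data.frame(dbfunc = rep(letters[1:10], each = 100),
                  runtime = runif(1000, max = 300))

# Mean runtime per function, slowest first, keeping only means above 100
means <- sort(tapply(dat$runtime, dat$dbfunc, mean), decreasing = TRUE)
means <- means[means > 100]

# Names of the top n functions (three here, as in the steps above)
top <- names(head(means, 3))

# Keep only rows for those functions; the level order controls the plot order
dat.plot <- dat[dat$dbfunc %in% top, ]
dat.plot$dbfunc <- factor(dat.plot$dbfunc, levels = top)

# Draw the boxplots if ggplot2 is available
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  print(ggplot(dat.plot, aes(dbfunc, runtime)) + geom_boxplot())
}
```

Because factor() is called with an explicit levels argument, the result carries only the levels in `top`, in slowest-first order.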

Answer 2

The summary step is easy:

attach(data2)
func_mean = tapply(runtime, dbfunc, mean)

Regarding question 1:

func_mean[func_mean > 100]
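The line above returns the means themselves. If the goal is instead to keep the original rows for the slow functions, one possible sketch uses base R's ave(), which returns each group's mean repeated for every row of that group. The toy data frame and the names `row_mean` and `slow_rows` are made up for illustration.

```r
# Toy data: one fast function (mean 20) and one slow function (mean 120)
data2 <- data.frame(
  dbfunc  = rep(c("fn_fast", "fn_slow"), each = 3),
  runtime = c(10, 20, 30, 90, 130, 140)
)

# ave() defaults to FUN = mean and returns a vector as long as the input,
# holding the group mean for each row
row_mean <- ave(data2$runtime, data2$dbfunc)

# Keep only the rows belonging to functions whose mean exceeds 100
slow_rows <- data2[row_mean > 100, ]
```

Note that individual runtimes below 100 (such as the 90 here) are kept, because the filter is on the per-function mean rather than on each row.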

Regarding question 2:

slowest25 = head(sort(func_mean, decreasing = TRUE), n=25)
sl25_data = merge(data.frame(dbfunc = names(slowest25)), data2, sort = FALSE)
plot(sl25_data$runtime ~ factor(sl25_data$dbfunc))

Hope this helps. However, the boxplots are not sorted in the plot.
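One way the ordering might be handled is to turn dbfunc into a factor whose levels follow the sorted means, since boxplot() draws groups in level order. This is a sketch on simulated data, not a tested fix for the real data set; the 40 simulated functions, the seed, and the exponential runtimes are assumptions made for illustration.

```r
# Simulated data: 40 functions with 50 runtimes each and increasing mean runtime
set.seed(42)
data2 <- data.frame(
  dbfunc  = rep(paste0("fn_", 1:40), each = 50),
  runtime = rexp(2000, rate = 1 / rep(seq(20, 410, by = 10), each = 50))
)

# Per-function means and the 25 slowest, as in the answer above
func_mean <- tapply(data2$runtime, data2$dbfunc, mean)
slowest25 <- head(sort(func_mean, decreasing = TRUE), n = 25)

# Keep only rows for the 25 slowest functions
sl25_data <- data2[data2$dbfunc %in% names(slowest25), ]

# Set the factor levels in slowest-first order; boxplot() follows level order
sl25_data$dbfunc <- factor(sl25_data$dbfunc, levels = names(slowest25))

# las = 2 rotates the axis labels so long function names stay readable
boxplot(runtime ~ dbfunc, data = sl25_data, las = 2)
```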

Answer 3

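I'm posting this as an 'answer', although Tomas's and Chase's answers are really more complete. In Chase's case I couldn't get ggplot to run and was short of time. In Tomas's case I got stuck at the sl25_data step.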

We ended up using the following, which does the job apart from one remaining problem:

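# load data frame; the file has no header row, so name the columns here
# (column order assumed from the head(data2) output in the question)
dbruntimes <- read.csv("db_runtimes.csv", header = FALSE,
                       col.names = c("dbfunc", "runtime"), stringsAsFactors = TRUE)
# calc means
meanruns <- aggregate(dbruntimes["runtime"], dbruntimes["dbfunc"], mean)
# filter
topmeanruns <- meanruns[meanruns$runtime > 100, ]
# order by means, slowest first
topmeanruns <- topmeanruns[order(topmeanruns$runtime, decreasing = TRUE), ]
# get top 25 results (as names, not as a factor carrying all 127 levels)
drawfuncs <- head(as.character(topmeanruns$dbfunc), 25)
# subset for plot
forboxplot <- subset(dbruntimes, dbfunc %in% drawfuncs)
# plot
boxplot(forboxplot$runtime ~ forboxplot$dbfunc)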

This gives us the result we were looking for, but all of the functions still show up on the x axis of the plot, not just the top 25.
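The extra entries on the x axis are most likely unused factor levels: subset() removes rows but leaves the factor's level set unchanged, and boxplot() reserves a slot for every level. A minimal sketch of one possible fix, using droplevels() on a made-up four-function data frame (the data and the names `keep` and `n_before` are illustrative only):

```r
# Toy data: four functions stored as a factor, five runtimes each
dbruntimes <- data.frame(
  dbfunc  = factor(rep(c("fn_a", "fn_b", "fn_c", "fn_d"), each = 5)),
  runtime = c(11, 12, 13, 14, 15,
              210, 220, 230, 240, 250,
              31, 32, 33, 34, 35,
              410, 420, 430, 440, 450)
)

# Pretend these two are the "top" functions to plot
keep <- c("fn_b", "fn_d")
forboxplot <- subset(dbruntimes, dbfunc %in% keep)

# The rows are gone, but all four levels are still attached to the factor
n_before <- nlevels(forboxplot$dbfunc)

# Drop the levels that no longer occur in the data
forboxplot$dbfunc <- droplevels(forboxplot$dbfunc)
n_after <- nlevels(forboxplot$dbfunc)

# Only the remaining functions now get a slot on the x axis
boxplot(runtime ~ dbfunc, data = forboxplot)
```

Rebuilding the factor with an explicit levels argument would drop the unused levels too, and would additionally let the boxes be ordered slowest first.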

R: Summarising a data frame with repeated rows into boxplots

3 answers: