Creating charts for benchmark data

Time: 2015-07-06 20:12:22

Tags: r charts ggplot2 aggregate

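I'm working with existing R code written by my predecessor. The code generates PDF reports that display data from test runs of our software.

One set of charts I'm trying to create should plot the percentage change from a "benchmark" result. The benchmark should simply be the earliest version we have data for.

Below is the part of the code currently used to build the benchmark deviation charts.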

library(ggplot2)

dbhandle <- SQLConn_remote(DBName = "DATABASE", ServerName = "SERVER")
Testdf<-sqlQuery(dbhandle, 'select * from TABLENAME 
                order by FileName, Number, Category', stringsAsFactors = FALSE)
versions<-unique(Testdf[order(Testdf$Number), ][,2])

benchmarks<-aggregate(Value~FileName, subset(Testdf, Number == 1 | Number == 2)[, c('FileName', 'Value')], mean)
names(benchmarks)[2]<-'Benchmark'

Testdf<-merge(Testdf, benchmarks)
Testdf$Version<-factor(Testdf$Version, levels = versions)
Testdf$Deviation<-Testdf$Value- Testdf$Benchmark
Testdf$DeviationP<-(Testdf$Value- Testdf$Benchmark)/Testdf$Benchmark

g<-ggplot(subset(Testdf, !is.na(Value) & Deviation <.5) , aes(color = Value, x = Version, y = Deviation, group = FileName)) + geom_line() +geom_point(aes(shape = Build), size = 1.5) +
  scale_shape_manual(values=c(1,15)) + stat_summary(fun.y=sum, geom="line") + 
  ylab("Run Time Deviation from Benchmark (min)") +  
  scale_colour_gradient(name = 'Run Time',low = 'blue', high = 'red') + 
  theme(axis.text.x = element_text(angle = 90, vjust = .5)) + theme(axis.title.y = element_text(vjust = 1))
g

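Currently, the method for calculating the "benchmark" value doesn't work. If you want to see what the code currently does, I've provided a sample data frame for R below. The part that confuses me is the benchmarks variable. Honestly, I have almost no idea what is going on there. I've never used the aggregate() function before, so the syntax is completely foreign to me, and I'm having a terrible time following the documentation (as far as I understand it). The most confusing part specifically is subset(Testdf, Number == 1 | Number == 2). Originally the code had Number == 14 | Number == 15. If I remember correctly, | means "or" (and the real data has many more Number values, in the 30+ range).

Perhaps you can help me find a smart way to generate the chart I want to make, and help me understand this code.

Edit:

I want a chart in which every entry belongs to the Run Time category and, for each FileName, the line starts at 0 so that it shows the deviation from the original. I also want the code to pick the earliest Number entry rather than just Number == 1, because sometimes there may be no entry with Number == 1. This is what I've come up with so far: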

versions<-unique(AutoRegdf[order(AutoRegdf$TestNum), ][,2])

benchmarks<-aggregate(Value~Test_Scenario, subset(AutoRegdf, min(AutoRegdf$TestNum) & Measure == 'Run Time')[, c('Test_Scenario', 'Value')], mean)
names(benchmarks)[2]<-'Benchmark'

AutoRegdf<-merge(AutoRegdf, benchmarks)
AutoRegdf$JMPTVersion<-factor(AutoRegdf$JMPTVersion, levels = versions)
AutoRegdf$Deviation<-AutoRegdf$Value- AutoRegdf$Benchmark
AutoRegdf$DeviationP<-(AutoRegdf$Value- AutoRegdf$Benchmark)/AutoRegdf$Benchmark

g<-ggplot(subset(AutoRegdf, Measure == 'Batch Time' & !is.na(Value) & Deviation <.5) , aes(color = Value, x = JMPTVersion, y = Deviation, group = Test_Scenario)) + 
  geom_line(size=.25) + geom_point(aes(shape = Build), size = 1.5) +
  scale_shape_manual(values=c(1,15)) + stat_summary(fun.y=sum, geom="line") + 
  ylab("Run Time Deviation from Benchmark (min)") +  
  scale_colour_gradient(name = 'Run Time (min)',low = 'blue', high = 'red') + 
  theme(axis.text.x = element_text(size = 10, angle = 90, vjust = .5)) + theme(axis.title.y = element_text(vjust = 1)) + 
  theme(plot.margin=unit(c(0,0,0,0),"mm"))
g

If you want to recreate this yourself, you can use this sample data frame in R.

rw1 <- c("File1", "File1", "File1", "File2", "File2", "File2", "File3", "File3", "File3", "File1", "File1", "File1", "File2", "File2", "File2", "File3", "File3", "File3", "File1", "File1", "File1", "File2", "File2", "File2", "File3", "File3", "File3")
rw2 <- c("0.01", "0.01", "0.01", "0.01", "0.01", "0.01", "0.01", "0.01", "0.01", "0.02", "0.02", "0.02", "0.02", "0.02", "0.02", "0.02", "0.02", "0.03", "0.03", "0.03", "0.03", "0.03", "0.03", "0.03", "0.03", "0.03", "0.03")
rw3 <- c("Time", "Size", "Final", "Time", "Size", "Final", "Time", "Size", "Final", "Time", "Size", "Final", "Time", "Size", "Final", "Time", "Size", "Final", "Time", "Size", "Final", "Time", "Size", "Final", "Time", "Size", "Final")
rw4 <- c(123, 456, 789, 312, 645, 978, 741, 852, 963, 369, 258, 147, 753, 498, 951, 753, 915, 438, 978, 741, 852, 963, 369, 258, 147, 753, 498)
rw5 <- c("01/01/12", "01/01/12", "01/01/12", "01/01/12", "01/01/12", "01/01/12", "01/01/12", "01/01/12", "01/01/12", "01/01/12", "01/01/12", "01/01/12", "01/01/12", "01/01/12", "01/01/12", "01/01/12", "01/01/12", "01/01/12", "01/01/12", "01/01/12", "01/01/12", "01/01/12", "01/01/12", "01/01/12", "01/01/12", "01/01/12", "01/01/12")
rw6 <- c(1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3)
rw7 <- c("Iteration", "Iteration", "Iteration", "Iteration", "Iteration", "Iteration", "Iteration", "Iteration", "Iteration", "Iteration", "Iteration", "Iteration", "Iteration", "Iteration", "Iteration", "Iteration", "Iteration", "Iteration", "Release", "Release", "Release", "Release", "Release", "Release", "Release", "Release", "Release")
rw8 <- c("None", "None", "None", "None", "None", "None", "None", "None", "None", "None", "None", "None", "None", "None", "None", "None", "None", "None", "Cannot Connect to Database", "None", "None", "None", "None", "None", "None", "None", "None")


Testdf = data.frame(rw1, rw2, rw3, rw4, rw5, rw6, rw7, rw8)
colnames(Testdf) <- c("FileName", "Version", "Category", "Value", "Date", "Number", "Build", "Error") 

1 answer:

Answer 0 (score: 2)

I'm assuming your question is specifically about how the benchmarks variable is calculated.

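First, the intent seems to be to compute, per file, the mean of Value over all rows where Number == 1 or Number == 2.

This is done in two steps.

  1. subset(Testdf, Number == 1 | Number == 2)[, c('FileName', 'Value')] returns the rows where Number is 1 or 2, keeping only the FileName and Value columns.
  2. aggregate(Value~FileName, subset(*as above*), mean) takes the mean of Value for each FileName. Because we filtered first, it only considers rows that meet the Number criterion.

The line as written produces the following result: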

    >benchmarks 
      FileName Benchmark
    1    File1 357.0
    2    File2 689.5
    3    File3 777.0
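
The two steps above can also be written out separately, which may make the one-liner easier to follow. This is only a sketch on a small made-up data frame: toy, kept and the values in it are hypothetical, and only the column names FileName, Number and Value come from the question.

```r
# Toy data (hypothetical values; column names follow the question's sample)
toy <- data.frame(
  FileName = c("A", "A", "A", "B", "B", "B"),
  Number   = c(1, 2, 3, 1, 2, 3),
  Value    = c(10, 20, 90, 100, 300, 900)
)

# Step 1: keep rows whose Number is 1 or 2, and only the two columns needed
kept <- subset(toy, Number == 1 | Number == 2)[, c("FileName", "Value")]

# Step 2: average Value within each FileName
benchmarks <- aggregate(Value ~ FileName, data = kept, FUN = mean)
names(benchmarks)[2] <- "Benchmark"

print(benchmarks)
```

The rows with Number == 3 never reach aggregate(), so they have no influence on the benchmark.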
    

They then merge this back into the original data frame by file name. A more explicit version of that code is:

    Testdf<-merge(Testdf, benchmarks, by = "FileName")
    

This produces a data frame that looks like this:

     FileName Version Category Value     Date Number     Build Error Benchmark
    1    File1    0.01     Time   123 01/01/12      1 Iteration  None       357
    2    File1    0.01     Size   456 01/01/12      1 Iteration  None       357
    3    File1    0.01    Final   789 01/01/12      1 Iteration  None       357
    4    File1    0.02    Final   147 01/01/12      2 Iteration  None       357
    5    File1    0.03    Final   852 01/01/12      3   Release  None       357
    6    File1    0.02     Time   369 01/01/12      2 Iteration  None       357
    

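Each row now carries the benchmark for its FileName, that is, the mean of Value over the rows with Number 1 or 2.

They then calculate the deviation from this benchmark, both as an absolute difference (Deviation) and as a proportion of the benchmark (DeviationP).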
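
As a small worked example of the merge and the two deviation columns, here is a sketch on made-up numbers. The data frames toy and benchmarks and their values are hypothetical; the column names and formulas mirror the question's code.

```r
# Hypothetical measurements and an already-computed benchmark per file
toy <- data.frame(
  FileName = c("A", "A", "B", "B"),
  Value    = c(10, 30, 100, 50)
)
benchmarks <- data.frame(
  FileName  = c("A", "B"),
  Benchmark = c(20, 100)
)

# Attach each file's benchmark to every one of its rows
merged <- merge(toy, benchmarks, by = "FileName")

# Absolute deviation, in the same units as Value
merged$Deviation <- merged$Value - merged$Benchmark

# Relative deviation as a fraction of the benchmark (multiply by 100 for percent)
merged$DeviationP <- merged$Deviation / merged$Benchmark

print(merged)
```

Note that the plotting code in the question puts Deviation on the y axis and filters on Deviation < .5. If a percentage change is what you want to show, DeviationP is probably the column to plot and filter on, though that depends on what the report is meant to display.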

Alternative approach

The data.table syntax may be easier to understand:

    library(data.table)
    setDT(Testdf)
    Testdf[, Benchmark := mean(Value[Number == 1 | Number == 2]), by = "FileName"]
    

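Breaking this down:

Testdf[, Because there is nothing to the left of the comma, we apply this to every row.

Benchmark := mean(Value[Number == 1 | Number == 2]) This creates a new column named Benchmark. The benchmark value is the mean of the Value column, but only over the rows where Number is 1 or 2.

, by = "FileName"] We compute the benchmark separately for each file name. One way to think about this: we take all rows where FileName == File1 and take the mean of Value (over the rows where Number is 1 or 2). Then we take all rows where FileName == File2 and do the same. The by= argument does this for each unique value of FileName.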
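
The edit to the question asks for the earliest Number per file rather than a hard-coded 1 or 2. In the attempted subset(AutoRegdf, min(AutoRegdf$TestNum) & ...), min() returns a single number, and any non-zero number is treated as TRUE by &, so that condition does not filter anything. One possible way to do it, sketched in base R on made-up data (toy, times, First and the values are hypothetical; I'm also assuming that the Category value "Time" stands in for your Run Time measure):

```r
# Hypothetical data: file A has no Number == 1 entry at all
toy <- data.frame(
  FileName = c("A", "A", "B", "B", "B"),
  Category = c("Time", "Time", "Time", "Time", "Size"),
  Number   = c(2, 3, 1, 2, 1),
  Value    = c(40, 50, 10, 30, 999)
)

# Work only with the category being charted
times <- subset(toy, Category == "Time")

# For every row, the smallest Number seen for that row's FileName
times$First <- ave(times$Number, times$FileName, FUN = min)

# Benchmark = mean Value over each file's earliest rows
first_runs <- subset(times, Number == First)
benchmarks <- aggregate(Value ~ FileName, data = first_runs, FUN = mean)
names(benchmarks)[2] <- "Benchmark"

# Attach the benchmark and compute the deviation from it
times <- merge(times, benchmarks, by = "FileName")
times$Deviation <- times$Value - times$Benchmark

print(times)
```

Because each file's benchmark comes from its own earliest rows, the deviation at that earliest Number is 0 whenever there is a single row per file at that point, which is what makes each line start at 0.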

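Next steps

The question is: what should the code do? Is taking an average the right benchmark? If so, the code above works. The chart looks messy, so there may be a problem with your ggplot code. Clarifying this will help us help you.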