Question

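I developed a system in R for plotting large datasets obtained from wind turbines. I am now porting the process to Java, and the results I get from the two systems are inconsistent.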

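As shown in the images below:

The dataset is plotted first with R (#1), then with JFreeChart (#2).
The red lines in the two plots correspond to my respective calculations in each language (details below).
The brown dashed line in #1 corresponds to the blue line in #2; there is no difference here, and it is provided for reference.
The shaded areas represent the data points, grey in #1 and red in #2.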

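I can explain the difference between the (red) calculated lines: it comes from my using a different calculation method in each system.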

In R the data is processed as follows. I wrote this code with a little help and have no idea what is going on in it (but hey, it works).

df <- data.frame(pwr = pwr, spd = spd)
require(mgcv)
mod <- gam(pwr ~ s(spd, bs = "ad", k = 20), data = df, method = "REML")
summary(mod)
x_grid <- with(df, data.frame(spd = seq(min(spd) + 0.0001, maxi, length=100)))
pred <- predict(mod, x_grid, se.fit = TRUE)
x_grid <- within(x_grid, fit <- pred$fit)
lines(fit ~ spd, data = x_grid, col = "red", lwd = thickLineWidth)

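In Java (SQL, in fact) I use a binning method to calculate the average for every 0.5 on the x-axis. The resulting data is plotted using org.jfree.chart.renderer.xy.XYSplineRenderer; I don't know much about how the line is rendered.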

SELECT 
    ROUND( ROUND( x_data * 2 ) / 2, 1)   AS x_axis, # See https://stackoverflow.com/questions/5230647/mysql-rounding-functions
    AVG( y_data )                        AS y_axis 
FROM 
    table 
GROUP BY 
    x_axis

My take on the differences between the two charts:

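A single outlier at 18 on the x-axis (most apparent in the R plot) seems to have a huge influence on the shape of the curve.
Even between 5 and 15 on the x-axis, the line in the R plot looks more continuous; it does not change trajectory as readily as the one produced by Java.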
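The "crater" visible at 18 on the x-axis in the Java plot seems to be compensated for on each side of it; I believe this is due to a polynomial effect in the rendering system.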

These are the things I would like to eliminate.
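
The effect around the crater can be reproduced with a small sketch, under the assumption that the renderer draws a natural cubic spline passing through every plotted point. The class `SplineOvershoot` and the bin values are hypothetical: a flat series with a single dip at 18. A curve forced through every point has to bend sharply to reach the dip, and that bending carries over into the neighbouring intervals.

```java
class SplineOvershoot {
    /** Second derivatives of the natural cubic spline through (x[i], y[i]); x strictly increasing. */
    static double[] secondDerivatives(double[] x, double[] y) {
        int n = x.length;
        double[] y2 = new double[n]; // ends stay 0: the "natural" boundary condition
        double[] u = new double[n];
        // Forward sweep of the tridiagonal solve.
        for (int i = 1; i < n - 1; i++) {
            double sig = (x[i] - x[i - 1]) / (x[i + 1] - x[i - 1]);
            double p = sig * y2[i - 1] + 2.0;
            y2[i] = (sig - 1.0) / p;
            double d = (y[i + 1] - y[i]) / (x[i + 1] - x[i])
                     - (y[i] - y[i - 1]) / (x[i] - x[i - 1]);
            u[i] = (6.0 * d / (x[i + 1] - x[i - 1]) - sig * u[i - 1]) / p;
        }
        // Back substitution.
        for (int k = n - 2; k >= 0; k--) {
            y2[k] = y2[k] * y2[k + 1] + u[k];
        }
        return y2;
    }

    /** Evaluate the spline at t, with x[0] <= t <= x[n-1]. */
    static double eval(double[] x, double[] y, double[] y2, double t) {
        int lo = 0, hi = x.length - 1;
        while (hi - lo > 1) { // bisection for the interval containing t
            int mid = (lo + hi) >>> 1;
            if (x[mid] > t) hi = mid; else lo = mid;
        }
        double h = x[hi] - x[lo];
        double a = (x[hi] - t) / h;
        double b = (t - x[lo]) / h;
        return a * y[lo] + b * y[hi]
             + ((a * a * a - a) * y2[lo] + (b * b * b - b) * y2[hi]) * h * h / 6.0;
    }

    public static void main(String[] args) {
        // Made-up bin means: flat at 1.0 with one low bin at 18.0.
        double[] x = {16.5, 17.0, 17.5, 18.0, 18.5, 19.0, 19.5};
        double[] y = {1, 1, 1, 0, 1, 1, 1};
        double[] y2 = secondDerivatives(x, y);
        // Every input is <= 1.0, yet the curve rises above 1.0 beside the dip.
        System.out.println(eval(x, y, y2, 17.25));
        System.out.println(eval(x, y, y2, 18.75));
    }
}
```

In this sketch the curve exceeds 1.0 between 17.0 and 17.5 (and symmetrically between 18.5 and 19.0) even though no input value does. A smoother fitted to all the observations, such as the `gam` fit in the R code, is not required to pass through any individual bin mean.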

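So, to understand the difference between the two charts, I have a few questions:

What exactly is happening in my R script?
How can I port the same process to my Java code, or indeed, do I want to?
Can anyone explain the spline system used by JFreeChart, and is there another one?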

Answer 1

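In the R code you are (as I was when I showed this example) fitting an additive model to the power and speed data, in which the relationship between the variables is determined from the data themselves. These models use splines to estimate the response function. In particular, here I used an adaptive smoother, with k = 20 setting the complexity of the fitted smooth. The more complex the smoother, the more wiggly the fitted function can be. An adaptive smoother is one in which the degree of smoothness varies across the fitted function.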

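Why is this important? Well, in your data there are ranges where the response does not change with the speed variable, and ranges where the response changes rapidly as speed changes. We have an "allowance" of wiggliness to spend on the curve. With an ordinary spline, the wiggliness (or smoothness) is the same across the whole function. With an adaptive smooth, we can spend more of the wiggliness allowance in the parts of the function where the response is changing the most, and spend little or none in the parts where the response is not changing.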

Below I annotate the code to explain what each step does:

## here we create a data frame with the pwr and spd variables
df <- data.frame(pwr = pwr, spd = spd)

## we load the package containing the code to fit the additive model
require(mgcv)

## This is the model itself, saying pwr is modelled as a smooth function of spd
## and the smooth function of spd is generated using an adaptive smoother with
## an "allowance" of 20. This allowance is a starting point and the actual
## smoothness of the curve will be estimated as part of the model fitting,
## here using a REML criterion
mod <- gam(pwr ~ s(spd, bs = "ad", k = 20), data = df, method = "REML")

## This just summarises the model fit
summary(mod)

## In this line we are creating a new spd vector (in a data frame) that contains
## 100 equally spaced spd values over the entire range of the observed spd
x_grid <- with(df, data.frame(spd = seq(min(spd) + 0.0001, maxi, length=100)))

## we will use those data to get predictions of the response pwr at each
## of the 100 values of spd we just created
## I did this so we had enough data to plot a nice smooth curve, but without
## having to predict for all the observed values of spd
pred <- predict(mod, x_grid, se.fit = TRUE)

## This line stores the 100 predicted values in the prediction data object
x_grid <- within(x_grid, fit <- pred$fit)

## This line draws the fitted smooth on to a plot of the data
## this assumes there is already a plot on the active device.
lines(fit ~ spd, data = x_grid, col = "red", lwd = thickLineWidth)

If you are not familiar with additive models and smoothers/splines, I recommend Ruppert, Wand and Carroll (2003) Semiparametric Regression. Cambridge University Press.
