How do I obtain a piecewise linear regression with a priori breakpoints?

Date: 2012-08-24 08:08:06

Tags: python r linear-regression piecewise

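I need to explain this at some length, because I lack the statistics background to put it more succinctly. I am asking on SO because I am looking for a Python solution, but I may move the question to stats.SE if that is more appropriate.

I have downhole data that might look a bit like this: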

Rt      T
0.0000  15.0000
4.0054  15.4523
25.1858 16.0761
27.9998 16.2013
35.7259 16.5914
39.0769 16.8777
45.1805 17.3545
45.6717 17.3877
48.3419 17.5307
51.5661 17.7079
64.1578 18.4177
66.8280 18.5750
111.1613    19.8261
114.2518    19.9731
121.8681    20.4074
146.0591    21.2622
148.8134    21.4117
164.6219    22.1776
176.5220    23.4835
177.9578    23.6738
180.8773    23.9973
187.1846    24.4976
210.5131    25.7585
211.4830    26.0231
230.2598    28.5495
262.3549    30.8602
266.2318    31.3067
303.3181    37.3183
329.4067    39.2858
335.0262    39.4731
337.8323    39.6756
343.1142    39.9271
352.2322    40.6634
367.8386    42.3641
380.0900    43.9158
388.5412    44.1891
390.4162    44.3563
395.6409    44.5837

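(The Rt variable can be thought of as a proxy for depth, and T is temperature.) I also have 'a priori' data: the temperature at Rt = 0 and, not shown here, some markers that I could use as breakpoints, as guides to breakpoints, or at least as a comparison for any breakpoints that are found.

The linear relationship between these two variables is disturbed over some depth intervals by certain processes. A simple linear regression is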

q, T0, r_value, p_value, std_err = stats.linregress(Rt, T)

It looks like this; you can clearly see the deviations from the line, as well as the fitted value of T0 (which should be 15):

[Image: simple linear regression of T against Rt]

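I would like to be able to perform a series of linear regressions (joined at the ends of each segment), but I would like to do so:
(a) without specifying the number or locations of the breaks,
(b) specifying the number and locations of the breaks, and
(c) calculating the coefficients for each segment.

I think I can do (b) and (c) by simply splitting the data and fitting each piece separately with a bit of care, but I don't know how to approach (a), and I wonder whether anyone knows of a simpler way to do this.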
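For (b) and (c), one option that avoids splitting the data by hand is to write the continuous piecewise-linear fit as a single least-squares problem with one hinge term max(0, Rt - b) per breakpoint b. The following is a minimal sketch, assuming NumPy is available. The function name `fit_piecewise` and the breakpoint values 170 and 300 are hypothetical, chosen only for illustration, and the requirement that T0 equal 15 is not imposed.

```python
import numpy as np

# Data from the question (Rt is a proxy for depth, T is temperature).
Rt = np.array([
    0.0000, 4.0054, 25.1858, 27.9998, 35.7259, 39.0769, 45.1805, 45.6717,
    48.3419, 51.5661, 64.1578, 66.8280, 111.1613, 114.2518, 121.8681,
    146.0591, 148.8134, 164.6219, 176.5220, 177.9578, 180.8773, 187.1846,
    210.5131, 211.4830, 230.2598, 262.3549, 266.2318, 303.3181, 329.4067,
    335.0262, 337.8323, 343.1142, 352.2322, 367.8386, 380.0900, 388.5412,
    390.4162, 395.6409,
])
T = np.array([
    15.0000, 15.4523, 16.0761, 16.2013, 16.5914, 16.8777, 17.3545, 17.3877,
    17.5307, 17.7079, 18.4177, 18.5750, 19.8261, 19.9731, 20.4074, 21.2622,
    21.4117, 22.1776, 23.4835, 23.6738, 23.9973, 24.4976, 25.7585, 26.0231,
    28.5495, 30.8602, 31.3067, 37.3183, 39.2858, 39.4731, 39.6756, 39.9271,
    40.6634, 42.3641, 43.9158, 44.1891, 44.3563, 44.5837,
])


def fit_piecewise(x, y, breakpoints):
    """Least-squares fit of a continuous piecewise-linear function.

    The breakpoints are fixed in advance. Returns the intercept at x = 0
    and an array with the slope of each segment, from left to right.
    """
    breakpoints = np.asarray(breakpoints, dtype=float)
    # Design matrix columns: 1, x, and one hinge max(0, x - b) per breakpoint.
    hinges = np.maximum(0.0, x[:, None] - breakpoints[None, :])
    A = np.column_stack([np.ones_like(x), x, hinges])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    intercept = coef[0]
    # Each hinge coefficient is the change in slope at its breakpoint,
    # so the cumulative sum gives the slope of every segment.
    slopes = np.cumsum(coef[1:])
    return intercept, slopes


# Hypothetical breakpoints, picked by eye for illustration only.
breaks = [170.0, 300.0]
intercept, slopes = fit_piecewise(Rt, T, breaks)
print("intercept (estimate of T0):", intercept)
print("slope of each segment:", slopes)
```

Because the hinge terms are zero to the left of their breakpoints, the fitted segments meet exactly at the breakpoints, which is the "joined at the ends" behaviour asked for. This does not address (a), where the breakpoints themselves are unknown.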

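I have seen this: https://stats.stackexchange.com/a/20210/9311, and I think MARS might be a good way to deal with it, but only because the result looks good; I don't really understand it. I tried it on my data in a blind cut-and-paste way and got the output below, but again, I don't understand it: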

[Image: output of the MARS attempt on the data]

3 Answers:

Answer 0: (score: 5)

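The short answer is that I solved my problem by using R to create a linear regression model and then using the segmented package to generate the piecewise linear regression from that linear model. By setting psi=NA I was able to specify the expected number of breakpoints (or knots) n via K=n, as shown below.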

The long answer:

R version 3.0.1 (2013-05-16)
Platform: x86_64-pc-linux-gnu (64-bit)

# example data:
bullard <- structure(list(Rt = c(5.1861, 10.5266, 11.6688, 19.2345, 59.2882, 
68.6889, 320.6442, 340.4545, 479.3034, 482.6092, 484.048, 485.7009, 
486.4204, 488.1337, 489.5725, 491.2254, 492.3676, 493.2297, 494.3719, 
495.2339, 496.3762, 499.6819, 500.253, 501.1151, 504.5417, 505.4038, 
507.6278, 508.4899, 509.6321, 522.1321, 524.4165, 527.0027, 529.2871, 
531.8733, 533.0155, 544.6534, 547.9592, 551.4075, 553.0604, 556.9397, 
558.5926, 561.1788, 562.321, 563.1831, 563.7542, 565.0473, 566.1895, 
572.801, 573.9432, 575.6674, 576.2385, 577.1006, 586.2382, 587.5313, 
589.2446, 590.1067, 593.4125, 594.5547, 595.8478, 596.99, 598.7141, 
599.8563, 600.2873, 603.1429, 604.0049, 604.576, 605.8691, 607.0113, 
610.0286, 614.0263, 617.3321, 624.7564, 626.4805, 628.1334, 630.9889, 
631.851, 636.4198, 638.0727, 638.5038, 639.646, 644.8184, 647.1028, 
647.9649, 649.1071, 649.5381, 650.6803, 651.5424, 652.6846, 654.3375, 
656.0508, 658.2059, 659.9193, 661.2124, 662.3546, 664.0787, 664.6498, 
665.9429, 682.4782, 731.3561, 734.6619, 778.1154, 787.2919, 803.9261, 
814.335, 848.1552, 898.2568, 912.6188, 924.6932, 940.9083), Tem = c(12.7813, 
12.9341, 12.9163, 14.6367, 15.6235, 15.9454, 27.7281, 28.4951, 
34.7237, 34.8028, 34.8841, 34.9175, 34.9618, 35.087, 35.1581, 
35.204, 35.2824, 35.3751, 35.4615, 35.5567, 35.6494, 35.7464, 
35.8007, 35.8951, 36.2097, 36.3225, 36.4435, 36.5458, 36.6758, 
38.5766, 38.8014, 39.1435, 39.3543, 39.6769, 39.786, 41.0773, 
41.155, 41.4648, 41.5047, 41.8333, 41.8819, 42.111, 42.1904, 
42.2751, 42.3316, 42.4573, 42.5571, 42.7591, 42.8758, 43.0994, 
43.1605, 43.2751, 44.3113, 44.502, 44.704, 44.8372, 44.9648, 
45.104, 45.3173, 45.4562, 45.7358, 45.8809, 45.9543, 46.3093, 
46.4571, 46.5263, 46.7352, 46.8716, 47.3605, 47.8788, 48.0124, 
48.9564, 49.2635, 49.3216, 49.6884, 49.8318, 50.3981, 50.4609, 
50.5309, 50.6636, 51.4257, 51.6715, 51.7854, 51.9082, 51.9701, 
52.0924, 52.2088, 52.3334, 52.3839, 52.5518, 52.844, 53.0192, 
53.1816, 53.2734, 53.5312, 53.5609, 53.6907, 55.2449, 57.8091, 
57.8523, 59.6843, 60.0675, 60.8166, 61.3004, 63.2003, 66.456, 
67.4, 68.2014, 69.3065)), .Names = c("Rt", "Tem"), class = "data.frame", row.names = c(NA, 
-109L))


library(segmented)  # Version: segmented_0.2-9.4

# create a linear model
out.lm <- lm(Tem ~ Rt, data = bullard)

# Set X breakpoints: Set psi=NA and K=n:
o <- segmented(out.lm, seg.Z=~Rt, psi=NA, control=seg.control(display=FALSE, K=3))
slope(o)  # defaults to confidence level of 0.95 (conf.level=0.95)

# Trickery for placing text labels
r <- o$rangeZ[, 1]
est.psi <- o$psi[, 2]
v <- sort(c(r, est.psi))
xCoord <- rowMeans(cbind(v[-length(v)], v[-1]))
Z <- o$model[, o$nameUV$Z]
id <- sapply(xCoord, function(x) which.min(abs(x - Z)))
yCoord <- broken.line(o)[id]

# create the segmented plot, add linear regression for comparison, and text labels
plot(o, lwd=2, col=2:6, main="Segmented regression", res=TRUE)
abline(out.lm, col="red", lwd=1, lty=2)  # dashed line for linear regression
text(xCoord, yCoord, 
    labels=formatC(slope(o)[[1]][, 1] * 1000, digits=1, format="f"), 
    pos = 4, cex = 1.3)

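[Image: segmented regression plot, with the simple linear regression as a dashed line and the slope of each segment as a text label]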
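For readers who want to stay in Python, as the question asks, the idea of estimating a breakpoint from the data can be sketched with a brute-force search: try many candidate breakpoints, fit the continuous two-segment model for each, and keep the one with the smallest residual sum of squares. This is not the algorithm that the segmented package implements, it handles only a single breakpoint, and the function names and synthetic data below are made up for illustration.

```python
import numpy as np


def sse_for_break(x, y, b):
    """Residual sum of squares of the continuous two-segment fit with breakpoint b."""
    A = np.column_stack([np.ones_like(x), x, np.maximum(0.0, x - b)])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    return float(resid @ resid)


def find_break(x, y, n_candidates=200):
    """Brute-force search for the single breakpoint that minimises the SSE."""
    # Keep candidates away from the ends so both segments contain data.
    lo, hi = np.quantile(x, [0.1, 0.9])
    candidates = np.linspace(lo, hi, n_candidates)
    errors = [sse_for_break(x, y, b) for b in candidates]
    return candidates[int(np.argmin(errors))]


# Synthetic data (not from the thread): the slope changes at x = 6.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 10.0, 101)
y = 1.0 + 0.5 * x + 2.0 * np.maximum(0.0, x - 6.0) + rng.normal(0.0, 0.05, x.size)

print("estimated breakpoint:", find_break(x, y))
```

The search gives no confidence intervals for the breakpoint or the slopes, which the R answer obtains from `slope(o)`, and extending it to several breakpoints by nested grids becomes expensive quickly.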

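Answer 1: (score: 1)

What you want is technically called spline interpolation, specifically order-1 spline interpolation (which joins straight line segments; order-2 joins parabolas, and so on).

There is already a question on Stack Overflow that deals with spline interpolation in Python, and it should help you with your problem. Here is the link. If you have further questions after trying those tips, please post back.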
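As a small illustration of what order-1 spline interpolation means, the sketch below uses `np.interp`, assuming NumPy is available; this is one way to do it and not necessarily the approach taken in the linked question. It uses the first few rows of the question's table. Note that the interpolant passes through every data point, so it has a knot at each sample rather than a few fitted breakpoints.

```python
import numpy as np

# First few (Rt, T) pairs from the question's table.
Rt = np.array([0.0000, 4.0054, 25.1858, 27.9998, 35.7259])
T = np.array([15.0000, 15.4523, 16.0761, 16.2013, 16.5914])

# np.interp evaluates the piecewise-linear (order-1) interpolant
# through the given points at the requested positions.
query = np.array([2.0027, 26.5928])  # midpoints of two neighbouring pairs
print(np.interp(query, Rt, T))
```

Each query point lies halfway between two neighbouring samples, so the interpolated temperature is the mean of the two neighbouring T values.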

Answer 2: (score: 1)

A very simple method (not iterative, no initial guess, no bounds to specify) is given on pages 30-31 of this paper: https://fr.scribd.com/document/380941024/Regression-par-morceaux-Piecewise-Regression-pdf. The result is:

[Image: piecewise fit obtained with this method]

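Note: the method is based on fitting an integral equation. This example is not a favourable case, because the abscissas of the points are very irregularly distributed (there are no points over large ranges), which makes the numerical integration less accurate. Nevertheless, the piecewise fit is surprisingly good.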