我需要以极其简洁的方式解释这一点,因为我没有统计学的基础知识来更简洁地解释。在SO中询问是因为我正在寻找python解决方案,但如果更合适,可能会转到stats.SE。
我有井下数据,可能有点像这样:
Rt T
0.0000 15.0000
4.0054 15.4523
25.1858 16.0761
27.9998 16.2013
35.7259 16.5914
39.0769 16.8777
45.1805 17.3545
45.6717 17.3877
48.3419 17.5307
51.5661 17.7079
64.1578 18.4177
66.8280 18.5750
111.1613 19.8261
114.2518 19.9731
121.8681 20.4074
146.0591 21.2622
148.8134 21.4117
164.6219 22.1776
176.5220 23.4835
177.9578 23.6738
180.8773 23.9973
187.1846 24.4976
210.5131 25.7585
211.4830 26.0231
230.2598 28.5495
262.3549 30.8602
266.2318 31.3067
303.3181 37.3183
329.4067 39.2858
335.0262 39.4731
337.8323 39.6756
343.1142 39.9271
352.2322 40.6634
367.8386 42.3641
380.0900 43.9158
388.5412 44.1891
390.4162 44.3563
395.6409 44.5837
(Rt变量可以被认为是深度的代理,T是温度)。我还有'先验'数据给出了Rt = 0的温度,并且未显示,我可以用作断点,断点指南或至少与任何发现的断点进行比较的一些标记。
这两个变量的线性关系在某些深度区间受某些过程的影响。简单的线性回归是
q, T0, r_value, p_value, std_err = stats.linregress(Rt, T)
看起来像这样,你可以清楚地看到偏差,并且适合T0(应该是15):
我希望能够执行一系列线性回归(在每个段的末尾加入),但我想这样做: (a)不指明休息的数目或位置, (b)指明休息的数目和地点,及 (c)计算每个段的系数
我认为我可以做(b)和(c)只是将数据分开并分别做一点点小心,但我不知道(a),并想知道是否有某种方式的人知道这可以更简单地完成。
我已经看到了这个:https://stats.stackexchange.com/a/20210/9311,我认为MARS可能是处理它的好方法,但那只是因为它看起来不错;我真的不明白。我用盲目切割的方式尝试了我的数据,并在下面输出,但是再次,我不理解它:
答案 0 :(得分:5)
简短的回答是我使用R来创建线性回归模型解决了我的问题,然后使用segmented
包从线性模型生成分段线性回归。我能够使用n
和psi=NA
指定预期的断点数(或结点)K=n
,如下所示。
答案很长:
R version 3.0.1(2013-05-16)
平台:x86_64-pc-linux-gnu(64位)
# example data:
bullard <- structure(list(Rt = c(5.1861, 10.5266, 11.6688, 19.2345, 59.2882,
68.6889, 320.6442, 340.4545, 479.3034, 482.6092, 484.048, 485.7009,
486.4204, 488.1337, 489.5725, 491.2254, 492.3676, 493.2297, 494.3719,
495.2339, 496.3762, 499.6819, 500.253, 501.1151, 504.5417, 505.4038,
507.6278, 508.4899, 509.6321, 522.1321, 524.4165, 527.0027, 529.2871,
531.8733, 533.0155, 544.6534, 547.9592, 551.4075, 553.0604, 556.9397,
558.5926, 561.1788, 562.321, 563.1831, 563.7542, 565.0473, 566.1895,
572.801, 573.9432, 575.6674, 576.2385, 577.1006, 586.2382, 587.5313,
589.2446, 590.1067, 593.4125, 594.5547, 595.8478, 596.99, 598.7141,
599.8563, 600.2873, 603.1429, 604.0049, 604.576, 605.8691, 607.0113,
610.0286, 614.0263, 617.3321, 624.7564, 626.4805, 628.1334, 630.9889,
631.851, 636.4198, 638.0727, 638.5038, 639.646, 644.8184, 647.1028,
647.9649, 649.1071, 649.5381, 650.6803, 651.5424, 652.6846, 654.3375,
656.0508, 658.2059, 659.9193, 661.2124, 662.3546, 664.0787, 664.6498,
665.9429, 682.4782, 731.3561, 734.6619, 778.1154, 787.2919, 803.9261,
814.335, 848.1552, 898.2568, 912.6188, 924.6932, 940.9083), Tem = c(12.7813,
12.9341, 12.9163, 14.6367, 15.6235, 15.9454, 27.7281, 28.4951,
34.7237, 34.8028, 34.8841, 34.9175, 34.9618, 35.087, 35.1581,
35.204, 35.2824, 35.3751, 35.4615, 35.5567, 35.6494, 35.7464,
35.8007, 35.8951, 36.2097, 36.3225, 36.4435, 36.5458, 36.6758,
38.5766, 38.8014, 39.1435, 39.3543, 39.6769, 39.786, 41.0773,
41.155, 41.4648, 41.5047, 41.8333, 41.8819, 42.111, 42.1904,
42.2751, 42.3316, 42.4573, 42.5571, 42.7591, 42.8758, 43.0994,
43.1605, 43.2751, 44.3113, 44.502, 44.704, 44.8372, 44.9648,
45.104, 45.3173, 45.4562, 45.7358, 45.8809, 45.9543, 46.3093,
46.4571, 46.5263, 46.7352, 46.8716, 47.3605, 47.8788, 48.0124,
48.9564, 49.2635, 49.3216, 49.6884, 49.8318, 50.3981, 50.4609,
50.5309, 50.6636, 51.4257, 51.6715, 51.7854, 51.9082, 51.9701,
52.0924, 52.2088, 52.3334, 52.3839, 52.5518, 52.844, 53.0192,
53.1816, 53.2734, 53.5312, 53.5609, 53.6907, 55.2449, 57.8091,
57.8523, 59.6843, 60.0675, 60.8166, 61.3004, 63.2003, 66.456,
67.4, 68.2014, 69.3065)), .Names = c("Rt", "Tem"), class = "data.frame", row.names = c(NA,
-109L))
library(segmented) # Version: segmented_0.2-9.4
# create a linear model
out.lm <- lm(Tem ~ Rt, data = bullard)
# Set X breakpoints: Set psi=NA and K=n:
o <- segmented(out.lm, seg.Z=~Rt, psi=NA, control=seg.control(display=FALSE, K=3))
slope(o) # defaults to confidence level of 0.95 (conf.level=0.95)
# Trickery for placing text labels
r <- o$rangeZ[, 1]
est.psi <- o$psi[, 2]
v <- sort(c(r, est.psi))
xCoord <- rowMeans(cbind(v[-length(v)], v[-1]))
Z <- o$model[, o$nameUV$Z]
id <- sapply(xCoord, function(x) which.min(abs(x - Z)))
yCoord <- broken.line(o)[id]
# create the segmented plot, add linear regression for comparison, and text labels
plot(o, lwd=2, col=2:6, main="Segmented regression", res=TRUE)
abline(out.lm, col="red", lwd=1, lty=2) # dashed line for linear regression
text(xCoord, yCoord,
labels=formatC(slope(o)[[1]][, 1] * 1000, digits=1, format="f"),
pos = 4, cex = 1.3)
答案 1 :(得分:1)
你想要的是技术上称为spline interpolation,特别是order-1样条插值(它将连接直线段; order-2连接抛物线等)。
这里已经有一个关于Stack Overflow处理Python中Spline Interpolation的问题,它将帮助您解决问题。这是link。如果您在尝试这些提示后还有其他问题,请回复。
答案 2 :(得分:1)
本文第30-31页提供了一个非常简单的方法(不是迭代,没有初始猜测,没有限制指定):https://fr.scribd.com/document/380941024/Regression-par-morceaux-Piecewise-Regression-pdf。结果是:
注意:该方法基于积分方程的拟合。本例不是一个有利的例子,因为这些点的横坐标的分布很不规则(大范围内没有点)。这使得数值积分不太准确。然而,分段拟合出人意料地不错。