在R中执行线性回归时使用实验不确定性

时间:2015-12-22 14:56:28

标签: r regression lm

我有一个实验数据集,我想要适合多项式。数据包括独立变量,因变量和后者的测量中的不确定性,例如

using (XmlReader reader = XmlReader.Create(path))
{
    reader.MoveToContent();
    while (reader.ReadToFollowing("trkpt")) // advance the reader to the next trkpt element
    {
        List<double> coordinates = Coordinates(reader);
        double lat = coordinates[0];
        double lon = coordinates[1];
        reader.ReadToFollowing("ele"); // reader.Read() might work here, depending on what state Coordinates left the reader in
        double ele = GetEle(reader);
        reader.ReadToFollowing("time");
        DateTime date = GetDate(reader);
        reader.ReadToFollowing("extensions"); // note: it's not clear what GetSpeed and GetAzimuth need - do they expect the reader to be on the extensions element or the gte:gps element?
        float speed = GetSpeed(reader);
        float azimuth = GetAzimuth(reader);
        GPXDataObject gpxDataObject = new GPXDataObject(lat, lon, ele, date, speed, azimuth);
        gpxDataList.Add(gpxDataObject);
    }         
}

(列 x y + - y )。

我可以使用上面的例子进行多项式拟合,例如

2000  0.2084272   0.002067834
2500  0.207078125 0.001037248
3000  0.2054202   0.001959138
3500  0.203488075 0.000328942
4000  0.2013152   0.000646088
4500  0.198933825 0.001375657
5000  0.196375    0.000908696
5500  0.193668575 0.00014721
6000  0.1908432   0.000526976
6500  0.187926325 0.001217318
7000  0.1849442   0.000556495
7500  0.181921875 0.000401883
8000  0.1788832   0.001446992
8500  0.175850825 0.001235017
9000  0.1728462   0.001676249
9500  0.169889575 0.001011735
10000 0.167       0.000326678

但这并没有利用不确定性值。如何通知R数据集的第三列是不确定的,因此应该用于回归分析?

1 个答案:

答案 0 :(得分:1)

由于因变量中的测量误差与自变量不相关,估计的系数是无偏的,但标准误差太小。这是我使用的参考(第1页和第2页): http://people.stfx.ca/tleo/econ370term2lec4.pdf

我认为您只需要调整lm()计算出的标准误差。这就是我在下面的代码中尝试做的事情。我不是一个统计人员所以你可能想发帖交叉验证并要求更好的直觉。

对于下面的例子,我假设“不确定性”列是标准偏差(或标准误差)。为简单起见,我将模型更改为:y~x。

# initialize dataset
df <- data.frame(
        x = c(2000,2500,3000,3500,4000,4500,5000,5500,6000,6500,7000,7500,8000,8500,9000,9500,10000),
        y = c(0.2084272,0.207078125,0.2054202,0.203488075,0.2013152,0.198933825, 0.196375,0.193668575, 0.1908432, 0.187926325,   0.1849442,   0.181921875, 0.1788832, 0.175850825, 0.1728462,0.169889575,  0.167),
        y.err = c(0.002067834, 0.001037248,  0.001959138, 0.000328942, 0.000646088, 0.001375657, 0.000908696, 0.00014721, 0.000526976, 0.001217318, 0.000556495, 0.000401883, 0.001446992, 0.001235017, 0.001676249, 0.001011735, 0.000326678)
)

df

#  model regression
model <- lm(y~x, data = df)
summary(model)

#  get the variance of the measurement error
#  thanks to: http://schools-wikipedia.org/wp/v/Variance.htm
#  law of total variance
pooled.var.y.err <- mean((df$y.err)^2) + var((df$y.err)^2)

# get variance of beta from model
#   thanks to: http://stats.stackexchange.com/questions/44838/how-are-the-standard-errors-of-coefficients-calculated-in-a-regression
X <- cbind(1, df$x)
#     (if you add more variables to the model you need to modify the following line)
var.betaHat <- anova(model)[[3]][2] * solve(t(X) %*% X) 

# add betaHat variance to measurement error variance
var.betaHat.adj <- var.betaHat + pooled.var.y.err

# calculate adjusted standard errors 
sqrt(diag(var.betaHat.adj))

# compare to un-adjusted standard errors
sqrt(diag(var.betaHat))