将数据分成两个独立的组s.t.具有一个连续预测器的残差平方和被最小化

时间:2015-02-09 03:30:00

标签: r algorithm statistics grouping partitioning

将一组数据分成两组的基本算法是什么?两个单独的残差平方和之和是否最小化?例如,请考虑以下代码。基本上,如何在不迭代测试每个可能值的情况下计算存储在best.cutpoint$RSS中的值?

set.seed(1)
ind.var <- runif(1000, 1, 50000)
dep.var <- ind.var * runif(1000, 2, 3) + rnorm(1000, 100, 500)

dat <- data.frame(ind.var, dep.var)

best.cutpoint <- list(RSS = Inf, cutpoint = NA)
for(cutpoint in sort(unique(ind.var))){
    # partition data
    dat1 <- dat[dat$ind.var > cutpoint,]
    dat2 <- dat[!(dat$ind.var > cutpoint),]

    if(nrow(dat1) < 2 | nrow(dat2) < 2){
        next
    }
    # estimate
    mod1 <- lm(dep.var ~ ind.var, dat = dat1)
    mod2 <- lm(dep.var ~ ind.var, dat = dat2)

    # calculate RSS
    part1.RSS <- sum((dat1$dep.var - (mod1$coefficients['(Intercept)'] + dat1$ind.var * mod1$coefficients['ind.var'])) ^ 2)
    part2.RSS <- sum((dat2$dep.var - (mod2$coefficients['(Intercept)'] + dat2$ind.var * mod2$coefficients['ind.var'])) ^ 2)

    total <- part1.RSS + part2.RSS

    if(total < best.cutpoint$RSS){
        best.cutpoint <- list(RSS = total, cutpoint = cutpoint)
    }
}

从以下可能值范围生成以下结果。

> print(best.cutpoint)
$RSS
[1] 75241532557

$cutpoint
[1] 34351.46

> range(dat$ind.var)
[1]    66.73151 49996.52975

1 个答案:

答案 0 :(得分:2)

听起来像是在问你如何确定segmented or piecewise linear regression的断点。如果不是这样,请告诉我。

该包用于此目的Segmented

首先让我们生成一些数据:

x<-seq(1:20)
y<-c(seq(1:10),seq(10,100,by=10))
plot(x,y)

此数据如下,

enter image description here

“断点”的位置非常明显。

接下来,让我们使用分段包装模型:

library(segmented)
lin.mod <- lm(y~x)
segmented.mod <- segmented(lin.mod, seg.Z = ~x, psi=14)

是否找到了断点?

plot(segmented.mod)
points(x,y)

enter image description here

看起来确实如此。

> segmented.mod
Call: segmented.lm(obj = lin.mod, seg.Z = ~x, psi = 14)

Meaningful coefficients of the linear terms:
(Intercept)            x         U1.x  
     0.1818       0.9545       9.0455  

Estimated Break-Point(s) psi1.x : 11.08 

其中seg.z和psi定义为:

seg.Z a formula with no response variable, such as seg.Z=~x1+x2, indicating the
(continuous) explanatory variables having segmented relationships with the response.
Currently, formulas involving functions, such as seg.Z=~log(x1) or
seg.Z=~sqrt(x1), or selection operators, such as seg.Z=~d[,"x1"] or seg.Z=~d$x1,
are not allowed.
psi named list of vectors. The names have to match the variables of the seg.Z
argument. Each vector includes starting values for the break-point(s) for the
corresponding variable in seg.Z. If seg.Z includes only a variable, psi may be
a numeric vector. A NA value means that ‘K’ quantiles (or equally spaced values)
are used as starting values; K is fixed via the seg.control auxiliary function.