Question

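I am trying to implement a bin smooth (which I have also seen called a step function or a regressogram) in R using glm. It works perfectly as long as there are not too many bins. This was my first attempt, but predict fails once I use more than about 15 bins: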

binsmoothfit <- glm(mpg ~ cut(displacement, 20), data=auto)
predict(binsmoothfit, data.frame(displacement=min(displacement):max(displacement)))
#Error in model.frame.default(Terms, newdata, na.action = na.action, 
#xlev =object$xlevels) :factor cut(displacement, 20) has new levels (203,223],    
#(281,300],(320,339]

I think this is because some of the intervals produced by cut are empty:

table(cut(displacement,20))
#(67.6,87]  (87,106] (106,126] (126,145] (145,165] (165,184] (184,203] (203,223]
# 30        77        58        31        22         9        13         0        
#(223,242] (242,262] (262,281] (281,300] (300,320] (320,339] (339,358] (358,378] 
# 32        25        3         0         42        0         27        4           
#(378,397] (397,417] (417,436] (436,455] 
# 3         13        3         6
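One way to check this guess is to compare the intervals that cut creates with the levels the fitted model actually stores. The sketch below uses R's built-in mtcars data (mpg, disp) as a stand-in for the Auto MPG file, so the variable names and bin counts are assumptions and will differ from the output above. glm drops factor levels that have no observations when it builds the model frame, so intervals that are empty in the training data are unknown to the fitted model.

```r
# Stand-in data: built-in mtcars (mpg, disp), not the Auto MPG file.
fit <- glm(mpg ~ cut(disp, 20), data = mtcars)

all_bins   <- levels(cut(mtcars$disp, 20))  # all 20 intervals created by cut
kept_bins  <- fit$xlevels[[1]]              # levels the fitted model knows about
empty_bins <- setdiff(all_bins, kept_bins)  # intervals with no training data

print(table(cut(mtcars$disp, 20)))
print(empty_bins)
```

Any new value that falls into one of the intervals in empty_bins would trigger the "factor has new levels" error in predict.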

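So I tried using quantiles instead, but that does not work either. I don't really understand why, since every interval now contains some data points: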

binsmoothfit <- glm(mpg ~ cut(displacement, breaks=quantile(displacement, 
probs=seq(0,1,1.0/20))), data=auto)   
predict(binsmoothfit, data.frame(displacement=min(displacement):max(displacement)))
# Error in model.frame.default(Terms, newdata, na.action = na.action, xlev = 
# object$xlevels) : factor cut(displacement, breaks = quantile(displacement, probs = 
#seq(0, 1, 1/20))) has new levels (68,87.3], (87.3,107], (107,126], (126,145], 
#(145,165], (165,184], (184,203], (203,223], (223,242], (242,262], (262,281], 
#(281,300], (300,320], (320,339], (339,358], (358,378], (378,397], (397,416], 
#(416,436], (436,455]

table(cut(displacement,breaks=quantile(displacement, probs=seq(0,1,1.0/20))))
#(68,85]   (85,90]   (90,97]   (97,98]  (98,104] (104,112] (112,120] (120,122]  
#25        18        34        19         3        23        24        18         
#(122,140] (140,148] (148,168] (168,200] (200,231] (231,250] (250,262] (262,305] 
# 27         7         22        19        21        28        10        19         
#(305,318] (318,350] (350,400] (400,455] 
# 24        19        28         9
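The quantile version appears to fail for a different reason than empty intervals: predict re-evaluates the whole expression cut(displacement, breaks = quantile(displacement, ...)) on the new data, so the breaks are recomputed from the prediction grid and every interval gets a new label (the levels in the error message match an evenly spaced grid rather than the table above). A minimal sketch of one workaround follows, using R's built-in mtcars data as a stand-in for the Auto MPG file. The names brks and newd are illustrative, and only 5 bins are used because mtcars has just 32 rows. The idea is to compute the breaks once from the training data and pass that fixed vector to cut.

```r
# Stand-in data: built-in mtcars (mpg, disp), not the Auto MPG file.
# Compute the breaks once, from the training data only.
brks <- unique(quantile(mtcars$disp, probs = seq(0, 1, length.out = 6)))

# include.lowest = TRUE keeps the minimum value inside the first bin.
fit <- glm(mpg ~ cut(disp, breaks = brks, include.lowest = TRUE), data = mtcars)

# Prediction grid over the observed range; cut() reuses the same brks here.
newd <- data.frame(disp = seq(min(mtcars$disp), max(mtcars$disp), length.out = 50))
pred <- predict(fit, newdata = newd)
print(head(pred))
```

Because brks is a plain vector rather than an expression in the binned variable, the intervals used for fitting and for prediction are identical.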

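Does anyone know what to do? Is there a good way to merge the intervals that contain no data points? Or is there another approach altogether? What does "factor has new levels" mean? I would really like to stick with glm, because it gives me prediction, cross-validation, etc. for free.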
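On merging empty intervals, one possible approach is to build the equal-width breaks explicitly, count the observations per bin, and drop the upper edge of every empty bin so that it merges into its right-hand neighbour. This is a sketch under assumptions, not a method from any package: the helper name merge_empty_bins is hypothetical, and R's built-in mtcars data stands in for the Auto MPG file.

```r
# Hypothetical helper: drop the upper edge of each empty bin so that the
# bin merges with its right-hand neighbour.
merge_empty_bins <- function(x, n_bins) {
  brks   <- seq(min(x), max(x), length.out = n_bins + 1)
  counts <- as.vector(table(cut(x, breaks = brks, include.lowest = TRUE)))
  # Break j + 1 is the upper edge of bin j; the first break is always kept.
  brks[c(TRUE, counts > 0)]
}

# Stand-in data: built-in mtcars (mpg, disp), not the Auto MPG file.
brks <- merge_empty_bins(mtcars$disp, 20)
fit  <- glm(mpg ~ cut(disp, breaks = brks, include.lowest = TRUE), data = mtcars)

# Every grid value now falls into an interval that had training data.
newd <- data.frame(disp = seq(min(mtcars$disp), max(mtcars$disp), by = 1))
pred <- predict(fit, newdata = newd)
print(table(cut(mtcars$disp, breaks = brks, include.lowest = TRUE)))
```

The first and last bins always contain the minimum and maximum, so only interior bins can be empty and every empty bin has a non-empty neighbour somewhere to its right.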

The data I am using is the Auto MPG dataset from the UCI Machine Learning Repository:

auto <- read.table("http://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data",
col.names = c("mpg", "cylinders", "displacement", "horsepower", "weight", 
"acceleration", "model_year", "origin", "car_name"))

attach(auto)

Smoothing in R using glm
