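I'm wondering whether there is a cleaner way than dummy-coding the months by hand (e.g., isJan, isFeb, ...) to get more meaningful names for the independent variables (the rows listed under the intercept). My data set is fairly large, so I've simulated a simple one here.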
#create simulated data set with sales, and date
sales <- rnorm(1000, mean = 1000, sd = 40)
dates <- seq(from = 14610, to = 15609)
data <- cbind(sales, dates)
#regression with months
model <- lm(sales ~ months(dates))
summary(model)
I'd like the coefficient labels to show the actual months they refer to. Currently my output looks like this:
                 Estimate Std. Error t value Pr(>|t|)
(Intercept)      999.1934     1.2673 788.432   <2e-16 ***
months(dates).L   -4.9537     4.5689  -1.084   0.2785
months(dates).Q   -6.4931     4.4211  -1.469   0.1422
months(dates).C   -5.5078     4.4180  -1.247   0.2128
months(dates)^4    2.3713     4.4864   0.529   0.5972
months(dates)^5   -1.7749     4.4605  -0.398   0.6908
months(dates)^6    1.5774     4.4555   0.354   0.7234
months(dates)^7  -10.9954     4.4511  -2.470   0.0137 *
months(dates)^8   -0.9627     4.4032  -0.219   0.8270
months(dates)^9    1.8847     4.2996   0.438   0.6612
months(dates)^10  -8.5990     4.1776  -2.058   0.0398 *
months(dates)^11   7.8436     4.1292   1.900   0.0578 .
Thanks in advance, --JT
Answer 0 (score: 6)
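The problem you are having is that R has created an ordered factor, and the contrasts produced for ordered factors are polynomial contrasts (.L is linear, .Q is quadratic, .C is cubic, and ^n is the n-th order polynomial). It is best to define month as a plain (unordered) factor with January as the first level, and then fit the model.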
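To see where those labels come from, here is a small sketch. The three-level factor is a made-up example rather than the question's data, and it assumes the session still has R's default contrasts option, under which ordered factors get contr.poly and unordered factors get contr.treatment.

```r
# Assumes options("contrasts") is still at R's default:
# contr.treatment for unordered factors, contr.poly for ordered ones.
lv <- c("low", "mid", "high")                      # hypothetical levels
size_ord   <- factor(lv, levels = lv, ordered = TRUE)
size_unord <- factor(lv, levels = lv)

colnames(contrasts(size_ord))    # polynomial labels
colnames(contrasts(size_unord))  # names of the non-reference levels

# Twelve ordered levels produce eleven polynomial terms,
# labelled the same way as the coefficients in the question's output.
poly_labels <- colnames(contr.poly(12))
poly_labels
```

With an unordered factor the coefficient names are built from the level names, which is what the question is after.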
If you are in an English-language locale, you can use the month.name or month.abb constants, as follows:
set.seed(42)
dat <- data.frame(sales = rnorm(1000, mean = 1000, sd = 40),
dates = as.Date(seq(from = 14610, to = 15609),
origin = "1970-01-01"))
dat <- transform(dat, month = factor(format(dates, format = "%B"),
levels = month.name))
This gives
> head(dat)
      sales      dates   month
1 1054.8383 2010-01-01 January
2  977.4121 2010-01-02 January
3 1014.5251 2010-01-03 January
4 1025.3145 2010-01-04 January
5 1016.1707 2010-01-05 January
6  995.7550 2010-01-06 January
> with(dat, levels(month))
 [1] "January"   "February"  "March"     "April"     "May"
 [6] "June"      "July"      "August"    "September" "October"
[11] "November"  "December"
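Note that the levels are in calendar order rather than alphabetical order. If you are in a non-English locale, the output of "%B" will be the month names in your local language or convention. In that case you need to supply the correct levels, as a character vector, to the levels argument in the code above.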
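For the non-English case, one possible way to get the levels is to let R format one date per month in the current locale. This is only a sketch: month_levels is a name I made up, and the year 2000 is arbitrary.

```r
# Month names in whatever locale the session is using, in calendar order.
# ISOdate() is vectorised over the month argument; the year and day are arbitrary.
month_levels <- format(ISOdate(2000, 1:12, 1), "%B")

dates <- as.Date(seq(from = 14610, to = 15609), origin = "1970-01-01")

# Both the values and the levels come from "%B", so they match in any locale.
month <- factor(format(dates, "%B"), levels = month_levels)

levels(month)
anyNA(month)   # FALSE when every date matched one of the levels
```

Because the values and the levels are produced by the same format string, no level should end up as NA, whatever language the month names are in.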
This data set can then be used to fit the model and get more meaningful coefficient names:
> mod <- lm(sales ~ month, data = dat)
> summary(mod)
Call:
lm(formula = sales ~ month, data = dat)
Residuals:
     Min       1Q   Median       3Q      Max
-140.333  -24.551    0.108   28.102  134.349
Coefficients:
                Estimate Std. Error t value Pr(>|t|)
(Intercept)    1001.7034     4.1567 240.983   <2e-16 ***
monthFebruary    -8.3618     6.0153  -1.390    0.165
monthMarch       -0.5347     5.8785  -0.091    0.928
monthApril       -7.5618     5.9273  -1.276    0.202
monthMay         -2.2961     5.8785  -0.391    0.696
monthJune         3.5091     5.9273   0.592    0.554
monthJuly        -4.9975     5.8785  -0.850    0.395
monthAugust      -0.3558     5.8785  -0.061    0.952
monthSeptember    3.7597     5.9970   0.627    0.531
monthOctober     -2.5948     6.5724  -0.395    0.693
monthNovember   -10.5670     6.6378  -1.592    0.112
monthDecember    -6.9064     6.5724  -1.051    0.294
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 40.09 on 988 degrees of freedom
Multiple R-squared: 0.01173, Adjusted R-squared: 0.0007317
F-statistic: 1.066 on 11 and 988 DF, p-value: 0.3854
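In the above, note that January is the first level, so its mean is the (Intercept) estimate, and the other estimates are deviations from the January mean. An alternative parameterization of the model suppresses the intercept: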
> mod2 <- lm(sales ~ month - 1, data = dat)
> summary(mod2)
Call:
lm(formula = sales ~ month - 1, data = dat)
Residuals:
     Min       1Q   Median       3Q      Max
-140.333  -24.551    0.108   28.102  134.349
Coefficients:
               Estimate Std. Error t value Pr(>|t|)
monthJanuary   1001.703      4.157   241.0   <2e-16 ***
monthFebruary   993.342      4.348   228.5   <2e-16 ***
monthMarch     1001.169      4.157   240.9   <2e-16 ***
monthApril      994.142      4.225   235.3   <2e-16 ***
monthMay        999.407      4.157   240.4   <2e-16 ***
monthJune      1005.213      4.225   237.9   <2e-16 ***
monthJuly       996.706      4.157   239.8   <2e-16 ***
monthAugust    1001.348      4.157   240.9   <2e-16 ***
monthSeptember 1005.463      4.323   232.6   <2e-16 ***
monthOctober    999.109      5.091   196.3   <2e-16 ***
monthNovember   991.136      5.175   191.5   <2e-16 ***
monthDecember   994.797      5.091   195.4   <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 40.09 on 988 degrees of freedom
Multiple R-squared: 0.9984, Adjusted R-squared: 0.9984
F-statistic: 5.175e+04 on 12 and 988 DF, p-value: < 2.2e-16
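Now the estimates are the monthly means, and each t-test is of the hypothesis that the mean for that individual month is zero (0). The much larger R-squared does not indicate a better fit: without an intercept, R computes R-squared relative to zero rather than to the overall mean, and the residual standard error is the same in both models.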
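As a quick check of the two parameterizations, the sketch below refits both models and compares the coefficients with the raw monthly means. It rebuilds the data with the same seed, but derives the month names from the month number so that it does not depend on an English locale; that detail and the name month_means are my own additions.

```r
set.seed(42)
dat <- data.frame(sales = rnorm(1000, mean = 1000, sd = 40),
                  dates = as.Date(seq(from = 14610, to = 15609),
                                  origin = "1970-01-01"))

# Month number -> English name, so the result is the same in any locale
dat$month <- factor(month.name[as.integer(format(dat$dates, "%m"))],
                    levels = month.name)

mod  <- lm(sales ~ month, data = dat)       # intercept = January mean
mod2 <- lm(sales ~ month - 1, data = dat)   # one coefficient per month

month_means <- as.vector(tapply(dat$sales, dat$month, mean))

# Without an intercept, the coefficients are the monthly means
all.equal(as.vector(coef(mod2)), month_means)

# With an intercept, January mean plus each offset gives the same means
all.equal(as.vector(coef(mod)[1] + c(0, coef(mod)[-1])), month_means)
```

Both comparisons should return TRUE, which shows that the two fits describe the same model and differ only in how the coefficients are expressed.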
Answer 1 (score: 2)
Create a month variable that is a factor, and R will create nice names automatically.
sales <- rnorm(1000, mean = 1000, sd = 40)
dates <- as.Date(seq(from = 14610, to = 15609), origin = "1970-01-01")
data <- data.frame(sales, dates)
data$months <- as.factor(months(dates))
model <- lm(sales ~ months, data = data)
summary(model)
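It automatically picks April as the reference (baseline) month, because the factor levels are sorted alphabetically, but you can change this using contrasts.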
Coefficients:
                 Estimate Std. Error t value Pr(>|t|)
(Intercept)     1001.3989     4.2880 233.535   <2e-16 ***
monthsAugust       6.8982     6.0150   1.147   0.2517
monthsDecember    -6.0561     6.7140  -0.902   0.3673
monthsFebruary    -1.3977     6.1527  -0.227   0.8203
monthsJanuary     -3.2086     6.0150  -0.533   0.5939
monthsJuly       -10.0742     6.0150  -1.675   0.0943 .
monthsJune        -3.3393     6.0641  -0.551   0.5820
monthsMarch        0.3159     6.0150   0.053   0.9581
monthsMay         -0.1448     6.0150  -0.024   0.9808
monthsNovember     3.4901     6.7799   0.515   0.6068
monthsOctober      3.2082     6.7140   0.478   0.6329
monthsSeptember   -7.3039     6.1343  -1.191   0.2341
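One simple way to change the reference month is to reorder the factor levels with relevel(), instead of editing the contrasts directly. The sketch below takes that route, which is a swapped-in technique and not the one named in the answer. The object names dat2 and fit and the seed are made up for illustration, and the month names are taken from month.name so that the result does not depend on the locale.

```r
set.seed(1)  # arbitrary seed, only so this sketch is reproducible
dates <- as.Date(seq(from = 14610, to = 15609), origin = "1970-01-01")

dat2 <- data.frame(
  sales  = rnorm(1000, mean = 1000, sd = 40),
  # English month names regardless of locale; as.factor() sorts the levels
  months = as.factor(month.name[as.integer(format(dates, "%m"))])
)

levels(dat2$months)[1]   # alphabetical ordering puts April first

# Make January the reference level; the other levels keep their order
dat2$months <- relevel(dat2$months, ref = "January")

fit <- lm(sales ~ months, data = dat2)
names(coef(fit))         # January is now absorbed into the intercept
```

After this, the intercept is the January mean and the remaining coefficients are differences from January, although the other months are still listed alphabetically.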