Question

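I'm wondering whether there is a cleaner way than manually dummy-coding the months (e.g., isJan, isFeb, ...) to get more meaningful names for my independent variables (the rows under the intercept). My dataset is quite large, so I've simulated a simple one here.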

#create simulated data set with sales, and date
sales <- rnorm(1000, mean = 1000, sd = 40)
dates <- seq(from = 14610, to = 15609)
data <- cbind(sales, dates)

#regression with months 
model <- lm(sales ~ months(dates))
summary(model)

I would like the coefficient labels to show the actual months they refer to. Currently my output looks like this:

                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)      999.1934     1.2673 788.432   <2e-16 ***
months(dates).L   -4.9537     4.5689  -1.084   0.2785    
months(dates).Q   -6.4931     4.4211  -1.469   0.1422    
months(dates).C   -5.5078     4.4180  -1.247   0.2128    
months(dates)^4    2.3713     4.4864   0.529   0.5972    
months(dates)^5   -1.7749     4.4605  -0.398   0.6908    
months(dates)^6    1.5774     4.4555   0.354   0.7234    
months(dates)^7  -10.9954     4.4511  -2.470   0.0137 *  
months(dates)^8   -0.9627     4.4032  -0.219   0.8270    
months(dates)^9    1.8847     4.2996   0.438   0.6612    
months(dates)^10  -8.5990     4.1776  -2.058   0.0398 *  
months(dates)^11   7.8436     4.1292   1.900   0.0578 .

Thanks in advance, --JT

Answer 1

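The problem you are having is that R has created an ordered factor, and the contrasts produced for ordered factors are polynomial contrasts (.L is linear, .Q is quadratic, .C is cubic, and ^n is the nth-order polynomial term). It is better to define month as an unordered factor, set the first level to January, and then fit the model.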
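To see where the .L, .Q labels come from, here is a minimal sketch comparing the default contrasts for an ordered and an unordered factor. The toy vector and the names `m_ord` and `m_unord` are made up for illustration, and the sketch assumes R's default `options("contrasts")` (treatment contrasts for unordered factors, polynomial contrasts for ordered ones).

```r
# Toy three-level month variable (illustrative data only).
x <- month.name[c(1, 2, 3, 1, 2, 3)]

# Ordered factor: default contrasts are polynomial.
m_ord <- factor(x, levels = month.name[1:3], ordered = TRUE)
colnames(contrasts(m_ord))    # ".L" ".Q"

# Unordered factor: default contrasts are treatment contrasts,
# named after the non-reference levels.
m_unord <- factor(m_ord, ordered = FALSE)
colnames(contrasts(m_unord))  # "February" "March"
```

The column names of the contrast matrix are what lm() appends to the variable name, which is why an unordered factor yields labels such as monthFebruary.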

If you are in an English-language locale, you can use the month.name or month.abb constants, like this:

set.seed(42)
dat <- data.frame(sales = rnorm(1000, mean = 1000, sd = 40),
                  dates = as.Date(seq(from = 14610, to = 15609),
                                  origin = "1970-01-01"))
dat <- transform(dat, month = factor(format(dates, format = "%B"),
                                     levels = month.name))

This gives

> head(dat)
      sales      dates   month
1 1054.8383 2010-01-01 January
2  977.4121 2010-01-02 January
3 1014.5251 2010-01-03 January
4 1025.3145 2010-01-04 January
5 1016.1707 2010-01-05 January
6  995.7550 2010-01-06 January
> with(dat, levels(month))
 [1] "January"   "February"  "March"     "April"     "May"      
 [6] "June"      "July"      "August"    "September" "October"  
[11] "November"  "December"

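Note that the levels are in logical (calendar) order rather than alphabetical order. If you are in a non-English locale, the output of "%B" will be the month names in your local language or convention. In that case you need to supply the correct levels, as a character vector, to the levels argument in the code above.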
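For non-English locales, one possible way to get the levels vector is to format twelve known dates with the same "%B" format, so the levels always match what format() returns in the current session. This is only a sketch; `month_levels` is a name introduced here, and the reference year 2000 is arbitrary.

```r
# Same simulated dates as in the answer above.
dates <- as.Date(seq(from = 14610, to = 15609), origin = "1970-01-01")

# Month names for January..December in whatever locale the session uses.
month_levels <- format(seq(as.Date("2000-01-01"), by = "month", length.out = 12),
                       format = "%B")

# Levels are in calendar order and match the formatted dates by construction.
month <- factor(format(dates, format = "%B"), levels = month_levels)

levels(month)  # calendar order, in the local language
anyNA(month)   # FALSE when every formatted name matched a level
```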

This data set can then be used to fit the model and get more meaningful coefficient names:

> mod <- lm(sales ~ month, data = dat)
> summary(mod)

Call:
lm(formula = sales ~ month, data = dat)

Residuals:
     Min       1Q   Median       3Q      Max 
-140.333  -24.551    0.108   28.102  134.349 

Coefficients:
                Estimate Std. Error t value Pr(>|t|)    
(Intercept)    1001.7034     4.1567 240.983   <2e-16 ***
monthFebruary    -8.3618     6.0153  -1.390    0.165    
monthMarch       -0.5347     5.8785  -0.091    0.928    
monthApril       -7.5618     5.9273  -1.276    0.202    
monthMay         -2.2961     5.8785  -0.391    0.696    
monthJune         3.5091     5.9273   0.592    0.554    
monthJuly        -4.9975     5.8785  -0.850    0.395    
monthAugust      -0.3558     5.8785  -0.061    0.952    
monthSeptember    3.7597     5.9970   0.627    0.531    
monthOctober     -2.5948     6.5724  -0.395    0.693    
monthNovember   -10.5670     6.6378  -1.592    0.112    
monthDecember    -6.9064     6.5724  -1.051    0.294    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

Residual standard error: 40.09 on 988 degrees of freedom
Multiple R-squared: 0.01173,    Adjusted R-squared: 0.0007317 
F-statistic: 1.066 on 11 and 988 DF,  p-value: 0.3854

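In the above, note that January is the first level, so its mean is the (Intercept) estimate, and the other estimates are deviations from the January mean. An alternative parameterisation of the model suppresses the intercept: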

> mod2 <- lm(sales ~ month - 1, data = dat)
> summary(mod2)

Call:
lm(formula = sales ~ month - 1, data = dat)

Residuals:
     Min       1Q   Median       3Q      Max 
-140.333  -24.551    0.108   28.102  134.349 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
monthJanuary   1001.703      4.157   241.0   <2e-16 ***
monthFebruary   993.342      4.348   228.5   <2e-16 ***
monthMarch     1001.169      4.157   240.9   <2e-16 ***
monthApril      994.142      4.225   235.3   <2e-16 ***
monthMay        999.407      4.157   240.4   <2e-16 ***
monthJune      1005.213      4.225   237.9   <2e-16 ***
monthJuly       996.706      4.157   239.8   <2e-16 ***
monthAugust    1001.348      4.157   240.9   <2e-16 ***
monthSeptember 1005.463      4.323   232.6   <2e-16 ***
monthOctober    999.109      5.091   196.3   <2e-16 ***
monthNovember   991.136      5.175   191.5   <2e-16 ***
monthDecember   994.797      5.091   195.4   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

Residual standard error: 40.09 on 988 degrees of freedom
Multiple R-squared: 0.9984, Adjusted R-squared: 0.9984 
F-statistic: 5.175e+04 on 12 and 988 DF,  p-value: < 2.2e-16

Now the estimates are the monthly means, and each t-test is of the hypothesis that the individual monthly mean is zero (0).
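The relationship between the two parameterisations can be checked numerically. The sketch below rebuilds the simulated data and compares the coefficients with plain group means. As an assumption for portability, it derives the month from the numeric "%m" field rather than "%B", so it does not depend on the locale. `means` is a helper name introduced here.

```r
set.seed(42)
dat <- data.frame(sales = rnorm(1000, mean = 1000, sd = 40),
                  dates = as.Date(seq(from = 14610, to = 15609),
                                  origin = "1970-01-01"))
# Locale-independent month factor with January as the first level.
dat$month <- factor(month.name[as.integer(format(dat$dates, "%m"))],
                    levels = month.name)

mod  <- lm(sales ~ month, data = dat)      # intercept = January mean
mod2 <- lm(sales ~ month - 1, data = dat)  # one mean per month

# Raw monthly means as a named numeric vector.
means <- vapply(split(dat$sales, dat$month), mean, numeric(1))

# No-intercept model: coefficients are the monthly means.
all.equal(unname(coef(mod2)), unname(means))
# Intercept model: intercept is the January mean,
# the remaining coefficients are differences from it.
all.equal(unname(coef(mod)[1]), means[["January"]])
all.equal(unname(coef(mod)[-1]), unname(means[-1] - means[["January"]]))
```

Both models give identical fitted values and residuals; only the meaning of the coefficients, and therefore of the t-tests, differs.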

Answer 2

Create a months variable that is a factor, and R will create nice names automatically.

sales <- rnorm(1000, mean = 1000, sd = 40)
dates <- as.Date(seq(from = 14610, to = 15609),origin='1970-01-01')
data <- data.frame(sales, dates)
data$months=as.factor(months(dates))

model <- lm(sales ~ months,data=data)
summary(model)

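It automatically picks April as the reference month (April is the first level alphabetically), but you can change this using contrasts, or by changing the reference level with relevel().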

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)     1001.3989     4.2880 233.535   <2e-16 ***
monthsAugust       6.8982     6.0150   1.147   0.2517    
monthsDecember    -6.0561     6.7140  -0.902   0.3673    
monthsFebruary    -1.3977     6.1527  -0.227   0.8203    
monthsJanuary     -3.2086     6.0150  -0.533   0.5939    
monthsJuly       -10.0742     6.0150  -1.675   0.0943 .  
monthsJune        -3.3393     6.0641  -0.551   0.5820    
monthsMarch        0.3159     6.0150   0.053   0.9581    
monthsMay         -0.1448     6.0150  -0.024   0.9808    
monthsNovember     3.4901     6.7799   0.515   0.6068    
monthsOctober      3.2082     6.7140   0.478   0.6329    
monthsSeptember   -7.3039     6.1343  -1.191   0.2341
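One way to change the reference month, sketched here with relevel() from the stats package rather than by editing the contrasts directly. The seed is arbitrary, and the month factor is built from the numeric "%m" field so the sketch behaves the same in any locale (an assumption that differs slightly from the months() call above).

```r
set.seed(1)
dates <- as.Date(seq(from = 14610, to = 15609), origin = "1970-01-01")
data  <- data.frame(sales = rnorm(1000, mean = 1000, sd = 40), dates)

# Default factor levels are sorted alphabetically, so April comes first.
data$months <- factor(month.name[as.integer(format(data$dates, "%m"))])
levels(data$months)[1]  # "April"

# Make January the reference level instead.
data$months <- relevel(data$months, ref = "January")

model <- lm(sales ~ months, data = data)
names(coef(model))      # no "monthsJanuary": it is absorbed into the intercept
```

Note that relevel() only moves January to the front; the remaining levels stay in alphabetical order. Supplying levels = month.name to factor() gives full calendar order.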

R regression with months as independent variables (labels)