I want to predict sales using linear regression. This is the data table I use for modelling.
> store
Store Sales CompetitionDistance CompetitionOpenSinceMonth CompetitionOpenSinceYear Promo2 Promo2SinceWeek Promo2SinceYear Assortment_a
1: 3 8314 14130 12 2006 1 14 2011 1
2: 3 8977 14130 12 2006 1 14 2011 1
3: 3 7610 14130 12 2006 1 14 2011 1
4: 3 8864 14130 12 2006 1 14 2011 1
5: 3 8107 14130 12 2006 1 14 2011 1
---
775: 3 12247 14130 12 2006 1 14 2011 1
776: 3 4523 14130 12 2006 1 14 2011 1
777: 3 6069 14130 12 2006 1 14 2011 1
778: 3 5902 14130 12 2006 1 14 2011 1
779: 3 6823 14130 12 2006 1 14 2011 1
Assortment_b Assortment_c StoreType_a StoreType_b StoreType_c StoreType_d DayOfWeek Open Promo SchoolHoliday DateYear DateMonth
1: 0 0 1 0 0 0 5 1 1 1 2015 7
2: 0 0 1 0 0 0 4 1 1 1 2015 7
3: 0 0 1 0 0 0 3 1 1 1 2015 7
4: 0 0 1 0 0 0 2 1 1 1 2015 7
5: 0 0 1 0 0 0 1 1 1 1 2015 7
---
775: 0 0 1 0 0 0 1 1 1 0 2013 1
776: 0 0 1 0 0 0 6 1 0 0 2013 1
777: 0 0 1 0 0 0 5 1 0 1 2013 1
778: 0 0 1 0 0 0 4 1 0 1 2013 1
779: 0 0 1 0 0 0 3 1 0 1 2013 1
DateDay DateWeek StateHoliday_0 StateHoliday_a StateHoliday_b StateHoliday_c CompetitionOpen PromoOpen IspromoinSales Prediction
1: 31 30 1 0 0 0 103 52.00 1 0
2: 30 30 1 0 0 0 103 52.00 1 0
3: 29 30 1 0 0 0 103 52.00 1 0
4: 28 30 1 0 0 0 103 52.00 1 0
5: 27 30 1 0 0 0 103 52.00 1 0
---
775: 7 1 1 0 0 0 73 20.75 1 0
776: 5 0 1 0 0 0 73 20.50 1 0
777: 4 0 1 0 0 0 73 20.50 1 0
778: 3 0 1 0 0 0 73 20.50 1 0
779: 2 0 1 0 0 0 73 20.50 1 0
>
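because I get the error
contrasts can be applied only to factors with 2 or more levels
I applied what @Scott suggested here, since I don't have any NA values.
I need to know which columns should be converted to factor variables in the model.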
> lapply(store, function(x) ifelse(is.factor(x) | is.integer(x), levels(factor(x)), "numeric"))
$Store
[1] "3"
$Sales
[1] "numeric"
$CompetitionDistance
[1] "14130"
$CompetitionOpenSinceMonth
[1] "12"
$CompetitionOpenSinceYear
[1] "2006"
$Promo2
[1] "1"
$Promo2SinceWeek
[1] "14"
$Promo2SinceYear
[1] "2011"
$Assortment_a
[1] "1"
$Assortment_b
[1] "0"
$Assortment_c
[1] "0"
$StoreType_a
[1] "1"
$StoreType_b
[1] "0"
$StoreType_c
[1] "0"
$StoreType_d
[1] "0"
$DayOfWeek
[1] "1"
$Open
[1] "1"
$Promo
[1] "0"
$SchoolHoliday
[1] "0"
$DateYear
[1] "numeric"
$DateMonth
[1] "numeric"
$DateDay
[1] "numeric"
$DateWeek
[1] "numeric"
$StateHoliday_0
[1] "1"
$StateHoliday_a
[1] "0"
$StateHoliday_b
[1] "0"
$StateHoliday_c
[1] "0"
$CompetitionOpen
[1] "numeric"
$PromoOpen
[1] "numeric"
$IspromoinSales
[1] "numeric"
$Prediction
[1] "numeric"
My model then looks like this. Just look at the lm call: how should I write it?
M<-matrix(0,nrow=10,ncol = 1)
store <- data[Store == 3,] # select one store by its unique number
shuffledIndices <- sample(nrow(store)) # shuffle the row indices to randomise the order
setDT(store)[,Prediction:=0]
z <- nrow(store)
for (i in 1:10)
{ # 10-fold cross-validation
sampleIndex <- floor(1+0.1*(i-1)*z):(0.1*i*z) # 10% of the data set is selected
test <- store[shuffledIndices[sampleIndex],] # it is used as the test set
train <- store[shuffledIndices[-sampleIndex],] # it is used as the training set
modell <- lm(Sales ~ as.factor(CompetitionDistance) + as.factor(CompetitionOpenSinceMonth) + as.factor(CompetitionOpenSinceYear) +
as.factor(Promo2)+as.factor(Promo2SinceWeek)+as.factor(Promo2SinceYear)+as.factor(Assortment_a)+as.factor(Assortment_b)+as.factor(Assortment_c)+
as.factor(StoreType_a)+as.factor(StoreType_b)+as.factor(StoreType_c)+as.factor(StoreType_d)+as.factor(DayOfWeek)+as.factor(Open)+SchoolHoliday+
as.factor(Promo)+as.factor(StateHoliday_0)+as.factor(StateHoliday_a)+as.factor(StateHoliday_b)+as.factor(StateHoliday_c)+
as.factor(DateYear)+as.factor(DateMonth)+as.factor(DateDay)+as.factor(DateWeek)+as.factor(CompetitionOpen)+as.factor(PromoOpen)+as.factor(IspromoinSales),train) # a linear model is fitted to the training set
store[shuffledIndices[sampleIndex],Prediction:=predict(modell,test)] # predictions are generated for the test set based on the model
M[i,1]<-(round(sqrt(mean((store$Prediction[shuffledIndices[sampleIndex]]-test$Sales)^2))/mean(test$Sales),4)) # relative RMSE on the held-out fold only
}
plot(1:10,M[,1],type='b',xlab="i",ylab="rmse%")
But I always get the error, which is really strange. How do you explain this? Thanks in advance.
Answer 0 (score: 2)
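The problem is that your model contains constant variables. For store 3, columns such as CompetitionDistance, Promo2 and StoreType_a take a single value, so as.factor() turns each of them into a factor with only one level, and lm cannot build contrasts for a one-level factor. These variables add no information and should be excluded from the modelling process. Why? You want to model Sales given all the other variables. A variable that never changes cannot explain any of the variation in Sales.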
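The point above can be checked on a small example. This is only a sketch: the data frame `toy` and its values are made up to stand in for `store`, and `n_distinct_values` and `constant_cols` are hypothetical names. It first reproduces the error with a one-level factor, then lists the columns that have fewer than two distinct values.

```r
# Toy data standing in for `store` (hypothetical values):
# Store is constant, DayOfWeek and Sales vary.
toy <- data.frame(
  Sales     = c(8314, 8977, 7610, 8864, 8107, 6823),
  Store     = rep(3L, 6),
  DayOfWeek = c(5L, 4L, 3L, 2L, 1L, 3L)
)

# A constant column wrapped in as.factor() becomes a one-level factor,
# and lm() stops because it cannot build contrasts for it.
fit_error <- tryCatch(
  lm(Sales ~ as.factor(Store) + as.factor(DayOfWeek), data = toy),
  error = function(e) conditionMessage(e)
)
print(fit_error)

# Count the distinct values in every column; columns with fewer than
# two distinct values are constant and should stay out of the formula.
n_distinct_values <- vapply(toy, function(x) length(unique(x)), integer(1))
constant_cols <- names(n_distinct_values)[n_distinct_values < 2]
print(constant_cols)
```

Running the same `vapply` call on `store` should flag the columns to drop. Unlike the `ifelse` call in the question, which returns only the first level of each column because its test has length one, this counts every distinct value.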
If you modify the model as follows, the code should work:
modell <- lm(Sales ~ as.factor(DayOfWeek) + SchoolHoliday + as.factor(Promo) +
as.factor(DateYear) + as.factor(DateMonth) + as.factor(DateDay) +
as.factor(DateWeek) + as.factor(CompetitionOpen) + as.factor(PromoOpen),
data = train)
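Instead of typing the formula by hand, the non-constant columns can be collected programmatically. The following is a minimal sketch under assumptions: `toy_train` and its values are invented, and wrapping every remaining predictor in `as.factor()` simply mirrors the formula above rather than being a recommendation.

```r
# Toy training data (hypothetical values): Store and Promo2 are constant.
toy_train <- data.frame(
  Sales     = c(8314, 8977, 7610, 8864, 8107, 6823, 5902, 6069),
  Store     = rep(3L, 8),
  Promo2    = rep(1L, 8),
  DayOfWeek = c(5L, 4L, 3L, 2L, 1L, 3L, 4L, 5L),
  Promo     = c(1L, 1L, 1L, 1L, 1L, 0L, 0L, 0L)
)

# Keep only the predictors that take at least two distinct values.
varies  <- vapply(toy_train, function(x) length(unique(x)) > 1, logical(1))
varying <- setdiff(names(varies)[varies], "Sales")

# Build Sales ~ as.factor(DayOfWeek) + as.factor(Promo) from the names.
model_formula <- reformulate(sprintf("as.factor(%s)", varying), response = "Sales")
print(model_formula)

fit <- lm(model_formula, data = toy_train)
print(coef(fit))
```

Inside a cross-validation loop it is probably safer to run the check on each `train` subset, since a column that varies in the full data can still be constant within one fold.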
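One more remark:
You are converting all variables to factors. PromoOpen, for example, appears to be a numeric variable, and it may be better to keep it numeric. This of course depends on your data and on the interpretation you want from the model.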
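To illustrate the difference, here is a small sketch on simulated data. The values, the linear relationship and the names `fit_numeric` and `fit_factor` are all assumptions made for the example. As a number, PromoOpen costs one slope coefficient. As a factor, it costs one coefficient per distinct value, and prediction fails for a value that was not seen during training.

```r
set.seed(1)

# Simulated data (hypothetical): 10 distinct PromoOpen values, 3 rows each.
toy <- data.frame(PromoOpen = rep(seq(20.5, 52, by = 3.5), each = 3))
toy$Sales <- 5000 + 40 * toy$PromoOpen + rnorm(nrow(toy), sd = 200)

fit_numeric <- lm(Sales ~ PromoOpen, data = toy)             # intercept + slope
fit_factor  <- lm(Sales ~ as.factor(PromoOpen), data = toy)  # one term per level

print(length(coef(fit_numeric)))
print(length(coef(fit_factor)))

# PromoOpen = 30 does not occur in the training data.
new_row <- data.frame(PromoOpen = 30)
pred_numeric <- predict(fit_numeric, new_row)   # interpolates along the line
pred_factor  <- tryCatch(
  predict(fit_factor, new_row),
  error = function(e) conditionMessage(e)       # unseen factor level
)
print(pred_numeric)
print(pred_factor)
```

The same unseen-level failure can appear in the cross-validation loop whenever a test fold contains a PromoOpen or CompetitionOpen value that is missing from the training fold.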