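I have data (I'll provide the head of the data) and want to fit a GLM and evaluate it on "accuracy". RStudio keeps freezing when I run the final line of my code, the GLM. I don't know what to do, I'm completely stuck.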
ContractNr Year ValidFrom ValidThru Exposure EarnedPremium
1 3006024 2013 1.1.2013 31.3.2013 0,246575342 53,79877695
2 3006024 2013 1.4.2013 22.4.2013 0,060273973 13,48774798
3 3012819 2013 1.1.2013 31.12.2013 1 367,0053327
4 3012819 2014 1.1.2014 31.12.2014 1 367,0053327
5 3012819 2015 1.1.2015 26.4.2015 0,317808219 116,6373112
6 3014874 2013 1.1.2013 28.2.2013 0,161643836 57,71979747
YearlyNetPremium ClaimNr ClaimDate ClaimYear NClaims Incurred
1 218,1839288 NA NA 0 0
2 223,7740007 NA NA 0 0
3 367,0053327 61861914012 21.8.2013 2013 1 1390,86693
4 367,0053327 NA NA 0 0
5 367,0053327 NA NA 0 0
6 357,080103 NA NA 0 0
Payments Reserve County ConstrYear EngPerfKW Weight BonusMalus Age Gender
1 0 0 GM 1999 40 975 0 51 female
2 0 0 GM 1999 40 975 0 51 female
3 1390,86693 0 L 2003 132 1834 -1 58 female
4 0 0 L 2003 132 1834 -1 59 female
5 0 0 L 2003 132 1834 -1 60 female
6 0 0 PE 2004 55 1318 0 79 male
ClaimReason Make Telematics CarAge G_EngPerfKW G_Weight G_Age
1 NA Renault 0 16 25 500 50
2 NA Renault 0 16 25 500 50
3 1 Audi 0 12 125 1500 50
4 NA Audi 0 12 125 1500 50
5 NA Audi 0 12 125 1500 60
6 NA Opel 0 11 50 1000 70
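What I want to do is make "NClaims", which is the weight, binary, and then fit a GLM. I tried something similar with machine learning (train/test data) and it worked.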
library(caret)
library(FSelector)
set.seed(42)
dataset<-read.csv(file.choose(),header=T,sep=";")
str(dataset)
dataset$NClaims[is.na(dataset$NClaims)]<-names(which.max(table(dataset$NClaims)))
dataset$ClaimReason<-NULL
dataset$ClaimNr<-NULL
dataset$ClaimDate<-NULL
dataset$ClaimYear<-NULL
dataset$Incurred<-NULL
dataset$Payments<-NULL
dataset$Reserve<-NULL
colSums(is.na(dataset))
dataset$ValidFrom<-NULL
dataset$ValidThru<-NULL
dataset$County<-NULL
dataset$Gender<-NULL
dataset$Make<-NULL
weights_info_gain<-information.gain(NClaims ~ ., data=dataset)
weights_info_gain
weights_gain_ratio = gain.ratio(NClaims ~ ., data=dataset)
weights_gain_ratio
most_important_attributes <- cutoff.k(weights_gain_ratio, 20)
most_important_attributes
formula_with_most_important_attributes <- as.simple.formula(most_important_attributes, "NClaims")
formula_with_most_important_attributes
fitCtrl = trainControl(method="repeatedcv", number=5, repeats=3)
modelGLM = train(formula_with_most_important_attributes, data=dataset, method="glm", trControl=fitCtrl, metric="Accuracy",na.action = na.pass)
I have dropped the dates and the columns I'm not sure GLM would accept (like "Make", or simply the ones that are not numeric). Thanks for your help!!
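One detail in the code above that may be worth checking is the imputation of `NClaims`: `names()` returns a character string, so assigning it into a numeric column coerces the whole column to character. The sketch below illustrates this on a hypothetical vector (the values are made up, not taken from the data set) and shows one way to end up with a two-level factor as the binary outcome. Whether this is related to the freeze is not established here.

```r
# Hypothetical NClaims column with one missing value (illustrative only)
nclaims <- c(0, 1, NA, 0, 2)

# Same pattern as in the code above: names() yields a character string,
# so the assignment coerces the whole vector to character
imputed_chr <- nclaims
imputed_chr[is.na(imputed_chr)] <- names(which.max(table(imputed_chr)))
class(imputed_chr)

# Converting the most frequent value back to numeric keeps the column numeric
imputed_num <- nclaims
imputed_num[is.na(imputed_num)] <- as.numeric(names(which.max(table(imputed_num))))

# Binary outcome as a two-level factor: "yes" if at least one claim
claim <- factor(ifelse(imputed_num > 0, "yes", "no"), levels = c("no", "yes"))
```

A factor outcome is also what a classification metric such as accuracy is normally computed on.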
Answer 0 (score: 0)
I tried re-sampling the data from the question to create a larger data set to answer the question. The code ran fine on the sample data set after the following changes:
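- Replaced the decimal commas with dots ('.') in columns such as 'YearlyNetPremium', 'Exposure', 'EarnedPremium', 'Incurred', 'Payments', etc.
- Added NA's to blank rows (these are set to NULL later)
- Converted strings to date or year
With these changes, the code seems to work for the sample data above.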
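The clean-up described in this answer (decimal commas to dots, blanks to NA, date strings to dates or years) could look roughly like the following. This is a minimal base R sketch on a hypothetical three-row extract, not the answer's original code; the column selection and the derived `ValidYear` column are assumptions made for illustration.

```r
# Hypothetical extract mimicking the raw CSV values from the question
raw <- data.frame(
  Exposure  = c("0,246575342", "1", "0,317808219"),
  Incurred  = c("0", "1390,86693", "0"),
  ClaimDate = c("", "21.8.2013", ""),
  ValidFrom = c("1.1.2013", "1.1.2013", "1.1.2015"),
  stringsAsFactors = FALSE
)

# 1. Replace decimal commas with dots, then convert to numeric
num_cols <- c("Exposure", "Incurred")
raw[num_cols] <- lapply(raw[num_cols], function(x) {
  as.numeric(sub(",", ".", x, fixed = TRUE))
})

# 2. Mark blank entries as NA
raw$ClaimDate[raw$ClaimDate == ""] <- NA

# 3. Convert day.month.year strings to Date and extract the year
raw$ValidFrom <- as.Date(raw$ValidFrom, format = "%d.%m.%Y")
raw$ClaimDate <- as.Date(raw$ClaimDate, format = "%d.%m.%Y")
raw$ValidYear <- as.integer(format(raw$ValidFrom, "%Y"))  # hypothetical helper column
```

Alternatively, passing `dec = ","` to `read.csv()` together with `sep = ";"` lets R parse the decimal commas at import time.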
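Import libraries:
library(lubridate)
library(caret)
library(FSelector)
The remaining steps create sample data by sampling the data from the question, and then run the code from the question above.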
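As a rough illustration of the re-sampling idea, the sketch below draws rows with replacement from a small sample to build a larger data set, then fits a logistic GLM and computes in-sample accuracy. It is a sketch under stated assumptions: the eight rows are hypothetical, `HasClaim` is a name introduced here, and base R `glm()` is used in place of `caret::train()` to keep the example free of extra packages.

```r
set.seed(42)

# Hypothetical, already-cleaned sample rows (values are illustrative only)
sample_rows <- data.frame(
  NClaims = c(0, 0, 1, 0, 0, 0, 1, 0),
  Age     = c(51, 51, 58, 59, 60, 79, 45, 62)
)

# Re-sample rows with replacement to build a larger data set
idx <- sample(nrow(sample_rows), size = 200, replace = TRUE)
bigger <- sample_rows[idx, ]
rownames(bigger) <- NULL

# Binary outcome: 1 if the row has at least one claim, else 0
bigger$HasClaim <- as.integer(bigger$NClaims > 0)

# Logistic GLM fitted with base R instead of caret::train()
fit <- glm(HasClaim ~ Age, family = binomial, data = bigger)

# In-sample accuracy at a 0.5 probability cut-off
predicted <- as.integer(predict(fit, type = "response") > 0.5)
accuracy <- mean(predicted == bigger$HasClaim)
```

Because re-sampling only duplicates existing rows, the accuracy obtained this way says little about how the model would perform on new contracts.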