I am making predictions with Naive Bayes on a very diverse dataset. The output of str() looks like this:
'data.frame': 1244 obs. of 24 variables:
$ Opportunity.ID : chr "006D000000YuMhG" "0065700000xKQDI" "0065700000xp0Tq" "0065700000xpxs3" ...
$ Stage : Factor w/ 2 levels "Closed Lost",..: 1 1 2 2 2 2 2 2 1 1 ...
$ Opportunity.Owner : Factor w/ 26 levels "ABA","ALE","BAD",..: 19 7 19 1 17 11 1 7 11 13 ...
$ Solution.Type : Factor w/ 4 levels "","Hybrid","MCS",..: 4 3 4 4 4 4 4 4 4 4 ...
$ New.Business : Factor w/ 2 levels "0","1": 2 1 1 1 1 2 1 1 2 1 ...
$ Delivery.Countries : Factor w/ 5 levels "India","Netherlands",..: 2 2 2 2 2 2 2 2 2 2 ...
$ Presales.Owner : Factor w/ 23 levels "","ABA","AVR",..: 1 1 1 1 9 1 1 1 1 1 ...
$ Age : int 2604 36 13 2 30 71 1 0 11 396 ...
$ Days.in.current.stage : int 425 425 427 429 428 427 427 426 422 415 ...
$ Days.in.previous.stage : int 0 0 0 0 0 0 0 0 0 0 ...
$ Previous.stage : Factor w/ 10 levels "","Advise and Design Solution",..: 5 5 8 8 8 8 5 8 5 5 ...
$ Industry : Factor w/ 39 levels "Agriculture",..: 12 38 26 22 5 26 6 39 7 6 ...
$ Account.Created.Date : Date, format: "2014-04-30" "2014-04-30" "2014-04-30" "2014-04-30" ...
$ Total.Revenue.Converted : num 0 0 1705 -3360 27596 ...
$ Account.Type : Factor w/ 7 levels "","Customer",..: 6 2 2 2 2 2 2 5 6 2 ...
$ Billing.City : chr "hengelo" "gouda" "appingedam" "rotterdam" ...
$ Shipping.City : chr "hengelo" "gouda" "appingedam" "rotterdam" ...
$ Total.Opportunities : int 2 21 84 36 27 5 5 19 2 25 ...
$ Won.Opportunity.Count : int 0 14 55 26 18 3 5 18 0 15 ...
$ Number.Live.Opportunities: int 0 1 1 4 4 0 0 0 0 2 ...
$ First.Order.Date : Date, format: NA "2013-12-12" "2012-02-10" "2013-12-06" ...
$ Last.Order.Date : Date, format: NA "2020-02-04" "2020-02-27" "2019-08-23" ...
$ Primary.Campaign.Source : Factor w/ 114 levels "","600 Minutes Public IT 2012",..: 4 1 1 1 1 1 1 1 1 1 ...
$ First.Campaign.Touch : Factor w/ 81 levels "","600 min.Executive IT'11",..: 1 1 1 1 1 1 1 1 1 1 ...
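Although this is only the first prototype of the program, everything works fine, and the model is already quite accurate at predicting Stage. There is just one thing I can't get past...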
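How do I determine which variables have the greatest influence on the actual prediction? I have searched a lot on this topic, but most of the examples I have seen are either based on a different algorithm or on purely numeric datasets. How can I identify the important variables in this mixed dataset?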