R

时间:2015-05-20 18:25:41

标签: r decision-tree

我正在尝试分析马拉松数据。我构建了一个简单的模型并创建了一个决策树:

fit <- rpart(timeCategory ~ country + age.group + participated.times, data=data)

我的目标是创建一个用于预测结果的通用公式like in this article (page 4)enter image description here

我怎样才能在R中使用哪些技巧?因此,我希望将一个提供属性的公式作为刺痛。

数据:我使用的一些真实数据可能是downloaded here。 读取数据如下:

data = read.table("data/processedData.txt", header=T)
data$timeCategory <- ntile(data$time, 10)

1 个答案:

答案 0 :(得分:1)

这些是使用时间作为连续值的回归系数,这是在示例中提供的预测类型。它们可用于构建您要求的公式类型。

> lmfit <- lm(time ~ country + age.group + particip.time, data=data)
> lmfit

Call:
lm(formula = time ~ country + age.group + particip.time, data = data)

Coefficients:
      (Intercept)      countryJõgeva  countryLääne-Viru        countryLäti  
         9526.702            345.930            122.513            -73.239  
     countryLeedu       countryPärnu       countryRapla    countrySaaremaa  
          120.592            -78.086           -208.882            114.292  
   countryTallinn       countryTartu    countryViljandi       age.groupM20  
          -37.536             55.771            -70.417           -142.600  
     age.groupM21       age.groupM35       age.groupM40       age.groupM45  
         -218.225           -218.067            -20.108           -196.331  
     age.groupM50      particip.time  
           88.342             -2.487  

如果你想让他们排成一行,那么:

> as.matrix(coef(lmfit))
                         [,1]
(Intercept)       9526.702146
countryJõgeva      345.930334
countryLääne-Viru  122.513294
countryLäti        -73.239333
countryLeedu       120.591585
countryPärnu       -78.086107
countryRapla      -208.882244
countrySaaremaa    114.291592
countryTallinn     -37.535659
countryTartu        55.771326
countryViljandi    -70.416659
age.groupM20      -142.599598
age.groupM21      -218.224754
age.groupM35      -218.066655
age.groupM40       -20.108242
age.groupM45      -196.331263
age.groupM50        88.341978
particip.time       -2.486818

进一步处理文字:

> form <- as.matrix(coef(lmfit))
> rownames(form) <- gsub("try", "try == ", rownames(form) )
> rownames(form) <- gsub("oup", "oup == ", rownames(form) )
> form
                             [,1]
(Intercept)           9526.702146
country == Jõgeva      345.930334
country == Lääne-Viru  122.513294
country == Läti        -73.239333
country == Leedu       120.591585
country == Pärnu       -78.086107
country == Rapla      -208.882244
country == Saaremaa    114.291592
country == Tallinn     -37.535659
country == Tartu        55.771326
country == Viljandi    -70.416659
age.group == M20      -142.599598
age.group == M21      -218.224754
age.group == M35      -218.066655
age.group == M40       -20.108242
age.group == M45      -196.331263
age.group == M50        88.341978
particip.time           -2.486818

几乎完成:

cat(paste( form, paste0("(", rownames(form), ")" ), sep="*", collapse="+\n") )

9526.70214596473*((Intercept))+
345.93033373724*(country == Jõgeva)+
122.51329418344*(country == Lääne-Viru)+
-73.2393326763322*(country == Läti)+
120.591584530399*(country == Leedu)+
-78.0861070429056*(country == Pärnu)+
-208.882244416016*(country == Rapla)+
114.291592299937*(country == Saaremaa)+
-37.5356589458207*(country == Tallinn)+
55.771326363022*(country == Tartu)+
-70.4166587941724*(country == Viljandi)+
-142.599598141679*(age.group == M20)+
-218.224754448193*(age.group == M21)+
-218.066655292225*(age.group == M35)+
-20.1082422022072*(age.group == M40)+
-196.33126335145*(age.group == M45)+
88.3419781798024*(age.group == M50)+
-2.48681789339678*(particip.time)