RWeka J48 classification problem in R with the MovieLens dataset

Date: 2016-02-07 06:42:01

Tags: r weka j48 rweka

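I want to classify the demographic data in the MovieLens users table, but J48 gives strange results. When I classify the same data with C5.0 everything works fine, but I am required to use this algorithm (J48).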

My data structure looks like this:

 $ user_id   : int  1 2 3 4 5 6 7 8 9 10 ...
 $ age       : Factor w/ 7 levels "1","18","25",..: 1 7 3 5 3 6 4 3 3 4 ...
 $ occupation: Factor w/ 21 levels "0","1","2","3",..: 11 17 16 8 21 10 2 13 18 2 ...
 $ gender    : Factor w/ 2 levels "F","M": 1 2 2 2 2 1 2 2 2 1 ...
 $ Class     : Factor w/ 4 levels "1","2","3","4": 2 2 2 2 3 2 2 2 2 4 ...

And the head of the data:

head(data)
  user_id age occupation gender Class
1       1   1         10      F     2
2       2  56         16      M     2
3       3  25         15      M     2
4       4  45          7      M     2
5       5  25         20      M     3
6       6  50          9      F     2

All columns except user_id are nominal and should be factors in R.

Classification code:

library(RWeka)
fit <- J48(data$Class~., data=data[,-c(1)], control = Weka_control(C=0.25))
currentUserClass = predict(fit,data[,-c(1)])
table(currentUserClass , data$Class)

And the incorrect confusion table (rows are predictions, columns are actual classes):

currentUserClass    1    2    3    4
               1    0    0    0    0
               2  216 3630 1549  645
               3    0    0    0    0
               4    0    0    0    0
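
To put a number on the table above: a model that assigns every user to class 2 scores exactly the majority-class baseline. The sketch below re-enters the counts from the table by hand (the names `cm`, `accuracy`, and `baseline` are just illustrative) and uses base R only.

```r
# Confusion table from the J48 run above: rows = predicted, columns = actual
cm <- matrix(c(  0,    0,    0,   0,
               216, 3630, 1549, 645,
                 0,    0,    0,   0,
                 0,    0,    0,   0),
             nrow = 4, byrow = TRUE,
             dimnames = list(predicted = as.character(1:4),
                             actual    = as.character(1:4)))

accuracy <- sum(diag(cm)) / sum(cm)      # share of users classified correctly
baseline <- max(colSums(cm)) / sum(cm)   # always guessing the largest class

round(c(accuracy = accuracy, baseline = baseline), 3)
```

Both values come out at about 0.601 (3630 of 6040 users), so this tree carries no more information than "always predict class 2".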

When I fit the model with C5.0, changing nothing except the algorithm, the result looks like this:

predictions    1    2    3    4
          1  216    0    0    0
          2    0 3630    0    0
          3    0    0 1549    0
          4    0    0    0  645

More attempts:

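  1. I changed the structure of the data and converted my factor columns into separate (dummy) columns; nothing changed.
  2. I changed the value of the C control parameter; the result improved somewhat at C=0.75, but it is still completely wrong.
  3. Even after standardizing the data, nothing really changed: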

    > head(data)
      user_id       age1      age18      age25      age35      age45      age50
    1       1  5.1188737 -0.4726289 -0.7289391 -0.4960755 -0.3164894 -0.2990841
    2       2 -0.1953231 -0.4726289 -0.7289391 -0.4960755 -0.3164894 -0.2990841
    3       3 -0.1953231 -0.4726289  1.3716296 -0.4960755 -0.3164894 -0.2990841
    4       4 -0.1953231 -0.4726289 -0.7289391 -0.4960755  3.1591400 -0.2990841
    5       5 -0.1953231 -0.4726289  1.3716296 -0.4960755 -0.3164894 -0.2990841
    6       6 -0.1953231 -0.4726289 -0.7289391 -0.4960755 -0.3164894  3.3429880
           age56 occupation1 occupation2 occupation3 occupation4 occupation5
    1 -0.2590882  -0.3094756  -0.2150398  -0.1717035  -0.3790765  -0.1374418
    2  3.8590505  -0.3094756  -0.2150398  -0.1717035  -0.3790765  -0.1374418
    3 -0.2590882  -0.3094756  -0.2150398  -0.1717035  -0.3790765  -0.1374418
    4 -0.2590882  -0.3094756  -0.2150398  -0.1717035  -0.3790765  -0.1374418
    5 -0.2590882  -0.3094756  -0.2150398  -0.1717035  -0.3790765  -0.1374418
    6 -0.2590882  -0.3094756  -0.2150398  -0.1717035  -0.3790765  -0.1374418
      occupation6 occupation7 occupation8 occupation9 occupation10 occupation11
    1  -0.2016306  -0.3558574 -0.05312294  -0.1243576    5.4744311   -0.1477163
    2  -0.2016306  -0.3558574 -0.05312294  -0.1243576   -0.1826371   -0.1477163
    3  -0.2016306  -0.3558574 -0.05312294  -0.1243576   -0.1826371   -0.1477163
    4  -0.2016306   2.8096490 -0.05312294  -0.1243576   -0.1826371   -0.1477163
    5  -0.2016306  -0.3558574 -0.05312294  -0.1243576   -0.1826371   -0.1477163
    6  -0.2016306  -0.3558574 -0.05312294   8.0399919   -0.1826371   -0.1477163
      occupation12 occupation13 occupation14 occupation15 occupation16 occupation17
    1   -0.2619865   -0.1551514   -0.2293967   -0.1562667   -0.2038431   -0.3010506
    2   -0.2619865   -0.1551514   -0.2293967   -0.1562667    4.9049217   -0.3010506
    3   -0.2619865   -0.1551514   -0.2293967    6.3982549   -0.2038431   -0.3010506
    4   -0.2619865   -0.1551514   -0.2293967   -0.1562667   -0.2038431   -0.3010506
    5   -0.2619865   -0.1551514   -0.2293967   -0.1562667   -0.2038431   -0.3010506
    6   -0.2619865   -0.1551514   -0.2293967   -0.1562667   -0.2038431   -0.3010506
      occupation18 occupation19 occupation20    genderM Class
    1   -0.1082744   -0.1098287   -0.2208735 -1.5917949     2
    2   -0.1082744   -0.1098287   -0.2208735  0.6281176     2
    3   -0.1082744   -0.1098287   -0.2208735  0.6281176     2
    4   -0.1082744   -0.1098287   -0.2208735  0.6281176     2
    5   -0.1082744   -0.1098287    4.5267283  0.6281176     3
    6   -0.1082744   -0.1098287   -0.2208735 -1.5917949     2
    > fit <- J48(data$Class~., data=data, control = Weka_control(C=0.25))
    > currentUserClass = predict(fit,data)
    > table(currentUserClass , data$Class)
    
    currentUserClass    1    2    3    4
                   1    7    1    2    2
                   2  201 3601 1470  617
                   3    8   28   75   14
                   4    0    0    2   12
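
A per-class view of this last table makes the problem easier to see. The sketch below copies the counts from the output above (the names `cm2` and `recall` are illustrative) and computes, for each actual class, the share of users that were predicted correctly.

```r
# Confusion table after dummy coding and standardizing:
# rows = predicted, columns = actual
cm2 <- matrix(c(  7,    1,    2,   2,
                201, 3601, 1470, 617,
                  8,   28,   75,  14,
                  0,    0,    2,  12),
              nrow = 4, byrow = TRUE,
              dimnames = list(predicted = as.character(1:4),
                              actual    = as.character(1:4)))

recall  <- diag(cm2) / colSums(cm2)     # correct predictions per actual class
overall <- sum(diag(cm2)) / sum(cm2)    # overall training accuracy

round(recall, 3)
round(overall, 3)
```

Class 2 is recovered almost completely while the other three classes are recovered for only a few percent of their users, so the overall accuracy stays close to the majority-class share.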
    

1 answer:

Answer 0 (score: 0)

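J48 implements the C4.5 decision tree algorithm. C5.0 and C4.5 can perform differently. That said, the parameters of J48 in Weka can be modified (as you did in your code above). Perhaps this helps to fit your needs.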

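First, your tree is probably a single leaf that predicts class 2. You can check this by printing the decision tree. The following code uses the "mtcars" dataset (a built-in dataset in R).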

library(RWeka)

dat <- mtcars
dat$carb <- factor(dat$carb)  # J48 needs a nominal (factor) class
model1 <- J48(carb ~ ., data = dat)
model1  # print the fitted tree

However, if you rebuild the tree without pruning and with a smaller minimum number of objects per leaf, the tree will be larger.

model2 <- J48(carb ~., data = dat, control= Weka_control(M=1,U=TRUE))
model2
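
The same comparison can be sketched on data shaped like the users table in the question. The data frame `toy` below is a hypothetical, randomly generated stand-in (nominal predictors, imbalanced class), not the real MovieLens data, and `acc` is just a helper defined here; the point is only to show how to put the default tree and the `M=1, U=TRUE` tree side by side.

```r
library(RWeka)

set.seed(1)
n <- 500
# Hypothetical stand-in for the users table: all predictors nominal, class imbalanced
toy <- data.frame(
  age        = factor(sample(c("1", "18", "25", "35", "45", "50", "56"), n, replace = TRUE)),
  occupation = factor(sample(0:20, n, replace = TRUE)),
  gender     = factor(sample(c("F", "M"), n, replace = TRUE)),
  Class      = factor(sample(1:4, n, replace = TRUE, prob = c(0.04, 0.60, 0.26, 0.10)))
)

pruned   <- J48(Class ~ ., data = toy)                                           # defaults
unpruned <- J48(Class ~ ., data = toy, control = Weka_control(M = 1, U = TRUE))  # no pruning

# Training accuracy of a fitted model on the toy data
acc <- function(model) mean(as.character(predict(model, toy)) == as.character(toy$Class))

c(pruned = acc(pruned), unpruned = acc(unpruned))
```

Printing `pruned` and `unpruned` shows the two trees themselves. Keep in mind that a higher accuracy on the training data from the unpruned tree says nothing about how well it generalizes.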

The following can be used to check the valid parameters of J48:

WOW(J48)

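You should change J48's default parameters to meet your specific needs. I suggest comparing the parameters used in C5.0 with J48's defaults and modifying them where necessary.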
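
As a starting point for that comparison, here is a minimal sketch. It assumes the C5.0 model came from the C50 package with default settings, which the question does not state. It reads the pruning confidence and the minimum leaf size from `C5.0Control()` and passes the same values to J48 under its own option names (`C` and `M`).

```r
library(C50)
library(RWeka)

c5 <- C5.0Control()   # C50's default control settings
c5$CF                 # pruning confidence factor
c5$minCases           # minimum number of cases per leaf

# The same two values expressed with J48's option names
j48_ctrl <- Weka_control(C = c5$CF, M = c5$minCases)
j48_ctrl
```

These two settings already match J48's defaults (0.25 and 2), so a difference between the two models would have to come from somewhere else, for example other C5.0 options such as `subset`, or the way the formula and data were passed in.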