Question

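I used the OneR algorithm from the FSelector package to find the attribute with the lowest error rate. My class attribute takes the values yes/no, and my features are also yes/no.

The result of the OneR algorithm is: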

Ranking No. 1

Attribute name: OR1

Matrix (rows: class, columns: attribute characteristic):

             0       1
0 (class)    25243   0
1 (class)    1459    18

Error rate: 1459 (0 + 1459)

Ranking No. 2

Attribute name: OR2

Matrix (rows: class, columns: attribute characteristic):

             0       1
0 (class)    25243   0
1 (class)    1460    17

Error rate: 1460 (0 + 1460)

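However, if I use the correlation function on the same data frame, the best attribute it finds has a lower error rate than the attributes chosen by the oneR function.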

Attribute name: CO4

Matrix (rows: class, columns: attribute characteristic):

             0       1
0 (class)    25204   39
1 (class)    1348    129

Error rate: 1387 (39 + 1348)

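Can someone tell me why the OneR algorithm does not show the CO4 attribute as the best attribute (based on the error rate)?

Which criterion does the OneR algorithm use?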
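For reference, the textbook OneR criterion predicts the majority class for each attribute value and counts every other row as an error. The sketch below applies that criterion to the OR1 and CO4 matrices shown above. The helper name `oner_errors` is made up for illustration, and this is not necessarily what FSelector's `oneR` computes internally.

```r
# Hypothetical helper: textbook OneR error count for one attribute.
# `m` is a contingency matrix with classes in rows and attribute values in columns.
oner_errors <- function(m) {
  # For each attribute value (column), every row outside the majority class is an error.
  sum(colSums(m) - apply(m, 2, max))
}

# Counts copied from the matrices in the question (filled column by column).
or1 <- matrix(c(25243, 1459, 0, 18), nrow = 2,
              dimnames = list(class = c("0", "1"), OR1 = c("0", "1")))
co4 <- matrix(c(25204, 1348, 39, 129), nrow = 2,
              dimnames = list(class = c("0", "1"), CO4 = c("0", "1")))

errors <- c(OR1 = oner_errors(or1), CO4 = oner_errors(co4))
print(errors)
```

Under this criterion CO4 (1387 errors) would rank ahead of OR1 (1459 errors), which agrees with the manual counts above.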

--- Addition for a better understanding of my question ---

The complete data is too large to show. I built a new data pool that shows the same effect:

DELAYED   OR1   CO4   ..
1         1     1
0         0     0
0         0     1
1         0     1
0         0     0
1         0     1
0         0     0
1         0     1
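To make the snippets in this question reproducible, the table above can be typed in as a data frame. This is only a sketch: it includes just the three columns shown (the remaining columns are elided), and storing them as plain numeric 0/1 vectors is an assumption about the original column types.

```r
# The small example data pool, entered from the table above.
# Assumption: columns are numeric 0/1; the original types are not shown.
datapool_stackoverflow <- data.frame(
  DELAYED = c(1, 0, 0, 1, 0, 1, 0, 1),
  OR1     = c(1, 0, 0, 0, 0, 0, 0, 0),
  CO4     = c(1, 0, 1, 1, 0, 1, 0, 1)
)

# Quick check of the structure: 8 rows, 3 columns.
str(datapool_stackoverflow)
```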

Code to show the error rate of a single attribute:

print(table(datapool_stackoverflow$DELAYED, datapool_stackoverflow$OR1))

Code for the oneR function:

library(FSelector)

oneR_stackoverflow <- oneR(DELAYED ~ ., datapool_stackoverflow)

subset_stackoverflow <- cutoff.k(oneR_stackoverflow, 2)

print(subset_stackoverflow)

Code for the correlation:

cor(as.numeric(datapool_stackoverflow$DELAYED), as.numeric(datapool_stackoverflow$OR1))

In this case the results are:

Attribute OR1, matrix (rows: class, columns: attribute characteristic):

             0   1
0 (class)    4   0
1 (class)    3   1

Manually calculated error rate: 3 (0 + 3)

Attribute CO4, matrix (rows: class, columns: attribute characteristic):

             0   1
0 (class)    3   1
1 (class)    0   4

Manually calculated error rate: 1 (1 + 0)

Correlation: attribute OR1: 0.377, attribute CO4: 0.77

OneR: "OR1", "CO4"

Why does the oneR function rank the OR1 attribute as the best attribute for classification?

Answer 1

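OK, I have the solution. The algorithm calculates the sum of the error rates of the characteristics within an attribute (each relative to the total count for that characteristic).

In this example:

Attribute OR1: 3/7 + 0/1 = 3/7

Attribute CO4: 0/3 + 1/5 = 0.2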
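The arithmetic in this answer can be sketched as follows. The snippet only reproduces the calculation on the example data from the question; the helper name `summed_error_rate` is made up, and the code does not call or inspect FSelector itself.

```r
# Example data from the question.
DELAYED <- c(1, 0, 0, 1, 0, 1, 0, 1)
OR1     <- c(1, 0, 0, 0, 0, 0, 0, 0)
CO4     <- c(1, 0, 1, 1, 0, 1, 0, 1)

# Hypothetical helper: for each attribute value, take the share of rows that do
# not belong to the majority class, then add those shares up.
summed_error_rate <- function(class, attribute) {
  tab <- table(class, attribute)               # rows: class, columns: attribute value
  errors <- colSums(tab) - apply(tab, 2, max)  # non-majority rows per attribute value
  sum(errors / colSums(tab))
}

rate_or1 <- summed_error_rate(DELAYED, OR1)  # 3/7 + 0/1
rate_co4 <- summed_error_rate(DELAYED, CO4)  # 0/3 + 1/5
print(c(OR1 = rate_or1, CO4 = rate_co4))
```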

Answer 2

No, CO4 should be selected; choosing the other attribute is wrong. See what the OneR package (available on CRAN) gives:

> library(OneR)
> DELAYED <- c(1, 0, 0, 1, 0, 1, 0, 1)
> OR1 <- c(1, rep(0, 7))
> CO4 <- c(1, 0, 1, 1, 0, 1, 0, 1)
> 
> data <- data.frame(DELAYED, OR1, CO4)
> 
> model <- OneR(formula = DELAYED ~., data = data, verbose = T)

    Attribute Accuracy
1 * CO4       87.5%   
2   OR1       62.5%   
---
Chosen attribute due to accuracy
and ties method (if applicable): '*'

> summary(model)

Rules:
If CO4 = 0 then DELAYED = 0
If CO4 = 1 then DELAYED = 1

Accuracy:
7 of 8 instances classified correctly (87.5%)

Contingency table:
       CO4
DELAYED   0   1 Sum
    0   * 3   1   4
    1     0 * 4   4
    Sum   3   5   8
---
Maximum in each column: '*'

Pearson's Chi-squared test:
X-squared = 2.1333, df = 1, p-value = 0.1441

> 
> model_2 <- OneR(formula = DELAYED ~ OR1, data = data)
> summary(model_2)

Rules:
If OR1 = 0 then DELAYED = 0
If OR1 = 1 then DELAYED = 1

Accuracy:
5 of 8 instances classified correctly (62.5%)

Contingency table:
       OR1
DELAYED   0   1 Sum
    0   * 4   0   4
    1     3 * 1   4
    Sum   7   1   8
---
Maximum in each column: '*'

Pearson's Chi-squared test:
X-squared = 0, df = 1, p-value = 1

You can find more information about the OneR package here: https://github.com/vonjd/OneR

(Full disclosure: I am the author of this package)
