Guaranteeing the same subsets for several techniques in RapidMiner's X-Validation

Time: 2014-06-27 10:55:12

Tags: validation data-mining rapidminer cross-validation

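I am in the feature-selection phase of a data mining project for a class, whose main goal is to compare several data mining techniques (Naive Bayes, SVM, etc.). In this phase I am using a wrapper with X-Validation, as in the example below: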

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.3.008">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="5.3.008" expanded="true" name="Process">
    <process expanded="true">
      <operator activated="true" class="optimize_selection" compatibility="5.3.008" expanded="true" height="94" name="Optimize Selection (3)" width="90" x="179" y="120">
        <parameter key="generations_without_improval" value="100"/>
        <parameter key="limit_number_of_generations" value="true"/>
        <parameter key="maximum_number_of_generations" value="-1"/>
        <process expanded="true">
          <operator activated="true" class="x_validation" compatibility="5.3.008" expanded="true" height="112" name="Validation" width="90" x="179" y="75">
            <process expanded="true">
              <operator activated="true" class="naive_bayes" compatibility="5.3.008" expanded="true" height="76" name="Naive Bayes (4)" width="90" x="119" y="30"/>
              <connect from_port="training" to_op="Naive Bayes (4)" to_port="training set"/>
              <connect from_op="Naive Bayes (4)" from_port="model" to_port="model"/>
              <portSpacing port="source_training" spacing="0"/>
              <portSpacing port="sink_model" spacing="0"/>
              <portSpacing port="sink_through 1" spacing="0"/>
            </process>
            <process expanded="true">
              <operator activated="true" class="apply_model" compatibility="5.3.008" expanded="true" height="76" name="Apply Model (8)" width="90" x="45" y="30">
                <list key="application_parameters"/>
              </operator>
              <operator activated="true" class="performance" compatibility="5.3.008" expanded="true" height="76" name="Performance (8)" width="90" x="209" y="30"/>
              <connect from_port="model" to_op="Apply Model (8)" to_port="model"/>
              <connect from_port="test set" to_op="Apply Model (8)" to_port="unlabelled data"/>
              <connect from_op="Apply Model (8)" from_port="labelled data" to_op="Performance (8)" to_port="labelled data"/>
              <connect from_op="Performance (8)" from_port="performance" to_port="averagable 1"/>
              <portSpacing port="source_model" spacing="0"/>
              <portSpacing port="source_test set" spacing="0"/>
              <portSpacing port="source_through 1" spacing="0"/>
              <portSpacing port="sink_averagable 1" spacing="0"/>
              <portSpacing port="sink_averagable 2" spacing="0"/>
            </process>
          </operator>
          <connect from_port="example set" to_op="Validation" to_port="training"/>
          <connect from_op="Validation" from_port="averagable 1" to_port="performance"/>
          <portSpacing port="source_example set" spacing="0"/>
          <portSpacing port="source_through 1" spacing="0"/>
          <portSpacing port="sink_performance" spacing="0"/>
        </process>
      </operator>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
    </process>
  </operator>
</process>

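The problem is that, to compare several techniques, I have to guarantee that the subsets generated during cross-validation are the same for all of them, so that I know the resulting accuracies were obtained under exactly the same conditions. However, I cannot place more than one model-building operator inside the X-Validation operator, so I do not know how to guarantee this.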

1 answer:

Answer 0: (score: 0)

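The Optimize Selection operator uses the performance of its inner operators to decide which attributes to keep or remove during forward or backward selection. This means the attribute ordering is determined by the performance that the inner learner returns, and in general different inner learners will produce different orderings. If that is what you want, you can place a Multiply operator inside the Optimize Selection operator to take a copy of the example set and pass it to a second validation block containing the other learner. You can then use the Log operator to record that learner's performance values alongside those of the original learner, which is the one driving the attribute ordering. The Optimize Selection operator can also log its progress, including the names of the features currently under consideration.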
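
The Multiply arrangement described above might look roughly like the fragment below, which would sit inside the Optimize Selection subprocess. It is a sketch under assumptions, not a complete process. The operator names "Multiply", "Validation NB" and "Validation SVM" are made up for illustration, the seed value 1992 is arbitrary, and the training and testing subprocesses are left out (they would follow the pattern in the question's process). It also goes one step beyond the answer by assuming that giving both X-Validation operators the same `number_of_validations`, the same sampling type, and the same `local_random_seed` (with `use_local_random_seed` enabled) makes them split the copied example set into the same folds. That is worth checking on your RapidMiner version, for instance by logging the fold contents.

```xml
<!-- Sketch only: not importable as-is, because the validation subprocesses are omitted. -->
<process expanded="true">
  <!-- Copies the example set so that each validation block receives identical data. -->
  <operator activated="true" class="multiply" compatibility="5.3.008" expanded="true" name="Multiply"/>
  <!-- First validation block: its performance drives Optimize Selection. -->
  <operator activated="true" class="x_validation" compatibility="5.3.008" expanded="true" name="Validation NB">
    <parameter key="number_of_validations" value="10"/>
    <parameter key="use_local_random_seed" value="true"/>
    <parameter key="local_random_seed" value="1992"/>
    <!-- training subprocess (Naive Bayes) and testing subprocess go here -->
  </operator>
  <!-- Second validation block: same fold settings and seed, different learner. -->
  <operator activated="true" class="x_validation" compatibility="5.3.008" expanded="true" name="Validation SVM">
    <parameter key="number_of_validations" value="10"/>
    <parameter key="use_local_random_seed" value="true"/>
    <parameter key="local_random_seed" value="1992"/>
    <!-- training subprocess (the other learner) and testing subprocess go here -->
  </operator>
  <connect from_port="example set" to_op="Multiply" to_port="input"/>
  <connect from_op="Multiply" from_port="output 1" to_op="Validation NB" to_port="training"/>
  <connect from_op="Multiply" from_port="output 2" to_op="Validation SVM" to_port="training"/>
  <!-- Only one performance vector can be returned to Optimize Selection. -->
  <connect from_op="Validation NB" from_port="averagable 1" to_port="performance"/>
</process>
```

Only the first block's performance is wired to the `performance` sink, so the second learner's result has to be captured separately, for example with the Log operator mentioned above.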