Question

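I want to classify text data using an SVM classifier in RapidMiner. The classification would be multi-label. Since my data is text, how can an SVM be used for this classification? As far as I know, SVMs work with numeric data only.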

Answer 1

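The missing piece you are looking for is called a "word vector". Basically, you have to create a new example set in which each attribute represents a single word. For a given example (i.e. a document), the (numerical) value of that attribute shows the "importance" of this word for this document.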

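A naive approach would be to use the count of the word within the document, but typically you should use TF-IDF (term frequency-inverse document frequency), which also takes the whole document corpus into account.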
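To make the idea concrete, here is a minimal sketch of TF-IDF weighting in plain Python. It assumes the textbook formula (relative term frequency times the log of inverse document frequency) and whitespace tokenization; RapidMiner's exact weighting and normalization may differ, and the function name `tf_idf_vectors` and the sample documents are made up for illustration.

```python
import math
from collections import Counter

def tf_idf_vectors(documents):
    """Return (vocabulary, vectors): one TF-IDF weight per word per document."""
    # Assumption: plain whitespace tokenization, lowercased.
    tokenized = [doc.lower().split() for doc in documents]
    vocabulary = sorted({word for tokens in tokenized for word in tokens})
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each word occur?
    doc_freq = Counter(word for tokens in tokenized for word in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        row = []
        for word in vocabulary:
            tf = counts[word] / len(tokens) if tokens else 0.0
            idf = math.log(n_docs / doc_freq[word])
            row.append(tf * idf)
        vectors.append(row)
    return vocabulary, vectors

# Hypothetical mini corpus: each string is one document.
docs = [
    "svm works with numeric data",
    "text data needs word vectors",
    "word vectors are numeric",
]
vocab, vecs = tf_idf_vectors(docs)
```

Each row of `vecs` is the numeric word vector for one document, with one column per entry in `vocab`. A word that occurs in only one document (such as "svm") gets a higher weight there than a word shared by several documents (such as "data").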

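To do this in RapidMiner, you have to install the Text Mining Extension and use operators such as "Process Documents from Data" or "Process Documents from Files". Keep in mind that text mining requires more preprocessing steps, such as creating tokens, removing stop words (common words that you can find in nearly all documents and that are therefore not very helpful), and using the stem of the words (so "word" and "words" will be treated equally).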

Here is a small example:

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.3.009">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="5.3.009" expanded="true" name="Process">
    <process expanded="true">
      <operator activated="true" class="text:create_document" compatibility="5.3.000" expanded="true" height="60" name="Create Document" width="90" x="45" y="75">
        <parameter key="text" value="I want to classify text data using classifier model SVM with Rapidminer tool. Classification would be of multilable type. Since my data is of text type, how SVM can be used for this classification. I know that SVM works with numeric data only."/>
      </operator>
      <operator activated="true" class="text:create_document" compatibility="5.3.000" expanded="true" height="60" name="Create Document (2)" width="90" x="45" y="165">
        <parameter key="text" value="The missing piece you are looking for is called &quot;word vector&quot;. Basically you have to create a new example set for which the attributes will represent the words. For a given example (i.e. a document) the (numerical) value for this attribute will show the &quot;importance&quot; of this word for this document. &#10;&#10;A naive approach would be to use the count of the word within the document, but typically you should use TD-IDF (term frequency–inverse document frequency) which will take the whole document corpus into account as well.&#10;&#10;To do this in RapidMiner you have to install the text mining extension and use operators like &quot;Process Documents from Data&quot; or &quot;Process Documents from Files&quot;. Keep in mind that for text mining you will need to conduct more preprocessing steps like creating tokens, removing stop words (common words which you can find in nearly all documents and which are therefore not very helpful) and use the stem of the words (so &quot;word&quot; and &quot;words&quot; will be treated equally).&#10;&#10;Here is a small example:"/>
      </operator>
      <operator activated="true" class="text:process_documents" compatibility="5.3.000" expanded="true" height="112" name="Process Documents" width="90" x="179" y="75">
        <process expanded="true">
          <operator activated="true" class="text:tokenize" compatibility="5.3.000" expanded="true" height="60" name="Tokenize" width="90" x="45" y="30"/>
          <operator activated="true" class="text:filter_stopwords_english" compatibility="5.3.000" expanded="true" height="60" name="Filter Stopwords (English)" width="90" x="179" y="30"/>
          <operator activated="true" class="text:stem_porter" compatibility="5.3.000" expanded="true" height="60" name="Stem (Porter)" width="90" x="313" y="30"/>
          <connect from_port="document" to_op="Tokenize" to_port="document"/>
          <connect from_op="Tokenize" from_port="document" to_op="Filter Stopwords (English)" to_port="document"/>
          <connect from_op="Filter Stopwords (English)" from_port="document" to_op="Stem (Porter)" to_port="document"/>
          <connect from_op="Stem (Porter)" from_port="document" to_port="document 1"/>
          <portSpacing port="source_document" spacing="0"/>
          <portSpacing port="sink_document 1" spacing="0"/>
          <portSpacing port="sink_document 2" spacing="0"/>
        </process>
      </operator>
      <connect from_op="Create Document" from_port="output" to_op="Process Documents" to_port="documents 1"/>
      <connect from_op="Create Document (2)" from_port="output" to_op="Process Documents" to_port="documents 2"/>
      <connect from_op="Process Documents" from_port="example set" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>
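For readers who want to see what the inner chain of the process above (Tokenize, Filter Stopwords, Stem) does to a piece of text, here is a rough Python sketch of the same three steps. It is only an approximation under stated assumptions: the stop word list is a tiny made-up sample, lowercasing is added for simplicity, and `crude_stem` is a toy suffix stripper, not the Porter algorithm that the "Stem (Porter)" operator uses.

```python
import re

# Tiny illustrative stop word list (an assumption, not RapidMiner's list).
STOP_WORDS = {"a", "an", "and", "are", "be", "is", "of", "the", "to", "will"}

def tokenize(text):
    """Split on anything that is not a letter and lowercase the tokens."""
    return [t.lower() for t in re.split(r"[^A-Za-z]+", text) if t]

def remove_stop_words(tokens):
    """Drop tokens that appear in the stop word list."""
    return [t for t in tokens if t not in STOP_WORDS]

def crude_stem(token):
    """Toy stemmer: strips a trailing plural 's'. NOT the Porter algorithm."""
    if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token

def preprocess(text):
    """Tokenize, filter stop words, then stem, in that order."""
    return [crude_stem(t) for t in remove_stop_words(tokenize(text))]

tokens = preprocess("The words and the word will be treated equally.")
```

After this step, "words" and "word" map to the same token, so they end up in the same attribute of the word vector.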

BTW: there are also some very good tutorials on text mining with RapidMiner on YouTube.

Answer 2

This question may be rather old, but there may be more people like me who are just trying out RapidMiner and want to solve exactly the same problem.

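I think the first part, about processing text in general with RapidMiner's "Text Mining Extension" plugin, was already explained properly by maerch some time ago. But judging by kailash's comment, the main problem seems to be the incompatibility between the binominal SVM model and a polynominal input/label set.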

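The actual feeding of the SVM model is done by adding the meta operator "Polynominal by Binominal Classification" as a wrapper around the SVM. It repeatedly merges the input classes (in a way that can be selected with the "classification strategies" parameter) so that there are always two input groups, and feeds them to the SVM until a combined result can be derived. The resulting model is then able to handle multiple classes.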
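To illustrate the decomposition described above, here is a small Python sketch of the "1 against all" strategy. Everything in it is an assumption made for illustration: the binary learner is a simple centroid-based scorer standing in for the SVM, the sample data is invented, and the combination rule (pick the class whose binary model gives the highest score) is the usual one-against-all convention rather than a description of RapidMiner's internals.

```python
def one_against_all_problems(labels):
    """For each class, build a binary label list: that class vs. all others."""
    classes = sorted(set(labels))
    return {c: [1 if y == c else -1 for y in labels] for c in classes}

def train_centroid_scorer(rows, binary_labels):
    """Stand-in binary learner (NOT an SVM): linear score from class means."""
    dim = len(rows[0])

    def mean(group):
        return [sum(r[i] for r in group) / len(group) for i in range(dim)]

    mu_pos = mean([r for r, y in zip(rows, binary_labels) if y == 1])
    mu_neg = mean([r for r, y in zip(rows, binary_labels) if y == -1])
    weights = [p - n for p, n in zip(mu_pos, mu_neg)]
    # Put the decision threshold halfway between the two class means.
    bias = -sum(w * (p + n) / 2 for w, p, n in zip(weights, mu_pos, mu_neg))
    return lambda row: sum(w * x for w, x in zip(weights, row)) + bias

def train_one_against_all(rows, labels):
    """Train one binary model per class."""
    problems = one_against_all_problems(labels)
    return {c: train_centroid_scorer(rows, ys) for c, ys in problems.items()}

def predict(models, row):
    """Pick the class whose binary model is most confident."""
    return max(models, key=lambda c: models[c](row))

# Hypothetical word vectors (three attributes) with a three-class label.
rows = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 1.0, 0.0],
        [0.1, 0.9, 0.0], [0.0, 0.0, 1.0], [0.0, 0.1, 0.9]]
labels = ["sports", "sports", "politics", "politics", "tech", "tech"]
models = train_one_against_all(rows, labels)
```

Calling `predict(models, [0.8, 0.2, 0.0])` then selects among all three classes, even though every underlying model only ever saw a two-class problem.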

The following process snippet illustrates the SVM with the Poly2Bi wrapper (default parameters):

<process expanded="true">
    <operator activated="true" class="polynomial_by_binomial_classification" compatibility="5.3.015" expanded="true" height="76" name="Polynominal by Binominal Classification" width="90" x="112" y="120">
        <parameter key="classification_strategies" value="1 against all"/>
        <parameter key="random_code_multiplicator" value="2.0"/>
        <parameter key="use_local_random_seed" value="false"/>
        <parameter key="local_random_seed" value="1992"/>
        <process expanded="true">
            <operator activated="true" class="support_vector_machine_linear" compatibility="5.3.015" expanded="true" height="76" name="SVM (Linear)" width="90" x="179" y="210">
                <parameter key="kernel_cache" value="200"/>
                <parameter key="C" value="0.0"/>
                <parameter key="convergence_epsilon" value="0.001"/>
                <parameter key="max_iterations" value="100000"/>
                <parameter key="scale" value="true"/>
                <parameter key="L_pos" value="1.0"/>
                <parameter key="L_neg" value="1.0"/>
                <parameter key="epsilon" value="0.0"/>
                <parameter key="epsilon_plus" value="0.0"/>
                <parameter key="epsilon_minus" value="0.0"/>
                <parameter key="balance_cost" value="false"/>
                <parameter key="quadratic_loss_pos" value="false"/>
                <parameter key="quadratic_loss_neg" value="false"/>
            </operator>
            <connect from_port="training set" to_op="SVM (Linear)" to_port="training set"/>
            <connect from_op="SVM (Linear)" from_port="model" to_port="model"/>
            <portSpacing port="source_training set" spacing="0"/>
            <portSpacing port="sink_model" spacing="0"/>
        </process>
    </operator>
    <connect from_port="training" to_op="Polynominal by Binominal Classification" to_port="training set"/>
    <connect from_op="Polynominal by Binominal Classification" from_port="model" to_port="model"/>
    <portSpacing port="source_training" spacing="0"/>
    <portSpacing port="sink_model" spacing="0"/>
    <portSpacing port="sink_through 1" spacing="0"/>
</process>

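Note that (at least) version 5.3.015 of RapidMiner complains if the Poly2Bi operator is used this way inside the training area of a Validation operator while a Performance operator sits in the testing area. The Performance operator will show the error message: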

Label and prediction must be of the same type, but are polynominal and nominal, respectively.

But in the RapidMiner forum they point out that this seems to be a spurious warning that you can ignore. In my case, the process also ran fine.
