Question

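I have the following output from an SVM classification in Weka. I want to plot the SVM classifier output as anomaly or normal. How can I get the SVM scoring function from this output?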

=== Run information ===

Scheme:       weka.classifiers.functions.SMO -C 1.0 -L 0.001 -P 1.0E-12 -N 0 -V -1 -W 1 -K "weka.classifiers.functions.supportVector.PolyKernel -E 1.0 -C 250007"
Relation:     KDDTrain
Instances:    125973
Attributes:   42
              duration
              protocol_type
              service
              flag
              src_bytes
              dst_bytes
              land
              wrong_fragment
              urgent
              hot
              num_failed_logins
              logged_in
              num_compromised
              root_shell
              su_attempted
              num_root
              num_file_creations
              num_shells
              num_access_files
              num_outbound_cmds
              is_host_login
              is_guest_login
              count
              srv_count
              serror_rate
              srv_serror_rate
              rerror_rate
              srv_rerror_rate
              same_srv_rate
              diff_srv_rate
              srv_diff_host_rate
              dst_host_count
              dst_host_srv_count
              dst_host_same_srv_rate
              dst_host_diff_srv_rate
              dst_host_same_src_port_rate
              dst_host_srv_diff_host_rate
              dst_host_serror_rate
              dst_host_srv_serror_rate
              dst_host_rerror_rate
              dst_host_srv_rerror_rate
              class
Test mode:    10-fold cross-validation

=== Classifier model (full training set) ===

SMO

Kernel used:
  Linear Kernel: K(x,y) = <x,y>

Classifier for classes: normal, anomaly

BinarySMO

Machine linear: showing attribute weights, not support vectors.

        -0.0498 * (normalized) duration
 +       0.5131 * (normalized) protocol_type=tcp
 +      -0.6236 * (normalized) protocol_type=udp
 +       0.1105 * (normalized) protocol_type=icmp
 +      -1.1861 * (normalized) service=auth
 +       0      * (normalized) service=bgp
 +       0      * (normalized) service=courier
 +       1      * (normalized) service=csnet_ns
 +       1      * (normalized) service=ctf
 +       1      * (normalized) service=daytime
 +      -0      * (normalized) service=discard
 +      -1.2505 * (normalized) service=domain
 +      -0.6878 * (normalized) service=domain_u
 +       0.9418 * (normalized) service=echo
 +       1.1964 * (normalized) service=eco_i
 +       0.9767 * (normalized) service=ecr_i
 +       0.0073 * (normalized) service=efs
 +       0.0595 * (normalized) service=exec
 +      -1.4426 * (normalized) service=finger
 +      -1.047  * (normalized) service=ftp
 +      -1.4225 * (normalized) service=ftp_data
 +       2      * (normalized) service=gopher
 +       1      * (normalized) service=hostnames
 +      -0.9961 * (normalized) service=http
 +       0.7255 * (normalized) service=http_443
 +       0.5128 * (normalized) service=imap4
 +      -6.3664 * (normalized) service=IRC
 +       1      * (normalized) service=iso_tsap
 +      -0      * (normalized) service=klogin
 +       0      * (normalized) service=kshell
 +       0.7422 * (normalized) service=ldap
 +       1      * (normalized) service=link
 +       0.5993 * (normalized) service=login
 +       1      * (normalized) service=mtp
 +       1      * (normalized) service=name
 +       0.2322 * (normalized) service=netbios_dgm
 +       0.213  * (normalized) service=netbios_ns
 +       0.1902 * (normalized) service=netbios_ssn
 +       1.1472 * (normalized) service=netstat
 +       0.0504 * (normalized) service=nnsp
 +       1.058  * (normalized) service=nntp
 +      -1      * (normalized) service=ntp_u
 +      -1.5344 * (normalized) service=other
 +       1.3595 * (normalized) service=pm_dump
 +       0.8355 * (normalized) service=pop_2
 +      -2      * (normalized) service=pop_3
 +       0      * (normalized) service=printer
 +       1.051  * (normalized) service=private
 +      -0.3082 * (normalized) service=red_i
 +       1.0034 * (normalized) service=remote_job
 +       1.0112 * (normalized) service=rje
 +      -1.0454 * (normalized) service=shell
 +      -1.6948 * (normalized) service=smtp
 +       0.1388 * (normalized) service=sql_net
 +      -0.3438 * (normalized) service=ssh
 +       1      * (normalized) service=supdup
 +       0.8756 * (normalized) service=systat
 +      -1.6856 * (normalized) service=telnet
 +      -0      * (normalized) service=tim_i
 +      -0.8579 * (normalized) service=time
 +      -0.726  * (normalized) service=urh_i
 +      -1.0285 * (normalized) service=urp_i
 +       1.0347 * (normalized) service=uucp
 +       0      * (normalized) service=uucp_path
 +       0      * (normalized) service=vmnet
 +       1      * (normalized) service=whois
 +      -1.3388 * (normalized) service=X11
 +       0      * (normalized) service=Z39_50
 +       1.7882 * (normalized) flag=OTH
 +      -3.0982 * (normalized) flag=REJ
 +      -1.7279 * (normalized) flag=RSTO
 +       1      * (normalized) flag=RSTOS0
 +       2.4264 * (normalized) flag=RSTR
 +       1.5906 * (normalized) flag=S0
 +      -1.952  * (normalized) flag=S1
 +      -0.9628 * (normalized) flag=S2
 +      -0.3455 * (normalized) flag=S3
 +       1.2757 * (normalized) flag=SF
 +       0.0054 * (normalized) flag=SH
 +       0.8742 * (normalized) src_bytes
 +       0.0542 * (normalized) dst_bytes
 +      -1.2659 * (normalized) land=1
 +       2.7922 * (normalized) wrong_fragment
 +       0.0662 * (normalized) urgent
 +       8.1153 * (normalized) hot
 +       2.4822 * (normalized) num_failed_logins
 +       0.2242 * (normalized) logged_in=1
 +      -0.0544 * (normalized) num_compromised
 +       0.9248 * (normalized) root_shell
 +      -2.363  * (normalized) su_attempted
 +      -0.2024 * (normalized) num_root
 +      -1.2791 * (normalized) num_file_creations
 +      -0.0314 * (normalized) num_shells
 +      -1.4125 * (normalized) num_access_files
 +      -0.0154 * (normalized) is_host_login=1
 +      -2.3307 * (normalized) is_guest_login=1
 +       4.3191 * (normalized) count
 +      -2.7484 * (normalized) srv_count
 +      -0.6276 * (normalized) serror_rate
 +       2.843  * (normalized) srv_serror_rate
 +       0.6105 * (normalized) rerror_rate
 +       3.1388 * (normalized) srv_rerror_rate
 +      -0.1262 * (normalized) same_srv_rate
 +      -0.1825 * (normalized) diff_srv_rate
 +       0.2961 * (normalized) srv_diff_host_rate
 +       0.7812 * (normalized) dst_host_count
 +      -1.0053 * (normalized) dst_host_srv_count
 +       0.0284 * (normalized) dst_host_same_srv_rate
 +       0.4419 * (normalized) dst_host_diff_srv_rate
 +       1.384  * (normalized) dst_host_same_src_port_rate
 +       0.8004 * (normalized) dst_host_srv_diff_host_rate
 +       0.2301 * (normalized) dst_host_serror_rate
 +       0.6401 * (normalized) dst_host_srv_serror_rate
 +       0.6422 * (normalized) dst_host_rerror_rate
 +       0.3692 * (normalized) dst_host_srv_rerror_rate
 -       2.5266

Number of kernel evaluations: -1049600465

Output predictions (sample of the output)

inst#     actual  predicted error prediction
        1   1:normal   1:normal       1
        2   1:normal   1:normal       1
        3  2:anomaly  2:anomaly       1
        4   1:normal   1:normal       1
        5   1:normal   1:normal       1
        6  2:anomaly  2:anomaly       1
        7  2:anomaly  2:anomaly       1
        8  2:anomaly  2:anomaly       1
        9  2:anomaly  2:anomaly       1
       10  2:anomaly  2:anomaly       1
       11  2:anomaly  2:anomaly       1
       12  2:anomaly  2:anomaly       1
       13   1:normal   1:normal       1
       14  2:anomaly   1:normal   +   1
       15  2:anomaly  2:anomaly       1
       16  2:anomaly  2:anomaly       1
       17   1:normal   1:normal       1
       18  2:anomaly  2:anomaly       1
       19   1:normal   1:normal       1
       20   1:normal   1:normal       1
       21  2:anomaly  2:anomaly       1
       22  2:anomaly  2:anomaly       1
       23   1:normal   1:normal       1
       24   1:normal   1:normal       1
       25  2:anomaly  2:anomaly       1
       26   1:normal   1:normal       1
       27  2:anomaly  2:anomaly       1
       28   1:normal   1:normal       1
       29   1:normal   1:normal       1
       30   1:normal   1:normal       1
       31  2:anomaly  2:anomaly       1
       32  2:anomaly  2:anomaly       1
       33   1:normal   1:normal       1
       34  2:anomaly  2:anomaly       1
       35   1:normal   1:normal       1
       36   1:normal   1:normal       1
       37   1:normal   1:normal       1
       38  2:anomaly  2:anomaly       1
       39   1:normal   1:normal       1
       40  2:anomaly  2:anomaly       1
       41  2:anomaly  2:anomaly       1
       42  2:anomaly  2:anomaly       1
       43   1:normal   1:normal       1
       44   1:normal   1:normal       1
       45   1:normal   1:normal       1
       46  2:anomaly  2:anomaly       1
       47  2:anomaly  2:anomaly       1
       48   1:normal   1:normal       1
       49  2:anomaly   1:normal   +   1
       50  2:anomaly  2:anomaly       1

=== Detailed Accuracy By Class ===

                 TP Rate  FP Rate  Precision  Recall   F-Measure  MCC      ROC Area  PRC Area  Class
                 0.986    0.039    0.967      0.986    0.976      0.948    0.973     0.960     normal
                 0.961    0.014    0.983      0.961    0.972      0.948    0.973     0.963     anomaly
Weighted Avg.    0.974    0.028    0.974      0.974    0.974      0.948    0.973     0.962

=== Confusion Matrix ===

     a     b   <-- classified as
 66389   954 |     a = normal
  2301 56329 |     b = anomaly

Answer 1

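The output is the scoring function. Read the equals sign as a simple Boolean operator that evaluates to 1 for true and 0 for false. So, among all the possible values of a categorical attribute, only one coefficient contributes to the score.

For example, let's consider just the first three attributes, with these normalized inputs and resulting values: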

duration      2.0     -0.0498 * 2.0 => -0.0996
protocol_type icmp     0.1105
service       eco_i    1.1964

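Note that the other protocol_type and service terms (for example, -0.6236 * (normalized) protocol_type=udp) have comparisons that evaluate to 0 (protocol_type=udp becomes 0 when the protocol is icmp), so those coefficients do not contribute to the sum.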

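From these three attributes, the score so far is the sum of the three terms, or 1.2073. Continue with the remaining attributes, add the final constant -2.5266, and you have the score for your vector.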
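The arithmetic above can be checked with a short Python sketch. The weights are copied from the model output in the question; the `term_value` helper and the `instance` dictionary are illustrative names, and the input values are the same made-up ones used in the example.

```python
# Weights copied from the Weka output (only a few of the terms).
# Numeric attributes are keyed by name, categorical ones by (attribute, value).
weights = {
    "duration": -0.0498,
    ("protocol_type", "tcp"): 0.5131,
    ("protocol_type", "udp"): -0.6236,
    ("protocol_type", "icmp"): 0.1105,
    ("service", "eco_i"): 1.1964,
    ("service", "http"): -0.9961,
}

# Hypothetical, already-normalized instance from the worked example.
instance = {"duration": 2.0, "protocol_type": "icmp", "service": "eco_i"}


def term_value(key, weight, inst):
    """Return the contribution of a single term of the scoring function."""
    if isinstance(key, tuple):
        # Categorical term: "attribute=value" is 1 if it matches, else 0.
        attribute, value = key
        return weight * (1.0 if inst[attribute] == value else 0.0)
    # Numeric term: weight times the normalized value.
    return weight * inst[key]


partial_score = sum(term_value(k, w, instance) for k, w in weights.items())
print(round(partial_score, 4))
```

Only the icmp and eco_i terms survive among the categorical entries, which is why the tcp, udp and http weights have no effect on the sum.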

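Does that explain it well enough?

The key statement in the blog you cited is:

If the output of the scoring function is negative, the input is classified as belonging to class y = -1. If the score is positive, the input is classified as belonging to class y = 1.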

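Yes, it is that simple: implement that nice linear scoring function (42 variables, 116 terms) and plug in a vector. The sign of the result gives the class. One caution on direction: as far as I know, Weka's BinarySMO treats the first class listed (normal here) as -1 and the second (anomaly) as +1, so a positive score should mean anomaly and a negative score normal. Check this against a few rows of your output predictions before relying on it.

Yes, your model is noticeably longer than the blog's example. That example is based on two continuous features; you have 42 features, three of which are categorical (hence the extra 73 terms). The example has 3 support vectors; yours will have 43 (N dimensions need N + 1 support vectors). Even so, this 42-dimensional model works on the same principle: the sign of the score decides between normal and anomaly.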
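As a rough sketch of what "implement the scoring function" could look like, the Python below parses the printed weight lines and scores one instance. The regular expression, the function names, and the sample instance are my own illustration, not part of Weka. The inputs are assumed to be normalized the same way Weka normalized the training data, and the mapping from sign to label is an assumption you should verify against your own predictions.

```python
import re

# A few lines copied from the model output; paste the full block in practice.
MODEL_TEXT = """
        -0.0498 * (normalized) duration
 +       0.5131 * (normalized) protocol_type=tcp
 +       0.1105 * (normalized) protocol_type=icmp
 +       1.1964 * (normalized) service=eco_i
 +       8.1153 * (normalized) hot
 -       2.5266
"""

# Optional leading +/-, a number, then optionally "* (normalized) name[=value]".
TERM = re.compile(
    r"^\s*([+-])?\s*(-?\d+(?:\.\d+)?)\s*(?:\*\s*\(normalized\)\s*(\S+))?\s*$"
)


def parse_model(text):
    """Split the printed linear machine into a list of terms and a bias."""
    terms, bias = [], 0.0
    for line in text.splitlines():
        match = TERM.match(line)
        if not match:
            continue  # blank lines and anything that is not a term
        sign, number, name = match.groups()
        value = float(number) * (-1.0 if sign == "-" else 1.0)
        if name is None:
            bias += value  # the trailing constant has no attribute name
        else:
            attribute, _, category = name.partition("=")
            terms.append((value, attribute, category or None))
    return terms, bias


def score(terms, bias, inst):
    """Evaluate the linear scoring function for one normalized instance."""
    total = bias
    for weight, attribute, category in terms:
        if category is None:
            total += weight * float(inst[attribute])
        elif str(inst[attribute]) == category:
            total += weight  # the "attribute=value" test evaluated to 1
    return total


def classify(value, negative_label, positive_label):
    """Map the sign of the score to a label (the mapping is an assumption)."""
    return positive_label if value > 0 else negative_label


terms, bias = parse_model(MODEL_TEXT)
# Hypothetical normalized instance, for illustration only.
example = {"duration": 2.0, "protocol_type": "icmp", "service": "eco_i", "hot": 0.0}
value = score(terms, bias, example)
# Assumption: negative side = normal, positive side = anomaly; swap if needed.
label = classify(value, negative_label="normal", positive_label="anomaly")
print(round(value, 4), label)
```

Plotting the raw `value` for every instance, rather than only the label, gives a graded view of how far each point lies from the decision boundary.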

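As for wanting to map this onto a two-dimensional display... it is possible... but I don't know what you would find meaningful in this case. Mapping 42 variables down to 3 creates a lot of congestion in our space. I've seen some nice tricks here and there, especially gradient fields in which the force vectors share the same spatial interpretation as the data points. A weather map manages to represent measurements at x, y, z coordinates while adding wind velocity (3D), cloud cover, and possibly several other metrics to the display. That could be 10 symbolic dimensions.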

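In your case, we could simply drop the dimensions whose coefficients are smaller than 0.07 in magnitude as insignificant; that saves 6 features. We could represent the three categorical features with color, dashed/dotted/solid symbols, and a small text overlay on the O or X marker (for normal/anomalous data). Without even using Cartesian position (x, y, z coordinates, assuming the plot makes sense in 3D), that is 9 down.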
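One way to apply the pruning idea is to drop numeric attributes whose weights fall under a cut-off and rank the rest by magnitude. The weights below are a subset copied from the model output; the 0.07 threshold comes from the paragraph above, and the variable names are illustrative. Keep in mind this only reflects the linear model's weights on normalized inputs, so treat it as a heuristic.

```python
# Subset of numeric-attribute weights copied from the model output.
numeric_weights = {
    "duration": -0.0498,
    "src_bytes": 0.8742,
    "dst_bytes": 0.0542,
    "wrong_fragment": 2.7922,
    "urgent": 0.0662,
    "hot": 8.1153,
    "num_compromised": -0.0544,
    "num_shells": -0.0314,
    "count": 4.3191,
    "srv_count": -2.7484,
}

THRESHOLD = 0.07  # cut-off suggested in the answer

# Compare magnitudes: a large negative weight matters as much as a positive one.
kept = {name: w for name, w in numeric_weights.items() if abs(w) >= THRESHOLD}
dropped = sorted(set(numeric_weights) - set(kept))

# Most influential attributes first.
ranked = sorted(kept, key=lambda name: abs(kept[name]), reverse=True)

print("dropped:", dropped)
print("ranked: ", ranked)
```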

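However, I don't know your data nearly well enough to suggest how we might cram the remaining 33 features into 2 or 3 dimensions. Can you combine any of these inputs in some way? Would a linear combination of several features give you something that is still meaningful for prediction?

If not, then we're stuck with the canonical approach: pick interesting combinations of features (usually pairs). Plot a graph for each, ignoring the other features entirely. If none of those makes visual sense... then that's our answer: no, we can't plot the data nicely. Sorry, but reality often does this to us in complex settings; we deal with the data through tables, correlations, and other methods that our 3D minds can handle.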
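A small sketch of the pairwise approach: enumerate pairs of attributes and, for each pair, compute where the decision boundary would fall in that 2D view. The weights and the constant are copied from the model output; the choice of these four attributes, the `boundary` helper, and holding every other input at 0 are simplifying assumptions of mine (a real instance always has one active value per categorical attribute, which shifts the line).

```python
from itertools import combinations

# A few of the larger numeric weights from the model output.
weights = {
    "hot": 8.1153,
    "count": 4.3191,
    "srv_rerror_rate": 3.1388,
    "srv_serror_rate": 2.843,
}
BIAS = -2.5266  # the trailing constant of the scoring function


def boundary(w1, w2, bias, x1):
    """Return x2 on the line w1*x1 + w2*x2 + bias = 0, other inputs held at 0."""
    return -(w1 * x1 + bias) / w2


pairs = list(combinations(weights, 2))
for first, second in pairs:
    # Normalized inputs lie in [0, 1], so two end points define the segment.
    left = boundary(weights[first], weights[second], BIAS, 0.0)
    right = boundary(weights[first], weights[second], BIAS, 1.0)
    print(f"{first} vs {second}: boundary from (0, {left:.3f}) to (1, {right:.3f})")
```

Each pair could then be drawn as a scatter plot of the two normalized attributes with this line overlaid, using whatever plotting tool you prefer.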

Answer 2

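This is a rather different approach, but I think it can solve your underlying problem. I assume you used the Weka Explorer to generate the model. If you go to the Classify tab, click More options... and tick Output predictions, you can get the probability for each class. That should allow you to plot normal versus anomaly.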

For iris I get something like

inst#,    actual, predicted, error, probability distribution
     1 3:Iris-vir 3:Iris-vir          0      0.333 *0.667
     2 3:Iris-vir 3:Iris-vir          0      0.333 *0.667
     3 3:Iris-vir 3:Iris-vir          0      0.333 *0.667
     4 3:Iris-vir 3:Iris-vir          0      0.333 *0.667
     5 3:Iris-vir 3:Iris-vir          0      0.333 *0.667
     6 1:Iris-set 1:Iris-set         *0.667  0.333  0    
     7 1:Iris-set 1:Iris-set         *0.667  0.333  0    
     8 1:Iris-set 1:Iris-set         *0.667  0.333  0    
     9 1:Iris-set 1:Iris-set         *0.667  0.333  0    
    10 1:Iris-set 1:Iris-set         *0.667  0.333  0

It contains the probability for each class.
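To get these rows into a form you can plot, a small parser along the following lines might do. It is a sketch that assumes whitespace- or comma-separated columns exactly as in the listings in this thread; `parse_prediction` is my own helper name, not a Weka function.

```python
def parse_prediction(line):
    """Parse one row of Weka's prediction output into its parts.

    Returns (instance number, actual label, predicted label, probabilities).
    """
    tokens = line.replace(",", " ").split()
    inst = int(tokens[0])
    actual = tokens[1].split(":", 1)[1]      # "2:anomaly" -> "anomaly"
    predicted = tokens[2].split(":", 1)[1]
    # A lone "+" in the error column marks a misclassified row; skip it.
    rest = [t for t in tokens[3:] if t != "+"]
    # A leading "*" marks the probability of the predicted class.
    probabilities = [float(t.lstrip("*")) for t in rest]
    return inst, actual, predicted, probabilities


rows = [
    "     1 3:Iris-vir 3:Iris-vir          0      0.333 *0.667",
    "     6 1:Iris-set 1:Iris-set         *0.667  0.333  0    ",
    "       14  2:anomaly   1:normal   +   1",
]
for row in rows:
    print(parse_prediction(row))
```

Note that in the question's output the prediction column is 1 for every row, so plotting it would not show much. If I recall correctly, SMO only produces graded probabilities when it is asked to fit calibration models to the SVM outputs (the `-M` option), so that setting is worth checking first.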

SVM - scoring function
