Question

I have scraped a lot of eBay titles like this one:

Apple iPhone 5 White 16GB Dual-Core

and I have manually tagged all of them in this way:

B M C S NA

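where B = Brand (Apple), M = Model (iPhone 5), C = Color (White), S = Size (16GB), NA = Not Assigned (Dual-Core).

Now I need to train an SVM classifier using the libsvm library in Python to learn the sequence patterns that occur in eBay titles.

I need to extract new values for those attributes (Brand, Model, Color, Size) by treating the problem as a classification one. In this way I can predict new models.

I want to consider these features: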

* Position
- from the beginning of the title
- to the end of the listing
* Orthographic features
- current word contains a digit
- current word is capitalized 
....

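I can't understand how to give all this information to the library. The official documentation lacks a lot of information.

My classes are Brand, Model, Size, Color, NA.

What must the input file of the SVM algorithm contain?

How can I create it? Could I have an example of this file, considering the 4 features I listed in the question? Could I also have an example of the code I would use to build the input file?

*UPDATE* I want to represent these features... How should I do that?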

Identity of the current word

I think I could interpret it this way:

0 --> Brand
1 --> Model
2 --> Color
3 --> Size 
4 --> NA

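If I know that the word is a Brand, I will set that variable to 1 (true). That is fine for the training set (because I have tagged all the words), but how can I do it for the test set? I don't know the category of a word there (this is why I'm learning it :D).

N-gram substring features of the current word (N = 4, 5, 6). I have no idea what this means.
Identity of the 2 words before the current word. How can I model this feature?

Considering the legend I created for the first feature, I have 5^2 = 25 combinations: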

00 10 20 30 40
01 11 21 31 41
02 12 22 32 42
03 13 23 33 43
04 14 24 34 44

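How can I convert this into a format that libsvm (or scikit-learn) can understand?

Membership in the 4 dictionaries of attributes

How can I do this? Having 4 dictionaries (for color, size, model and brand), I think I must create a boolean variable that I set to true when the current word matches an entry in one of the 4 dictionaries.

Exclusive membership in the dictionary of brand names

I think that, as with the dictionary-membership feature above, I must use a boolean variable. Do you agree?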

Answer 1

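Here's a step-by-step guide for how to train an SVM using your data and then evaluate it using the same dataset. It's also available at http://nbviewer.ipython.org/gist/anonymous/2cf3b993aab10bf26d5f. At that URL you can also see the output of the intermediate data and the resulting accuracy (it's an IPython notebook).

Step 0: Install dependencies

You need to install the following libraries:

pandas
scikit-learn

From the command line: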

pip install pandas
pip install scikit-learn

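Step 1: Load the data

We will use pandas to load our data. pandas is a library for easily loading data. For illustration, we first save some sample data to a CSV file and then load it.

We will train the SVM with train.csv and get test labels with test.csv.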

import pandas as pd

train_data_contents = """
class_label,distance_from_beginning,distance_from_end,contains_digit,capitalized
B,1,10,1,0
M,10,1,0,1
C,2,3,0,1
S,23,2,0,0
N,12,0,0,1"""


with open('train.csv', 'w') as output:
    output.write(train_data_contents)

train_dataframe = pd.read_csv('train.csv')
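
The rows in train.csv above are written by hand. As a rough sketch of how such rows could be computed from a title, the hypothetical helper `token_features` below derives the four features from the question for every word. The conventions used here (whitespace tokenization, a 1-based position from the beginning, and the number of remaining words as the distance from the end) are assumptions for illustration, not something the question specifies.

```python
def token_features(title):
    """Return one feature dict per whitespace-separated word of the title."""
    words = title.split()
    rows = []
    for i, word in enumerate(words):
        rows.append({
            # 1-based position from the start of the title (assumed convention)
            "distance_from_beginning": i + 1,
            # number of words that follow this one (assumed convention)
            "distance_from_end": len(words) - 1 - i,
            # 1 if any character of the word is a digit, else 0
            "contains_digit": int(any(ch.isdigit() for ch in word)),
            # 1 if the first character is an uppercase letter, else 0
            "capitalized": int(word[0].isupper()),
        })
    return rows


title = "Apple iPhone 5 White 16GB Dual-Core"
rows = token_features(title)
for word, row in zip(title.split(), rows):
    print(word, row)
```

Each dict corresponds to one row of the CSV; pairing it with the manually assigned tag of that word gives a labelled training example.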

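Step 2: Process the data

We will convert our dataframe into numpy arrays, which is a format that scikit-learn understands.

We need to convert the labels "B", "M", "C", ... to numbers too, because the SVM does not understand strings.

Then we will train an SVM on the data (svm.SVC(), which uses an RBF kernel by default).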

import numpy as np

train_labels = train_dataframe.class_label
labels = list(set(train_labels))
train_labels = np.array([labels.index(x) for x in train_labels])
train_features = train_dataframe.iloc[:,1:]
train_features = np.array(train_features)

print("train labels:")
print(train_labels)
print()
print("train features:")
print(train_features)

We see here that the length of train_labels (5) exactly matches the number of rows in train_features. Each item in train_labels corresponds to a row.

Step 3: Train the SVM

from sklearn import svm
classifier = svm.SVC()
classifier.fit(train_features, train_labels)

Step 4: Evaluate the SVM on some testing data

test_data_contents = """
class_label,distance_from_beginning,distance_from_end,contains_digit,capitalized
B,1,10,1,0
M,10,1,0,1
C,2,3,0,1
S,23,2,0,0
N,12,0,0,1
"""

with open('test.csv', 'w') as output:
    output.write(test_data_contents)

test_dataframe = pd.read_csv('test.csv')

# Reuse the `labels` list built from the training data so that each class
# maps to the same number as it did during training.
test_labels = test_dataframe.class_label
test_labels = np.array([labels.index(x) for x in test_labels])

test_features = test_dataframe.iloc[:,1:]
test_features = np.array(test_features)

results = classifier.predict(test_features)
num_correct = (results == test_labels).sum()
# Fraction of test rows whose predicted label matches the true label.
accuracy = num_correct / len(test_labels)
print("model accuracy (%):", accuracy * 100)

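Links & Tips

Example code for how to load LinearSVC: http://scikit-learn.org/stable/modules/svm.html#svm
Long list of scikit-learn examples: http://scikit-learn.org/stable/auto_examples/index.html. I've found these mildly helpful, but often confusing myself.
If you find that the SVM is taking a long time to train, try LinearSVC instead: http://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html
Here's another tutorial on getting familiar with machine learning models: http://scikit-learn.org/stable/tutorial/basic/tutorial.html

You should be able to take this code, replace train.csv with your training data and test.csv with your testing data, and get predictions for your test data as well as accuracy results.

Note that since you're evaluating on the same data you trained on, the accuracy will be unusually high.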
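
To get a more honest estimate, the labelled rows can be split so that some of them are never seen during training. Below is a minimal sketch of such a hold-out split in plain Python; the helper name `holdout_split`, the 25% test fraction and the fixed seed are illustrative assumptions. scikit-learn also ships `train_test_split` in `sklearn.model_selection` for the same purpose.

```python
import random


def holdout_split(rows, test_fraction=0.25, seed=0):
    """Shuffle a copy of rows and split it into (train_rows, test_rows)."""
    shuffled = list(rows)
    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(round(len(shuffled) * test_fraction)))
    return shuffled[n_test:], shuffled[:n_test]


# Toy data: the five sample rows repeated four times (20 rows in total).
rows = [("B", 1, 10, 1, 0), ("M", 10, 1, 0, 1), ("C", 2, 3, 0, 1),
        ("S", 23, 2, 0, 0), ("N", 12, 0, 0, 1)] * 4
train_rows, test_rows = holdout_split(rows)
print(len(train_rows), len(test_rows))
```

The classifier would then be fitted on train_rows only and scored on test_rows.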

Answer 2

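I echo the comments of @MarcoPashkov but will try to elaborate on the LibSVM file format. I find the documentation comprehensive yet hard to find; for the Python lib I recommend the README on GitHub.

An important piece to recognize is that there is a sparse format, in which all features that are 0 get removed, and a dense format, in which features that are 0 are kept. The two snippets below, taken from the README, are equivalent examples of each.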

# Dense data
>>> y, x = [1,-1], [[1,0,1], [-1,0,-1]]
# Sparse data
>>> y, x = [1,-1], [{1:1, 3:1}, {1:-1,3:-1}]

The y variable stores a list of the categories, one per example.

The x variable stores the feature vectors.

assert len(y) == len(x), "Both lists should be the same length"

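The format found in the Heart Scale Example is the sparse format, where the dictionary key is the feature index and the dictionary value is the feature value, while the first value is the category.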
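
On disk, each example in that sparse format is one text line of the form `<label> <index>:<value> <index>:<value> ...`, with feature indices starting at 1 and zero-valued features left out. Here is a small sketch of how such lines could be produced from dense rows; the helper name `to_libsvm_line` and the toy rows are made up for illustration.

```python
def to_libsvm_line(label, features):
    """Format one example as '<label> <index>:<value> ...', skipping zeros."""
    pairs = ["%d:%g" % (index, value)
             for index, value in enumerate(features, start=1)  # 1-based indices
             if value != 0]
    return " ".join([str(label)] + pairs)


# Toy rows: (class index, [distance_from_beginning, distance_from_end,
#                          contains_digit, capitalized])
examples = [(1, [1, 10, 1, 0]), (2, [10, 1, 0, 1]), (5, [12, 0, 0, 1])]
lines = [to_libsvm_line(label, feats) for label, feats in examples]
print("\n".join(lines))
```

Writing these lines to a file should give something that `svm_read_problem` from svmutil can load back as the (y, x) pair.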

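The sparse format is incredibly useful when using a Bag of Words representation for your feature vector.

As most documents typically use only a very small subset of the words in the corpus, the resulting matrix will have many feature values that are zero (typically more than 99% of them).

For instance, a collection of 10,000 short text documents (such as emails) will use a vocabulary with a size on the order of 100,000 unique words in total, while each document will use 100 to 1000 unique words individually.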
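
As a toy illustration of why this comes out sparse, the sketch below builds Bag of Words vectors as `{feature_index: count}` dictionaries, the same shape as the sparse x shown earlier. The helper `bag_of_words`, the lowercasing and the whitespace tokenization are simplifying assumptions.

```python
def bag_of_words(documents):
    """Map each document to a sparse {feature_index: count} dict."""
    vocabulary = {}  # word -> 1-based feature index, in order of first appearance
    vectors = []
    for doc in documents:
        counts = {}
        for word in doc.lower().split():
            # Assign the next free index the first time a word is seen.
            index = vocabulary.setdefault(word, len(vocabulary) + 1)
            counts[index] = counts.get(index, 0) + 1
        vectors.append(counts)
    return vocabulary, vectors


docs = ["Apple iPhone 5 White 16GB", "Samsung Galaxy S4 White 16GB"]
vocabulary, vectors = bag_of_words(docs)
print(len(vocabulary), vectors)
```

Each dictionary only stores the words that actually occur in its document, so its size stays small even when the vocabulary grows.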

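For an example using the feature vector you started with, I trained a basic LibSVM 3.20 model. This code isn't meant to be used as-is, but it may help show how to create and test a model.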

from collections import namedtuple
# Using namedtuples for descriptive purposes, in actual code a normal tuple would work fine.
Category = namedtuple("Category", ["index", "name"])
Feature = namedtuple("Feature", ["category_index", "distance_from_beginning", "distance_from_end", "contains_digit", "capitalized"])

# Separate up the set of categories, libsvm requires a numerical index so we associate each with an index.
categories = dict()
for index, name in enumerate("B M C S NA".split(' ')):
    # Class labels only need to be numeric; we number them from 1 here.
    categories[name] = Category(index + 1, name)
categories

Out[0]: {'B': Category(index=1, name='B'),
   'C': Category(index=3, name='C'),
   'M': Category(index=2, name='M'),
   'NA': Category(index=5, name='NA'),
   'S': Category(index=4, name='S')}

# Faked set of CSV input for example purposes.
csv_input_lines = """category_index,distance_from_beginning,distance_from_end,contains_digit,capitalized
B,1,10,1,0
M,10,1,0,1
C,2,3,0,1
S,23,2,0,0
NA,12,0,0,1""".split("\n")
# We just ignore the header.
header = csv_input_lines[0]

# A list of Feature namedtuples, this will be trained as lists.
features = list()
for line in csv_input_lines[1:]:
    split_values = line.split(',')
    # Create a Feature with the values converted to integers.
    features.append(Feature(categories[split_values[0]].index, *map(int, split_values[1:])))

features

Out[1]: [Feature(category_index=1, distance_from_beginning=1, distance_from_end=10, contains_digit=1, capitalized=0),
 Feature(category_index=2, distance_from_beginning=10, distance_from_end=1, contains_digit=0, capitalized=1),
 Feature(category_index=3, distance_from_beginning=2, distance_from_end=3, contains_digit=0, capitalized=1),
 Feature(category_index=4, distance_from_beginning=23, distance_from_end=2, contains_digit=0, capitalized=0),
 Feature(category_index=5, distance_from_beginning=12, distance_from_end=0, contains_digit=0, capitalized=1)]

# Y is the category index used in training for each Feature. It is a list (order important) of all the trained indexes.
y = [f.category_index for f in features]
# X is the feature vector, for this we convert all the named tuple's values except the category which is at index 0.
x = [list(f)[1:] for f in features]

# Depending on how LibSVM was installed, this import may need to be `from libsvm.svmutil import ...` instead.
from svmutil import svm_parameter, svm_problem, svm_train, svm_predict
# Barebones defaults for SVM
param = svm_parameter()
# The (Y,X) parameters should be the train dataset.
prob = svm_problem(y, x)
model=svm_train(prob, param)

# For actual accuracy checking, the (Y,X) parameters should be the test dataset.
p_labels, p_acc, p_vals = svm_predict(y, x, model)

Out[3]: Accuracy = 100% (5/5) (classification)

I hope this example helps. It shouldn't be used for your actual training; it's meant as an example only, because it's inefficient.

Support vector machine in Python using libsvm: example of features
