Using an SVM to assign a single label to an entire document

Date: 2015-04-06 12:28:27

Tags: machine-learning nlp svm text-classification

I would like to know how to train an SVM that takes an entire document as input and assigns a single label to that input document. So far I have only labeled text word by word and sentence by sentence. For example, an input document might contain 6 to 10 sentences, and the whole document should be labeled with a single class for training.

1 answer:

Answer 0 (score: 1):

The basic approach is as follows:

  1. Create a list of training documents and their labels/classes.
  2. Tokenize your training documents.
  3. Remove stopwords from the documents.
  4. Compute TF-IDF values for your documents (a short sketch of the weighting follows this list).
  5. Limit the TF-IDF vectors to the N most frequent terms, e.g. N = 1000.
  6. Train an SVM on the limited TF-IDF data and the labels.
  7. You then have a classifier that maps documents in TF-IDF format to class labels, so you can classify a test document after converting it to the same TF-IDF representation.
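
For step 4, here is a minimal sketch of how a TF-IDF weight can be computed, assuming raw term counts and the smoothed IDF variant that scikit-learn's TfidfVectorizer uses by default (scikit-learn additionally L2-normalizes each document vector, which is omitted here). The toy corpus is made up purely for illustration:

    import math
    from collections import Counter

    # Hypothetical toy corpus, already tokenized (not from the original answer)
    docs = [["the", "fox", "jumped"],
            ["the", "city", "sleeps"],
            ["a", "fox", "in", "the", "city"]]

    n = len(docs)
    # Document frequency: in how many documents each term occurs
    df = Counter(term for doc in docs for term in set(doc))

    def tfidf(doc):
        tf = Counter(doc)  # raw term counts within this document
        # Smoothed IDF: ln((1 + n) / (1 + df)) + 1
        return {t: tf[t] * (math.log((1 + n) / (1 + df[t])) + 1) for t in tf}

    print(tfidf(docs[0]))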

Below is an example in Python using scikit-learn, in which an SVM classifies documents as being about foxes or about cities:

    from sklearn import svm
    from sklearn.feature_extraction.text import TfidfVectorizer
    
    # Training examples (already tokenized, 6x fox and 6x city)
    docs_train = [
        "The fox jumped over the fence .",
        "The fox sleeps under the tree .",
        "A fox walks through the high grass .",
        "Didn 't see a single fox today .",
        "I saw a fox yesterday near the lake .",
        "You might encounter foxes at the lake .",
    
        "New York City is full of skyscrapers .",
        "Los Angeles is a city on the west coast .",
        "I 've been to Los Angeles before .",
        "Let 's travel to Mexico City .",
        "There are no skyscrapers in Washington .",
        "Washington is a beautiful city ."
    ]
    
    # Test examples (already tokenized, 2x fox and 2x city)
    docs_test = [
        "There 's a fox in the garden .",
        "Did you see the fox next to the tree ?",
        "What 's the shortest way to Los Alamos ?",
        "Traffic in New York is a pain"
    ]
    
    # Labels of training examples (6x fox and 6x city)
    y_train = ["fox", "fox", "fox", "fox", "fox", "fox",
               "city", "city", "city", "city", "city", "city"]
    
    # Convert training and test examples to TFIDF
    # The vectorizer also removes stopwords and converts the texts to lowercase.
    vectorizer = TfidfVectorizer(max_df=1.0, max_features=10000,
                                 min_df=1, stop_words='english')
    
    # Learn the vocabulary and IDF weights from all documents (train + test)
    vectorizer.fit(docs_train + docs_test)
    
    X_train = vectorizer.transform(docs_train)
    X_test = vectorizer.transform(docs_test)
    
    # Train an SVM on TFIDF data of the training documents
    clf = svm.SVC()
    clf.fit(X_train, y_train)
    
    # Test the SVM on TFIDF data of the test documents
    print(clf.predict(X_test))
    

The output is as expected (2x fox and 2x city):

    ['fox' 'fox' 'city' 'city']
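
To classify a further, unseen document later on (step 7 above), only the already-fitted vectorizer and classifier are needed. A minimal sketch; the example sentence is made up and not from the original answer:

    # Convert the new document to the same TF-IDF representation the SVM was trained on
    new_doc = ["There is a fox behind the old barn ."]
    X_new = vectorizer.transform(new_doc)
    print(clf.predict(X_new))  # should print something like ['fox']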