Question

I have about 300k tweets, each of which has either no labels or at most four labels. For example:

    1.] "I really sci-fi documentaries and movies" ; ["science", "movies"]
    2.] "The international politics scene is getting dirty"; ["politics"]
    3.] "I dont know what to say"; [null]
    4.] "I dont have any interest in national political debates on tv, I'd rather watch science shows like cosmos or sports like soccer, baseball"; ["sports", "science", "politics"]

So far I have been using Naive Bayes, and during training I used only one label per tweet (rather than multiple labels):

    1.] "I really sci-fi documentaries and movies" ; ["science"]
    2.] "The international politics scene is getting dirty"; ["politics"]
    3.] "I dont know what to say"; [null]
    4.] "I dont have any interest in national political debates on tv, I'd rather watch science shows like cosmos or sports like soccer, baseball"; ["politics"]

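But as you can see, what I actually want is "multi-label" classification. I started with Naive Bayes because I found a really good tutorial that was easy to follow to get started, but I cannot find a Python tutorial anywhere that addresses my actual "multi-label" problem. All I can find are research papers or suggestions about algorithms (KNN, Multinomial NB, etc.). Can anyone help me?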

Answer 1

You can try this, but it is still naive.

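    import numpy as np

    # weights: float matrix of size N x M; N is the number of classes, M the number of features.
    # labels: the set of class indices associated with a tweet (empty if it has none).
    def predict(weights, features):
        # Assumption: arg max over the dot products, plus any other class that scores above zero.
        scores = weights @ features
        return {int(np.argmax(scores))} | {int(c) for c in np.flatnonzero(scores > 0)}

    def train(weights, tweets, extract_features):
        for tweet, labels in tweets:
            # You have to write the feature extraction function (returns a length-M vector).
            features = extract_features(tweet)
            for prediction in predict(weights, features):
                # Simple additive procedure: move the decision boundary by -1, +1.
                if prediction not in labels:
                    weights[prediction] -= features   # subtract from the wrongly predicted class
                    for correct in labels:
                        weights[correct] += features  # add to the correct class(es)
        return weights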
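
For a concrete picture of the update loop above, here is a minimal self-contained sketch in plain Python, run on the four example tweets from the question. The bag-of-words `extract_features`, the "score above zero" rule in `predict`, and the extra update for true labels that were missed are all assumptions made for illustration, not part of the procedure as originally described. Untagged tweets are represented by an empty set.

```python
from collections import defaultdict

def extract_features(tweet):
    """Hypothetical feature extractor: lowercase bag-of-words counts."""
    features = defaultdict(float)
    for token in tweet.lower().split():
        features[token.strip(".,;!?\"'")] += 1.0
    return features

def score(weights, label, features):
    """Dot product between one class's weight vector and the features."""
    return sum(weights[label][f] * v for f, v in features.items())

def predict(weights, classes, features):
    """Assumed decision rule: predict every class whose score is above zero."""
    return {c for c in classes if score(weights, c, features) > 0}

def train(tweets, classes, epochs=10):
    """Additive perceptron-style updates, one weight vector per class."""
    weights = {c: defaultdict(float) for c in classes}
    for _ in range(epochs):
        for tweet, labels in tweets:
            features = extract_features(tweet)
            predicted = predict(weights, classes, features)
            for wrong in predicted - labels:    # predicted, but not a true label
                for f, v in features.items():
                    weights[wrong][f] -= v
            for missed in labels - predicted:   # true label that was not predicted
                for f, v in features.items():
                    weights[missed][f] += v
    return weights

tweets = [
    ("I really sci-fi documentaries and movies", {"science", "movies"}),
    ("The international politics scene is getting dirty", {"politics"}),
    ("I dont know what to say", set()),
    ("I dont have any interest in national political debates on tv, "
     "I'd rather watch science shows like cosmos or sports like soccer, baseball",
     {"sports", "science", "politics"}),
]
classes = ["science", "movies", "politics", "sports"]

weights = train(tweets, classes)
for tweet, labels in tweets:
    print(sorted(predict(weights, classes, extract_features(tweet))), "<-", tweet[:40])
```

Because each class keeps its own weight vector, a tweet can end up with zero, one, or several predicted labels, which is the multi-label behaviour the question asks for. On 300k real tweets you would want a proper tokenizer and a held-out set to check the predictions.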
