Meka - Multi-label classification with word n-grams

Time: 2019-04-01 16:40:10

Tags: python machine-learning scikit-learn weka n-gram

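I am doing multi-label classification with MEKA. Some of the features I use are extracted with scikit-learn. TF-IDF and character n-grams work, but when I use word n-grams the classifier does not produce anything meaningful (it always sets the last label to true and all other labels to false). I don't know why it fails with word n-grams, because I previously did everything with scikit-learn and the code is not really different: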

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# BOW with TF-IDF
def bowTfIdf(data):

    # create Vectorizer
    vectorizer = TfidfVectorizer()

    # create Vocabulary
    vectorizer.fit(data)

    # create vector/matrix
    featureVector = vectorizer.transform(data)

    return featureVector

# Word N-Grams
def bowNGram(data, n):

    # create Vectorizer (default analyzer='word', so these are word n-grams)
    vectorizer = CountVectorizer(ngram_range=(n, n))

    # create Vocabulary
    vectorizer.fit(data)

    # create vector/matrix
    featureVector = vectorizer.transform(data)

    return featureVector


# Character N-Grams
def bowCharacterNGram(data, n):

    # create Vectorizer
    vectorizer = CountVectorizer(analyzer='char_wb', ngram_range=(n, n))

    # create Vocabulary
    vectorizer.fit(data)

    # create vector/matrix
    featureVector = vectorizer.transform(data)

    return featureVector

I also checked whether the vocabularies are generated correctly in every case, and that appears to be so.
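A vocabulary check of this kind could look like the sketch below. It is not the code from the question: the toy corpus `docs` and the helper `inspect_vocabulary` are hypothetical, and only `CountVectorizer` and its fitted `vocabulary_` attribute come from scikit-learn.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus standing in for the real `data`.
docs = ["the cat sat on the mat", "the dog sat on the log"]

def inspect_vocabulary(docs, n, analyzer="word"):
    """Fit a CountVectorizer and return its n-grams in column order plus the matrix."""
    vectorizer = CountVectorizer(analyzer=analyzer, ngram_range=(n, n))
    matrix = vectorizer.fit_transform(docs)
    # vocabulary_ maps each n-gram to its column index in the matrix.
    terms = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
    return terms, matrix

terms, matrix = inspect_vocabulary(docs, 2)
print(terms)         # word bigrams, one per matrix column
print(matrix.shape)  # (number of documents, number of distinct bigrams)
```

Note that the default word analyzer only keeps tokens of two or more alphanumeric characters, so single-character words never appear in the vocabulary.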
After extracting the features, I create an ARFF file (which also contains the labels) and then call MEKA. The code for this is as follows:

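import subprocess
import sys

from scipy.sparse import csr_matrix
from skmultilearn.dataset import save_to_arff  # assumed: scikit-multilearn's ARFF writer

# X and Y are both scipy.sparse.csr_matrix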
def createArffandCallMEKA(X, Y, classifier, arff_path, classifier_path):
    save_to_arff(X, Y, save_sparse=True, filename=arff_path, label_location="end")
    j48 = "-W weka.classifiers.trees.J48 -- -C 0.25 -M 2"
    randomTree = "-W weka.classifiers.trees.RandomTree -- -K 0 -M 1.0 -V 0.001 -S 1"
    randomForest = "-W weka.classifiers.trees.RandomForest -- -I 100 -K 0 -S 1 -num-slots 1"
    naiveBayes = "-W weka.classifiers.bayes.NaiveBayes"
    # changed to MEKA 1.9.2
    meka_command =r'java -cp "C:/Users/****/scikit_ml_learn_data/meka/meka-release-1.9.2/lib/*" meka.classifiers.multilabel.meta.CM -verbosity 6 -t "' + arff_path + r'" -R -d "' + classifier_path + r'" -I 10 -W meka.classifiers.multilabel.CC -- -S 0  '
    if classifier == "J48":
        meka_command += j48
    elif classifier == "RandomTree":
        meka_command += randomTree
    elif classifier == "RandomForest":
        meka_command += randomForest
    elif classifier == "NaiveBayes":
        meka_command += naiveBayes
    pipes = subprocess.Popen(meka_command,
                             stdout=subprocess.PIPE,
                             stderr=subprocess.PIPE,
                             universal_newlines=True)
    output, error = pipes.communicate()
    # universal_newlines=True already yields str, so no decoding is needed
    if pipes.returncode != 0:
        print(output + error)
    else:
        print(output)

X = bowNGram(data,2)
Y = csr_matrix(labels)
createArffandCallMEKA(X, Y, "J48", arff_path, "C:/Users/****/Desktop/Classifier" )
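In the function above, a classifier name that matches none of the four branches leaves the command without a base classifier and raises no error. As a hedged alternative, the sketch below builds the same command string from a dictionary lookup and fails loudly on an unknown name. The function `build_meka_command` and its `meka_lib` parameter are hypothetical; the option strings are copied from the code above, and nothing is executed here.

```python
# Base-classifier options copied from the code above.
BASE_CLASSIFIERS = {
    "J48": "-W weka.classifiers.trees.J48 -- -C 0.25 -M 2",
    "RandomTree": "-W weka.classifiers.trees.RandomTree -- -K 0 -M 1.0 -V 0.001 -S 1",
    "RandomForest": "-W weka.classifiers.trees.RandomForest -- -I 100 -K 0 -S 1 -num-slots 1",
    "NaiveBayes": "-W weka.classifiers.bayes.NaiveBayes",
}

def build_meka_command(meka_lib, arff_path, classifier_path, classifier):
    """Return the MEKA command line as a string; raise on an unknown classifier."""
    if classifier not in BASE_CLASSIFIERS:
        raise ValueError(f"unknown base classifier: {classifier!r}")
    return (
        f'java -cp "{meka_lib}/*" meka.classifiers.multilabel.meta.CM '
        f'-verbosity 6 -t "{arff_path}" -R -d "{classifier_path}" -I 10 '
        f'-W meka.classifiers.multilabel.CC -- -S 0 '
        + BASE_CLASSIFIERS[classifier]
    )

# Hypothetical paths, for illustration only.
print(build_meka_command("meka/lib", "features.arff", "model", "J48"))
```

The returned string can be passed to `subprocess.Popen` in the same way as `meka_command` above.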

So this works fine for TF-IDF and character n-grams, but not for word n-grams. Could it be a problem with MEKA?
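One way to narrow down whether the difference lies in the features or in MEKA is to compare the matrices the extractors produce before they are written to ARFF. The sketch below is only a diagnostic aid under assumed toy data, not an explanation of the behaviour: `docs` and `describe` are hypothetical, and it reports the matrix shape, its density, and how many features occur in exactly one document.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus standing in for the real `data`.
docs = ["the cat sat on the mat", "the dog sat on the log"]

def describe(matrix):
    """Summarise a sparse document-term matrix."""
    n_docs, n_features = matrix.shape
    # Stored entries per column, i.e. the number of documents containing each feature.
    doc_freq = np.diff(matrix.tocsc().indptr)
    return {
        "n_features": n_features,
        "density": matrix.nnz / (n_docs * n_features),
        "singletons": int((doc_freq == 1).sum()),
    }

word_bigrams = CountVectorizer(ngram_range=(2, 2)).fit_transform(docs)
char_bigrams = CountVectorizer(analyzer="char_wb", ngram_range=(2, 2)).fit_transform(docs)

print("word bigrams:", describe(word_bigrams))
print("char bigrams:", describe(char_bigrams))
```

If the word n-gram matrix turns out to be far wider and sparser than the other two on the real data, that would be one thing to look at before suspecting MEKA itself.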

0 answers:

No answers