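I am doing multi-label classification with MEKA. Some of the features I use are extracted with scikit-learn. TF-IDF and character n-grams work, but when I use word n-grams the classifier produces nothing meaningful (it always sets the last label to true and all the other labels to false). I don't understand why it fails with word n-grams, because I previously did everything with scikit-learn and the code is not really different: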
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# BOW with TF-IDF
def bowTfIdf(data):
    # create Vectorizer
    vectorizer = TfidfVectorizer()
    # create Vocabulary
    vectorizer.fit(data)
    # create vector/matrix
    featureVector = vectorizer.transform(data)
    return featureVector

# N-Grams
def bowNGram(data, n):
    # create Vectorizer (word n-grams of exactly length n)
    vectorizer = CountVectorizer(ngram_range=(n, n))
    # create Vocabulary
    vectorizer.fit(data)
    # create vector/matrix
    featureVector = vectorizer.transform(data)
    return featureVector

# Character N-Grams
def bowCharacterNGram(data, n):
    # create Vectorizer (character n-grams inside word boundaries)
    vectorizer = CountVectorizer(analyzer='char_wb', ngram_range=(n, n))
    # create Vocabulary
    vectorizer.fit(data)
    # create vector/matrix
    featureVector = vectorizer.transform(data)
    return featureVector
I also checked the vocabularies, and they all appear to be generated correctly.
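A minimal sketch of such a check, written in plain Python so it runs without scikit-learn. The helper names (`word_ngrams`, `char_ngrams`, `document_frequency`, `shared_fraction`) and the toy corpus are hypothetical, and the tokenization is a simplification of what `CountVectorizer` does (it ignores the default token pattern and the word-boundary padding of `char_wb`). It counts how many documents each n-gram occurs in, which gives a rough idea of how sparse word bigrams can be compared with character bigrams:

```python
from collections import Counter


def word_ngrams(text, n):
    """Return the word n-grams of a lowercased, whitespace-split text."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def char_ngrams(text, n):
    """Return the character n-grams of a lowercased text."""
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def document_frequency(docs, extract, n):
    """Count, for every n-gram, the number of documents it occurs in."""
    df = Counter()
    for doc in docs:
        # set() so that each document contributes at most once per n-gram
        df.update(set(extract(doc, n)))
    return df


def shared_fraction(df):
    """Fraction of n-grams that occur in more than one document."""
    if not df:
        return 0.0
    return sum(1 for count in df.values() if count > 1) / len(df)


# Hypothetical toy corpus, only for illustration (not the real data).
toy_docs = [
    "the cat sat on the mat",
    "the dog sat on the rug",
    "a bird flew over the house",
]

word_df = document_frequency(toy_docs, word_ngrams, 2)
char_df = document_frequency(toy_docs, char_ngrams, 2)

print("word bigrams:", len(word_df), "shared fraction:", shared_fraction(word_df))
print("char bigrams:", len(char_df), "shared fraction:", shared_fraction(char_df))
```

If almost every word n-gram turns out to occur in a single document, the resulting features carry very little signal that generalizes, which might be worth ruling out before looking at MEKA itself.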
After extracting the features, I create an ARFF file (including the labels) and then call MEKA. The code for that is as follows:
import subprocess
import sys

from scipy.sparse import csr_matrix
# assumption: save_to_arff is the helper from scikit-multilearn
from skmultilearn.dataset import save_to_arff

# X and Y both scipy.csr_matrix
def createArffandCallMEKA(X, Y, classifier, arff_path, classifier_path):
    # write features and labels to a sparse ARFF file, labels in the last columns
    save_to_arff(X, Y, save_sparse=True, filename=arff_path, label_location="end")
    # options for the Weka base classifiers
    j48 = "-W weka.classifiers.trees.J48 -- -C 0.25 -M 2"
    randomTree = "-W weka.classifiers.trees.RandomTree -- -K 0 -M 1.0 -V 0.001 -S 1"
    randomForest = "-W weka.classifiers.trees.RandomForest -- -I 100 -K 0 -S 1 -num-slots 1"
    naiveBayes = "-W weka.classifiers.bayes.NaiveBayes"
    # CHANGED TO 9.2
    meka_command = r'java -cp "C:/Users/****/scikit_ml_learn_data/meka/meka-release-1.9.2/lib/*" meka.classifiers.multilabel.meta.CM -verbosity 6 -t "' + arff_path + r'" -R -d "' + classifier_path + r'" -I 10 -W meka.classifiers.multilabel.CC -- -S 0 '
    if classifier == "J48":
        meka_command += j48
    elif classifier == "RandomTree":
        meka_command += randomTree
    elif classifier == "RandomForest":
        meka_command += randomForest
    elif classifier == "NaiveBayes":
        meka_command += naiveBayes
    pipes = subprocess.Popen(meka_command,
                             stdout=subprocess.PIPE,
                             stderr=subprocess.PIPE,
                             universal_newlines=True)
    output, error = pipes.communicate()
    # with universal_newlines=True these are already str; kept as a safeguard
    if isinstance(output, bytes):
        output = output.decode(sys.stdout.encoding)
    if isinstance(error, bytes):
        error = error.decode(sys.stdout.encoding)
    if pipes.returncode != 0:
        print(output + error)
    else:
        print(output)

X = bowNGram(data, 2)
Y = csr_matrix(labels)
createArffandCallMEKA(X, Y, "J48", arff_path, "C:/Users/****/Desktop/Classifier")
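As a side note, the command construction above could also be expressed as an argument list instead of one concatenated string, which avoids manual quoting of the paths. The following is only a sketch: `build_meka_command` and `BASE_CLASSIFIERS` are hypothetical names, the paths in the usage comment are placeholders, and the flags are copied unchanged from the command string above. Unlike the original, it raises an error for an unknown classifier name instead of silently appending nothing:

```python
import subprocess

# Base-classifier options, copied from the strings in the function above.
BASE_CLASSIFIERS = {
    "J48": ["-W", "weka.classifiers.trees.J48", "--", "-C", "0.25", "-M", "2"],
    "RandomTree": ["-W", "weka.classifiers.trees.RandomTree", "--",
                   "-K", "0", "-M", "1.0", "-V", "0.001", "-S", "1"],
    "RandomForest": ["-W", "weka.classifiers.trees.RandomForest", "--",
                     "-I", "100", "-K", "0", "-S", "1", "-num-slots", "1"],
    "NaiveBayes": ["-W", "weka.classifiers.bayes.NaiveBayes"],
}


def build_meka_command(meka_lib, arff_path, classifier_path, classifier):
    """Return the MEKA call as an argument list (hypothetical helper)."""
    if classifier not in BASE_CLASSIFIERS:
        raise ValueError("unknown classifier: " + classifier)
    return [
        "java", "-cp", meka_lib,
        "meka.classifiers.multilabel.meta.CM",
        "-verbosity", "6",
        "-t", arff_path,
        "-R",
        "-d", classifier_path,
        "-I", "10",
        "-W", "meka.classifiers.multilabel.CC", "--", "-S", "0",
    ] + BASE_CLASSIFIERS[classifier]


# Usage (placeholder paths; requires Java and MEKA to actually run):
# cmd = build_meka_command("path/to/meka/lib/*", "data.arff", "model", "J48")
# result = subprocess.run(cmd, capture_output=True, text=True)
# print(result.stdout if result.returncode == 0 else result.stdout + result.stderr)
```

Passing a list to `subprocess.run` hands each argument to the Java process separately, so paths containing spaces need no extra quotes.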
So this works fine for TF-IDF and character n-grams, but not for word n-grams. Could this be a problem with MEKA?