Question

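I have been working through this tutorial using text data in my native language (Japanese): https://marcotcr.github.io/lime/tutorials/Lime%20-%20multiclass.html

I basically followed the tutorial as is, except that I added a tokenizer to split each sentence into words.

During tokenization, I used a regular expression to skip some words, such as those containing digits.

My tokenizer: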

import re

import MeCab

def mecabing(text):
    # Path to the mecab-ipadic-neologd dictionary
    jisho = '-d /usr/local/lib/mecab/dic/mecab-ipadic-neologd'
    mecab = MeCab.Tagger(jisho)

    # Parse an empty string once before calling parseToNode
    mecab.parse('')
    select_conditions = ['名詞']  # select only nouns

    node = mecab.parseToNode(text)
    mecab_list = []
    while node:
        word = node.surface
        pos = node.feature.split(",")
        node = node.next
        # keep nouns, except dependent (非自立) ones
        if pos[0] in select_conditions and pos[1] != '非自立':
            # skip words containing digits
            if not re.search(r'[0-9]', word):
                mecab_list.append(word)
    return mecab_list
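
As a side note on the digit filter, here is a minimal sketch that is not part of the original tokenizer. It assumes the text may contain full-width digits (common in Japanese text); the helper names and the sample words are hypothetical. The pattern `[0-9]` only matches ASCII digits, while `\d` on a Python `str` also matches full-width digits.

```python
import re

def has_ascii_digit(word):
    # Same check as in mecabing: only ASCII 0-9 are detected.
    return re.search(r'[0-9]', word) is not None

def has_any_digit(word):
    # Hypothetical variant: \d also matches full-width digits such as '１'.
    return re.search(r'\d', word) is not None

# Hypothetical sample words; the last one uses full-width digits.
words = ['半蔵門', 'ボーン30', '平均１５００ポリゴン']

kept_ascii = [w for w in words if not has_ascii_digit(w)]
kept_any = [w for w in words if not has_any_digit(w)]

print(kept_ascii)
print(kept_any)
```

Whether this matters depends on the data: if the corpus only contains ASCII digits, both filters behave the same.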

When I run

vectorizer = sklearn.feature_extraction.text.TfidfVectorizer(analyzer=mecabing)
train_vectors = vectorizer.fit_transform(df["text"].values.astype('U'))
print(vectorizer.vocabulary_)

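I get a correctly tokenized and vectorized vocabulary, which I use to train a sklearn.naive_bayes.MultinomialNB classifier.

I then passed the trained model to lime.lime_text.LimeTextExplainer and printed the result.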

exp = explainer.explain_instance(df["text"][idx], NBmodel.predict_proba, num_features=6, labels=[0, 2])
print('\n'.join(map(str, exp.as_list(label=0))))

However, the result shows more than tokenized words: some entries are entire phrases.

For example, the result looks like this.

('ファミ通を発行しているエンターブレインと同じビル', 0.006904584676500223)
('ボーン30', 0.006423075274125261)
('平均1500ポリゴン', 0.006027007490568979)
('半蔵門', 0.005345011950591051)
('PSP', 0.005324184348436766)
('キャラモデル', 0.0052622796379719755)

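The top entry is a whole phrase that should have been tokenized into five words, and some entries contain digits. Neither that phrase nor the words with digits appear in vectorizer.vocabulary_.

I don't understand why this happens. Any hints would be appreciated.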
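
One possible explanation, stated as an assumption rather than a confirmed answer: LimeTextExplainer splits the raw text itself, by default with the regular expression `\W+`, and does not use the vectorizer's analyzer. In Python, Japanese characters and digits count as word characters, so a run of Japanese text without punctuation or spaces stays in one piece. The sketch below only imitates that splitting step with the standard library; the function name and the sample sentence (assembled from the output above) are hypothetical.

```python
import re

def split_like_default(text):
    # Split on runs of non-word characters, dropping empty pieces.
    # Kana, kanji, and digits all match \w, so they are never split apart.
    return [t for t in re.split(r'\W+', text) if t]

# Hypothetical sample built from the phrases in the output above,
# joined with Japanese punctuation.
sample = 'ファミ通を発行しているエンターブレインと同じビル、半蔵門。平均1500ポリゴン'

tokens = split_like_default(sample)
print(tokens)
# → ['ファミ通を発行しているエンターブレインと同じビル', '半蔵門', '平均1500ポリゴン']
```

If this is indeed the cause, recent versions of lime document that `split_expression` may also be a callable returning a list of tokens, so passing the same tokenizer there might align the explanation with the vocabulary. I have not verified this against the setup in the question.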

LIME does not use the vectorized words to explain the classification model
