How to accurately classify text against a large number of potential values using scikit?

Date: 2016-03-10 14:33:49

Tags: python machine-learning

I want to find various blacklisted terms within a corpus of text paragraphs. Each term is roughly 1-5 words long and contains certain keywords I do not want in my document corpus. If such a term, or anything similar to it, is identified in the corpus, I want it removed from my corpus.

Removal aside, I am struggling to accurately identify these terms in my corpus. I am using scikit-learn and have tried two separate approaches:

  1. A MultinomialNB classification approach using tf-idf vector features, trained on a mix of blacklisted terms and clean terms (a rough sketch of this approach appears after the code block below).

  2. A OneClassSVM approach, where only the blacklisted keywords are used as training data, and any incoming text that does not look similar to the blacklisted terms is treated as an outlier.

Here is the code for my OneClassSVM approach:

    import pandas as pd
    from sklearn.pipeline import Pipeline
    from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
    from sklearn.svm import OneClassSVM
    from sklearn.cross_validation import KFold  # pre-0.18 scikit-learn cross-validation API
    
    df = pd.read_csv("keyword_training_blacklist.csv")
    
    keywords_list = df['Keyword']
    
    pipeline = Pipeline([
        ('vect', CountVectorizer(analyzer='char_wb', max_df=0.75, min_df=1, ngram_range=(1, 5))),
        # strings to token integer counts
        ('tfidf', TfidfTransformer(use_idf=False, norm='l2')),  # integer counts to normalized term-frequency vectors
        ('clf', OneClassSVM(nu=0.1, kernel="rbf", gamma=0.1)),  # one-class SVM trained on the weighted vectors
    ])
    
    kf = KFold(len(keywords_list), 8)
    for train_index, test_index in kf:
        # make training and testing datasets
        X_train, X_test = keywords_list[train_index], keywords_list[test_index]
    
        pipeline.fit(X_train)  # fit the one-class model on blacklisted terms only (no labels)
        predicted = pipeline.predict(X_test)
        print(predicted[predicted == 1].size / predicted.size)  # fraction of held-out terms classified as inliers
    
    csv_df = pd.read_csv("corpus.csv")
    
    testCorpus = csv_df['Terms']
    
    testCorpus = testCorpus.drop_duplicates()
    
    for s in testCorpus:
        if pipeline.predict([s])[0] == 1:  # 1 = looks like a blacklisted term
            print(s)
    

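For reference, approach (1) looked roughly like this; the file name and column names here are placeholders, not my exact data:

    # Rough sketch of approach (1): MultinomialNB on tf-idf features.
    # "keyword_training_mixed.csv" and the 'Keyword'/'Label' columns are
    # illustrative placeholders for a file mixing blacklisted and clean terms.
    import pandas as pd
    from sklearn.pipeline import Pipeline
    from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
    from sklearn.naive_bayes import MultinomialNB
    
    train_df = pd.read_csv("keyword_training_mixed.csv")
    X, y = train_df['Keyword'], train_df['Label']  # y: 1 = blacklisted, 0 = clean
    
    nb_pipeline = Pipeline([
        ('vect', CountVectorizer(analyzer='char_wb', ngram_range=(1, 5))),
        ('tfidf', TfidfTransformer()),
        ('clf', MultinomialNB()),
    ])
    nb_pipeline.fit(X, y)
    print(nb_pipeline.predict(["some incoming term"]))
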
In practice, when I try to pass the corpus to the algorithm, I get many false positives. My blacklisted-term training data is around 3000 terms. Does my training data need to be increased further, or am I missing something obvious?

1 Answer:

Answer 0 (score: 2)

Try using difflib to find, for each of your blacklisted terms, the closest matches in your corpus.

    import difflib
    from nltk.util import ngrams
    
    words = corpus.split(' ')  # split corpus into words on spaces (can be improved)
    
    words_ngrams = []  # ngrams of 1 to 5 words, joined back into strings
    for n in range(1, 6):
        words_ngrams.extend(' '.join(gram) for gram in ngrams(words, n))
    
    to_delete = []  # will contain tuples (index, length) of matched terms to delete from corpus
    sim_rate = 0.8  # similarity rate
    max_matches = 4  # maximum number of matches for each term
    for term in terms:  # terms = your list of blacklisted terms
        matches = difflib.get_close_matches(term, words_ngrams, n=max_matches, cutoff=sim_rate)
        for match in matches:
            to_delete.append((corpus.index(match), len(match)))
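The snippet above only records where the matches are. A minimal sketch of actually removing them, assuming `corpus` is a plain string and `to_delete` holds the (index, length) tuples collected above, could be:

    # Apply the deletions, working from the end of the string backwards so
    # that earlier (index, length) offsets stay valid as text is removed.
    for start, length in sorted(to_delete, reverse=True):
        corpus = corpus[:start] + corpus[start + length:]
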

If you want a similarity score between a term and an ngram, you can also use difflib.SequenceMatcher.
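
For example, a quick way to get a similarity ratio between a term and an n-gram (the strings below are just illustrative):

    from difflib import SequenceMatcher
    
    # ratio() returns a float in [0, 1]; higher means more similar
    similarity = SequenceMatcher(None, "blacklisted term", "black listed term").ratio()
    print(similarity)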