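I want TfidfVectorizer's featurization to treat some predefined words, for example "script" and "rule", specially: they should be used only as part of bigrams, never as standalone unigrams.
Suppose I have the text "Script include is a script that has rule which has a business rule".
For that text, if I use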
tfidf = TfidfVectorizer(ngram_range=(1,2),stop_words='english')
I should get
['script include','business rule','include','business']
Answer 0 (score: 2)
from sklearn.feature_extraction import text
from sklearn.feature_extraction.text import TfidfVectorizer

# Given a vocabulary, return a filtered vocabulary that keeps only the
# n-grams which contain a token from include_list and which contain
# no stop words
def filter_vocab(full_vocab, include_list):
    b_list = list()
    for x in full_vocab:
        add = False
        for t in x.split():
            if t in text.ENGLISH_STOP_WORDS:
                add = False
                break
            if t in include_list:
                add = True
        if add:
            b_list.append(x)
    return b_list

# Get all the n-grams (one can also use nltk.util.ngrams)
ngrams = TfidfVectorizer(ngram_range=(1,2), norm=None, smooth_idf=False, use_idf=False)
X = ngrams.fit_transform(["Script include is a script that has rule which has a business rule"])
# scikit-learn >= 1.0; on older versions call get_feature_names() instead
full_vocab = ngrams.get_feature_names_out()
# Filter the full n-gram based vocabulary
filtered_v = filter_vocab(full_vocab, ["include", "business"])
# Get tf-idf using the new filtered vocabulary
vectorizer = TfidfVectorizer(ngram_range=(1,2), vocabulary=filtered_v)
X = vectorizer.fit_transform(["Script include is a script that has rule which has a business rule"])
v = vectorizer.get_feature_names_out()
print(list(v))
The code is commented to explain what it does.
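The filtering step in this answer can also be sketched without fitting a first vectorizer. The version below is only an illustration: `word_ngrams` is a hypothetical helper, the stop-word set is a tiny hand-picked assumption (scikit-learn's `text.ENGLISH_STOP_WORDS` is much larger), and the regular expression is the same pattern scikit-learn uses as its default `token_pattern`.

```python
import re

# Assumption: a small illustrative stop-word set, not scikit-learn's full list.
STOP_WORDS = {"is", "a", "that", "has", "which"}

def word_ngrams(doc, max_n=2):
    """Return all word n-grams of length 1..max_n, lower-cased (hypothetical helper)."""
    # Words of two or more characters, like scikit-learn's default token_pattern.
    tokens = re.findall(r"(?u)\b\w\w+\b", doc.lower())
    grams = []
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams

def filter_vocab(grams, include_list):
    """Keep n-grams that contain a word from include_list and no stop word."""
    kept = set()
    for gram in grams:
        words = gram.split()
        if any(w in STOP_WORDS for w in words):
            continue
        if any(w in include_list for w in words):
            kept.add(gram)
    return sorted(kept)

doc = "Script include is a script that has rule which has a business rule"
vocab = filter_vocab(word_ngrams(doc), ["include", "business"])
print(vocab)
```

A list built this way could then be passed as `vocabulary=` to `TfidfVectorizer(ngram_range=(1,2), ...)`, in the same way as `filtered_v` in the answer.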
Answer 1 (score: 0)
Basically, you are customizing n-gram creation based on your special words (called interested_words in the function). I have customized the default n-gram creation function for you.
import numpy as np

def custom_word_ngrams(tokens, stop_words=None, interested_words=None):
    """Turn tokens into a sequence of n-grams after stop words filtering"""
    original_tokens = tokens
    # positions of the stop words in the original token sequence
    stop_wrds_inds = np.where(np.isin(tokens, stop_words))[0]
    # unigrams: drop both the stop words and the special words
    tokens = [w for w in tokens if w not in stop_words + interested_words]
    n_original_tokens = len(original_tokens)
    # bind method outside of loop to reduce overhead
    tokens_append = tokens.append
    space_join = " ".join
    # bigrams: keep adjacent pairs in which neither word is a stop word
    for i in range(n_original_tokens - 1):
        if not any(np.isin(stop_wrds_inds, [i, i + 1])):
            tokens_append(space_join(original_tokens[i: i + 2]))
    return tokens

Now we can plug this function into TfidfVectorizer's analyzer parameter, as shown below:

from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.feature_extraction import text

def analyzer():
    base_vect = CountVectorizer()
    stop_words = list(text.ENGLISH_STOP_WORDS)
    preprocess = base_vect.build_preprocessor()
    tokenize = base_vect.build_tokenizer()
    # feed your special words list here
    return lambda doc: custom_word_ngrams(
        tokenize(preprocess(base_vect.decode(doc))), stop_words, ['script', 'rule'])

vectorizer = TfidfVectorizer(analyzer=analyzer())
vectorizer.fit(["Script include is a script that has rule which has a business rule"])
# scikit-learn >= 1.0; on older versions call get_feature_names() instead
print(list(vectorizer.get_feature_names_out()))
['business', 'business rule', 'include', 'script include']
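For readers who want to check the rule in isolation, here is a NumPy-free sketch of the same idea. It is not the answer's exact code: the function name `simple_word_ngrams` is hypothetical, it assumes the tokens are already lower-cased and tokenized, and the stop-word list is a small illustrative assumption.

```python
def simple_word_ngrams(tokens, stop_words, interested_words):
    """Unigrams without stop/special words, plus bigrams without stop words."""
    stop = set(stop_words)
    special = set(interested_words)
    # Unigrams: drop stop words and the special words.
    unigrams = [t for t in tokens if t not in stop and t not in special]
    # Bigrams: adjacent pairs from the original sequence with no stop word.
    bigrams = [
        f"{a} {b}"
        for a, b in zip(tokens, tokens[1:])
        if a not in stop and b not in stop
    ]
    return unigrams + bigrams

# Tokens as scikit-learn's default tokenizer would give them
# (single-character words such as "a" are dropped).
tokens = ["script", "include", "is", "script", "that", "has",
          "rule", "which", "has", "business", "rule"]
# Assumption: illustrative stop words only.
result = simple_word_ngrams(tokens, ["is", "that", "has", "which"], ["script", "rule"])
print(result)
```

Note that with this rule any two adjacent non-stop words form a bigram, whether or not one of them is a special word; the special words only affect which unigrams are dropped.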
Answer 2 (score: -1)
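TfidfVectorizer lets you supply your own tokenizer, so you can do something like the following. However, you will lose the information about the other words in the vocabulary.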
from sklearn.feature_extraction.text import TfidfVectorizer
corpus = ["Script include is a script that has rule which has a business rule"]
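# The tokenizer ignores its input and always returns the two special words
vectorizer = TfidfVectorizer(ngram_range=(1,2), tokenizer=lambda doc: ["script", "rule"], stop_words='english')
X = vectorizer.fit_transform(corpus)
# scikit-learn >= 1.0; on older versions call get_feature_names() instead
print(list(vectorizer.get_feature_names_out()))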
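To see concretely why the other words are lost, the following plain-Python sketch imitates what a constant tokenizer feeds into the n-gram step. The helper names are hypothetical and the n-gram function is only a stand-in for illustration, not scikit-learn's implementation.

```python
def constant_tokenizer(doc):
    # Mirrors the lambda above: the document text is never inspected.
    return ["script", "rule"]

def unigrams_and_bigrams(tokens):
    # Plain-Python stand-in for the (1, 2) n-gram step (hypothetical helper).
    bigrams = [" ".join(pair) for pair in zip(tokens, tokens[1:])]
    return tokens + bigrams

docs = [
    "Script include is a script that has rule which has a business rule",
    "A completely unrelated sentence about cats",
]
features = [unigrams_and_bigrams(constant_tokenizer(d)) for d in docs]
print(features[0])
print(features[0] == features[1])
```

Every document ends up with the same three features, so words such as "include" and "business" never reach the vocabulary, and the bigram "script rule" appears even though those words are not adjacent in the text.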