Using sklearn TfidfVectorizer with already tokenized input?

Time: 2018-02-07 18:53:35

Tags: scikit-learn tfidfvectorizer

I have a list of tokenized sentences and would like to fit a TfidfVectorizer on it. I tried the following:

from sklearn.feature_extraction.text import TfidfVectorizer

tokenized_list_of_sentences = [['this', 'is', 'one'], ['this', 'is', 'another']]

def identity_tokenizer(text):
    return text

tfidf = TfidfVectorizer(tokenizer=identity_tokenizer, stop_words='english')
tfidf.fit_transform(tokenized_list_of_sentences)

which fails with

AttributeError: 'list' object has no attribute 'lower'

Is there a way to do this? I have a billion sentences and do not want to tokenize them again. They were already tokenized in an earlier stage.

3 Answers:

Answer 0: (score: 6)

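Try initializing the TfidfVectorizer object with the parameter lowercase=False (assuming this is actually what you want, since you already lowercased your tokens in an earlier stage).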

from sklearn.feature_extraction.text import TfidfVectorizer

tokenized_list_of_sentences = [['this', 'is', 'one', 'basketball'], ['this', 'is', 'a', 'football']]

def identity_tokenizer(text):
    # The input is already a list of tokens, so return it unchanged.
    return text

tfidf = TfidfVectorizer(tokenizer=identity_tokenizer, stop_words='english', lowercase=False)
tfidf.fit_transform(tokenized_list_of_sentences)

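Note that I changed the sentences, because the originals apparently contain only stop words, which causes a different error due to an empty vocabulary.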
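To make the empty-vocabulary point concrete, here is a minimal sketch that fits the same vectorizer on both sets of sentences. The helper make_vectorizer and the variable names are only for illustration; the ValueError is what scikit-learn raises when no terms survive stop-word filtering.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

def identity_tokenizer(tokens):
    # The documents are already lists of tokens, so pass them through as-is.
    return tokens

def make_vectorizer():
    # Hypothetical helper: builds the vectorizer configured as in this answer.
    return TfidfVectorizer(tokenizer=identity_tokenizer,
                           stop_words='english',
                           lowercase=False)

# The sentences from the question: every token is an English stop word.
only_stop_words = [['this', 'is', 'one'], ['this', 'is', 'another']]
try:
    make_vectorizer().fit_transform(only_stop_words)
    fit_failed = False
except ValueError as exc:
    # Nothing is left after stop-word removal, so the vocabulary is empty.
    fit_failed = True
    print('fit failed:', exc)

# The modified sentences keep one non-stop word each.
with_content = [['this', 'is', 'one', 'basketball'],
                ['this', 'is', 'a', 'football']]
tfidf = make_vectorizer()
matrix = tfidf.fit_transform(with_content)
vocab = sorted(tfidf.vocabulary_)
print(vocab, matrix.shape)
```

scikit-learn may emit warnings here about token_pattern being unused or about stop words being inconsistent with the custom tokenizer; as far as I can tell these do not affect the result for pre-tokenized input.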

Answer 1: (score: 1)

Try using preprocessor instead of tokenizer.

    return lambda x: strip_accents(x.lower())
AttributeError: 'list' object has no attribute 'lower'

If x in the error message above is a list, then calling x.lower() on it raises this error, because lists have no lower() method.
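The failure can be reproduced in isolation. The sketch below calls the vectorizer's build_preprocessor() directly on a token list; the variable names are mine and only for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

tokens = ['this', 'is', 'one']

# The default preprocessor lowercases its input, which assumes a string.
default_preprocess = TfidfVectorizer().build_preprocessor()
try:
    default_preprocess(tokens)
    error_name = None
except AttributeError as exc:
    error_name = type(exc).__name__
    print(error_name, exc)

# A custom preprocessor replaces the default one entirely, so ' '.join
# receives the token list and hands a plain string to the tokenizer.
join_preprocess = TfidfVectorizer(preprocessor=' '.join).build_preprocessor()
joined = join_preprocess(tokens)
print(joined)
```

Keep in mind that joining the tokens means the default tokenizer splits the string again afterwards. The default token_pattern only keeps tokens of two or more word characters, so the resulting tokens may differ from your original ones.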

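Both of your example sentences consist entirely of stop words, so to make the example return something, throw in a few random words. Here is an example: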

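from sklearn.feature_extraction.text import TfidfVectorizer

tokenized_sentences = [['this', 'is', 'one', 'cat', 'or', 'dog'],
                       ['this', 'is', 'another', 'dog']]

# ' '.join turns each token list back into a string before tokenization.
tfidf = TfidfVectorizer(preprocessor=' '.join, stop_words='english')
tfidf.fit_transform(tokenized_sentences)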

Returns:

<2x2 sparse matrix of type '<class 'numpy.float64'>'
    with 3 stored elements in Compressed Sparse Row format>

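Features (get_feature_names() was removed in scikit-learn 1.2; on newer versions use get_feature_names_out()):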

>>> tfidf.get_feature_names()
['cat', 'dog']

Update: maybe use a lambda for both the tokenizer and the preprocessor:

tokenized_sentences = [['this', 'is', 'one', 'cat', 'or', 'dog'],
                       ['this', 'is', 'another', 'dog']]

tfidf = TfidfVectorizer(tokenizer=lambda x: x,
                        preprocessor=lambda x: x, stop_words='english')
tfidf.fit_transform(tokenized_sentences)

<2x2 sparse matrix of type '<class 'numpy.float64'>'
    with 3 stored elements in Compressed Sparse Row format>
>>> tfidf.get_feature_names()
['cat', 'dog']

Answer 2: (score: 0)

As @Jarad said, just use a "pass-through" function for the analyzer, but it needs to ignore stop words. You can get stop words from sklearn:

>>> from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

or from nltk:

>>> import nltk
>>> nltk.download('stopwords')
>>> from nltk.corpus import stopwords
>>> stop_words = set(stopwords.words('english'))

or combine the two sets:

stop_words = stop_words.union(ENGLISH_STOP_WORDS)

However, your example contains only stop words (all of its words are in sklearn's ENGLISH_STOP_WORDS set).

Nevertheless, @Jarad's example still works:

>>> tokenized_list_of_sentences =  [
...     ['this', 'is', 'one', 'cat', 'or', 'dog'],
...     ['this', 'is', 'another', 'dog']]
>>> from sklearn.feature_extraction.text import TfidfVectorizer
>>> tfidf = TfidfVectorizer(analyzer=lambda x:[w for w in x if w not in stop_words])
>>> tfidf_vectors = tfidf.fit_transform(tokenized_list_of_sentences)

I like to use pd.DataFrame to browse the TF-IDF vectors:

>>> import pandas as pd
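>>> # vocabulary_ maps term -> column index, so order the labels by that index
>>> pd.DataFrame(tfidf_vectors.toarray(), columns=sorted(tfidf.vocabulary_, key=tfidf.vocabulary_.get))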
        cat       dog 
0  0.814802  0.579739
1  0.000000  1.000000
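One thing to watch when labelling the columns: vocabulary_ is a dict that maps each term to its column index, and the order in which the dict iterates is not guaranteed to match the column order. With 'cat' and 'dog' the two happen to agree. Here is a small sketch with made-up documents (the tokens and variable names are only for illustration) that derives the labels from the indices instead:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Tokens deliberately appear in non-alphabetical order.
docs = [['zebra', 'apple'], ['apple', 'mango']]

# Pass-through analyzer: each document is already a list of tokens.
tfidf = TfidfVectorizer(analyzer=lambda tokens: tokens)
vectors = tfidf.fit_transform(docs)

# Order the terms by their column index so labels line up with the matrix.
columns = sorted(tfidf.vocabulary_, key=tfidf.vocabulary_.get)
dense = vectors.toarray()

print('dict iteration order:', list(tfidf.vocabulary_))
print('column order:        ', columns)
print(dense)
```

The columns list can then be passed as columns= to pd.DataFrame. In the second document the 'zebra' column is zero, which gives a quick check that the labels are aligned.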