Returning term positions in a document with scikit-learn

Date: 2015-10-09 07:36:31

Tags: python scikit-learn

I know that scikit-learn follows the bag-of-words assumption/model for documents. But is there a way to extract term positions while computing tf-idf?

For example, if I have these documents

document1 = "foo bar baz"
document2 = "bar bar baz"

can I somehow get this (a tuple/list of term_ids)

document1_terms = (1, 2, 3)
document2_terms = (2, 2, 3)

or this (a dictionary of terms, with tuples of positions as values)

document1_terms = {1: (1, ), 2: (2, ), 3: (3, )}
document2_terms = {2: (1, 2), 3: (3, )}

2 answers:

Answer 0 (score: 1)

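After some trial and error, I found a way to solve this. First fit the vectorizer, which builds the vocabulary and the term-document counts: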

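from sklearn.feature_extraction.text import CountVectorizer

# collection is assumed to be a pandas DataFrame with a 'document' column
vectorizer = CountVectorizer()
term_doc_freq = vectorizer.fit_transform(collection['document'])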

Then use this to get a tuple of term ids for each document:
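from functools import partial

def document_get_position(row, vectorizer):
    # build_analyzer() applies the same preprocessing (lowercasing by default)
    # and tokenization that were used during fit, so the tokens match the
    # keys of vocabulary_. build_tokenizer() alone skips the lowercasing.
    analyzer = vectorizer.build_analyzer()

    # Tokens that are not in the vocabulary map to None.
    return tuple(vectorizer.vocabulary_.get(token)
                 for token in analyzer(row['document']))

# Apply row by row; each result is the term-id tuple of one document.
positions = collection.apply(partial(document_get_position,
                                     vectorizer=vectorizer),
                             axis=1)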
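
To try the same idea on the two documents from the question, here is a minimal sketch that uses a plain list instead of the `collection` DataFrame, which the answer does not show. The helper name `term_ids` is made up for illustration. CountVectorizer assigns 0-based ids in alphabetical order, so the ids differ from the 1-based ids used in the question.

```python
from sklearn.feature_extraction.text import CountVectorizer

# The two example documents from the question.
documents = ["foo bar baz", "bar bar baz"]

vectorizer = CountVectorizer()
vectorizer.fit(documents)

# The analyzer reproduces the preprocessing and tokenization used during fit.
analyzer = vectorizer.build_analyzer()

def term_ids(text):
    """Return the term id of every token in text, in order of appearance."""
    return tuple(vectorizer.vocabulary_.get(token) for token in analyzer(text))

document_terms = [term_ids(doc) for doc in documents]
print(document_terms)
```

With the default settings the vocabulary is bar=0, baz=1, foo=2, so the first document maps to (2, 0, 1) and the second to (0, 0, 1).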

Answer 1 (score: 0)

Is this what you mean?

In [13]: from sklearn.feature_extraction.text import CountVectorizer

In [14]: vectorize = CountVectorizer(min_df=1)

In [15]: document1 = "foo bar baz"
    ...: document2 = "bar bar baz dee"
    ...: 

In [16]: documents = [document1, document2]

In [17]: d = vectorize.fit_transform(documents)

In [18]: vectorize.vocabulary_
Out[18]: {u'bar': 0, u'baz': 1, u'dee': 2, u'foo': 3}

In [19]: d.todense()
Out[19]: 
matrix([[1, 1, 0, 1],
        [2, 1, 1, 0]], dtype=int64)
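
The matrix above holds per-document counts only, so the order of the terms is lost. As a sketch of the dictionary form asked for in the question, positions can be recovered by tokenizing each document again and looking the tokens up in the vocabulary from Out[18]. The helper name `term_positions` and the whitespace split are assumptions for illustration; neither is part of scikit-learn, and the split only approximates CountVectorizer's tokenizer.

```python
from collections import defaultdict

# Vocabulary copied from Out[18] above.
vocabulary = {'bar': 0, 'baz': 1, 'dee': 2, 'foo': 3}

def term_positions(text, vocabulary):
    """Map each known term id to a tuple of its 1-based positions in text."""
    positions = defaultdict(list)
    # Whitespace splitting is a simplification of CountVectorizer's tokenizer.
    for position, token in enumerate(text.lower().split(), start=1):
        term_id = vocabulary.get(token)
        if term_id is not None:  # skip out-of-vocabulary tokens
            positions[term_id].append(position)
    return {term_id: tuple(found) for term_id, found in positions.items()}

document1_terms = term_positions("foo bar baz", vocabulary)
document2_terms = term_positions("bar bar baz dee", vocabulary)
print(document1_terms)
print(document2_terms)
```

Here "bar" (id 0) occurs at positions 1 and 2 of the second document, so `document2_terms[0]` is `(1, 2)`.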