Python CountVectorizer: presence of terms in documents

Date: 2017-07-29 23:58:06

Tags: python scikit-learn lda countvectorizer

I'm running an LDA analysis with Python. Is there an out-of-the-box way to get, for each word (EDIT: each term of n words), the number of texts in my corpus (a list of text strings) in which it appears?

@titipata's answer gives the word frequencies: How to extract word frequency from document-term matrix?

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
texts = ['hey you', 'you ah ah ah']
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)
freq = np.ravel(X.sum(axis=0))  # total occurrences of each term across all texts

import operator
# vocabulary keys (terms), sorted by their column index in X
vocab = [v[0] for v in sorted(vectorizer.vocabulary_.items(), key=operator.itemgetter(1))]
fdist = dict(zip(vocab, freq))  # same format as nltk

The word frequencies are here:

fdist
{u'ah': 3, u'you': 2, u'hey': 1}

but I want the number of texts each term is present in:

presence
{u'ah': 1, u'you': 2, u'hey': 1}

EDIT: ideally this should also work for terms of n words, which you can define.

I can compute what I want as follows, but is there a faster way to get it from CountVectorizer?

presence = {}
for w in vocab:
    pres = 0
    for t in texts:
        pres += w in set(t.split())  # a bool, so this adds 0 or 1 per text
    presence[w] = pres
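(For what it's worth, a vectorized sketch of what I'm after: CountVectorizer has a binary=True option that clips every cell of the document-term matrix to 0/1, so the column sums become document counts rather than raw word counts. The variable names below are just illustrative.)

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

texts = ['hey you', 'you ah ah ah']

# binary=True clips each count to 0/1, so column sums give the number
# of texts containing each term instead of total occurrences
bin_vectorizer = CountVectorizer(binary=True)
Xb = bin_vectorizer.fit_transform(texts)
doc_counts = np.ravel(Xb.sum(axis=0))

presence = dict(zip(bin_vectorizer.get_feature_names(), doc_counts))
# presence -> {u'ah': 1, u'hey': 1, u'you': 2}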

EDIT: what I just wrote does not work for terms of n words. This works but is slow:

from collections import Counter

counter = Counter()
for t in texts:
    for term in vectorizer.get_feature_names():
        # note: 'term in t' is a substring test, so e.g. 'he' would also match 'hey'
        counter.update({term: term in t})
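(Again only a sketch, but the binary=True idea above should extend to n-word terms, because CountVectorizer can build its vocabulary from n-grams via ngram_range; names below are illustrative.)

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

texts = ['hey you', 'you ah ah ah']

# ngram_range=(1, 2) puts unigrams and bigrams in the vocabulary;
# binary=True again turns column sums into per-text presence counts
ngram_vectorizer = CountVectorizer(ngram_range=(1, 2), binary=True)
Xn = ngram_vectorizer.fit_transform(texts)
ngram_presence = dict(zip(ngram_vectorizer.get_feature_names(),
                          np.ravel(Xn.sum(axis=0))))
# e.g. ngram_presence[u'ah ah'] == 1: 'ah ah' occurs (twice) in one text only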

1 Answer:

Answer 0 (score: 2)

If your corpus isn't too large, this should work well. It also relies only on Python built-ins; see the documentation for Counter.

from collections import Counter

corpus = ['hey you', 'you ah ah ah']
sents = []

for sent in corpus:
    sents.extend(list(set(sent.split())))   # use set to ensure each word counts once per sentence

Counter(sents)

Returns:

Counter({'you': 2, 'ah': 1, 'hey': 1})
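The set() per sentence is what turns raw occurrences into per-document presence: without it, 'ah' would be counted 3 times instead of 1. Note that split() only yields single words, so multi-word terms would still need an n-gram approach such as the ngram_range sketch above.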