Using Scikit-learn CountVectorizer to create ngrams only for words on the same line (ignoring line breaks)

Time: 2014-11-13 10:58:45

Tags: python scikit-learn n-gram


from sklearn.feature_extraction.text import CountVectorizer
import nltk

myString = 'This is a\nmultiline string'

# build the default word analyzer, configured to emit bigrams only
countVectorizer = CountVectorizer(ngram_range=(2,2))
analyzer = countVectorizer.build_analyzer()

# count how often each bigram occurs
listNgramQuery = analyzer(myString)
NgramQueryWeights = nltk.FreqDist(listNgramQuery)
print(NgramQueryWeights.items())



dict_items([('is multiline', 1), ('multiline string', 1), ('this is', 1)])

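As the "is multiline" ngram shows, the analyzer does not care about the line break in the string. (The one-letter word "a" is missing because the default token pattern only matches tokens of two or more characters.)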


The desired output contains only ngrams whose words share a line:

dict_items([('multiline string', 1), ('this is', 1)])
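To see why the line break is ignored, it may help to emulate the default analyzer by hand. The sketch below assumes CountVectorizer's default settings (lowercasing and the token pattern `(?u)\b\w\w+\b`); the helper name `default_like_bigrams` is made up for illustration and is not part of scikit-learn.

```python
import re

# Default CountVectorizer token pattern: tokens of two or more word characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def default_like_bigrams(doc):
    """Rough stand-in for the default word analyzer with ngram_range=(2, 2)."""
    # A newline is just another non-word separator to the regex, so tokens
    # from different lines end up in one flat list.
    tokens = TOKEN_PATTERN.findall(doc.lower())
    return [" ".join(pair) for pair in zip(tokens, tokens[1:])]

bigrams = default_like_bigrams("This is a\nmultiline string")
print(bigrams)
```

Because the tokens are collected before any pairing happens, "is" and "multiline" end up adjacent even though they sit on different lines.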


2 answers:

Answer 0 (score: 3)

You need to override the analyzer, as described in the documentation.

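import re
from sklearn.feature_extraction.text import CountVectorizer

def bigrams_per_line(doc):
    # handle each line separately so no bigram spans a line break
    for ln in doc.split('\n'):
        # tokens are runs of two or more word characters
        terms = re.findall(r'\w{2,}', ln)
        for bigram in zip(terms, terms[1:]):
            yield '%s %s' % bigram

cv = CountVectorizer(analyzer=bigrams_per_line)
cv.fit(['This is a\nmultiline string'])
print(sorted(cv.vocabulary_))
# ['This is', 'multiline string']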
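One thing to keep in mind: when `analyzer` is a callable, CountVectorizer skips its own preprocessing, so tokens are not lowercased (hence `This is` above rather than `this is`). Below is a minimal sketch of a lowercasing variant, using only the standard library, with `collections.Counter` as a stand-in for `nltk.FreqDist`. The name `bigrams_per_line_lower` is hypothetical.

```python
import re
from collections import Counter

def bigrams_per_line_lower(doc):
    """Yield lowercased bigrams that never span a line break."""
    for ln in doc.lower().split('\n'):
        # tokens are runs of two or more word characters
        terms = re.findall(r'\w{2,}', ln)
        for bigram in zip(terms, terms[1:]):
            yield '%s %s' % bigram

counts = Counter(bigrams_per_line_lower('This is a\nmultiline string'))
print(sorted(counts.items()))
```

The same function could be passed as `CountVectorizer(analyzer=bigrams_per_line_lower)` if lowercased features are wanted.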

Answer 1 (score: 2)


from sklearn.feature_extraction.text import CountVectorizer
import re
from itertools import tee, islice

# ngram length range; example values assumed here, adjust as needed
minNgramLength = 2
maxNgramLength = 2

# custom ngram analyzer function, matching only ngrams that belong to the same line
def ngrams_per_line(doc):

    # analyze each line of the input string separately
    for ln in doc.split('\n'):

        # tokenize the current line (customize the regex as desired)
        terms = re.findall(r'(?u)\b\w+\b', ln)

        # loop ngram creation for every number between min and max ngram length
        for ngramLength in range(minNgramLength, maxNgramLength+1):

            # find and return all ngrams
            # for ngram in zip(*[terms[i:] for i in range(ngramLength)]): <-- solution without a generator (works the same but has higher memory usage)
            for ngram in zip(*[islice(seq, i, len(terms)) for i, seq in enumerate(tee(terms, ngramLength))]): # <-- solution using a generator
                ngram = ' '.join(ngram)
                yield ngram


cv = CountVectorizer(analyzer=ngrams_per_line)
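The function above looks up `minNgramLength` and `maxNgramLength` as global names. One possible alternative, sketched here on the assumption that you would rather pass the range explicitly, is a small factory that returns an analyzer. `make_ngrams_per_line` and its parameters are illustrative names, not part of scikit-learn.

```python
import re

def make_ngrams_per_line(min_n, max_n):
    """Return an analyzer yielding ngrams of min_n..max_n words within each line."""
    def analyzer(doc):
        for ln in doc.split('\n'):
            terms = re.findall(r'(?u)\b\w+\b', ln)
            for n in range(min_n, max_n + 1):
                # zip over n staggered slices; it stops at the shortest slice
                for ngram in zip(*[terms[i:] for i in range(n)]):
                    yield ' '.join(ngram)
    return analyzer

analyze = make_ngrams_per_line(1, 2)
result = list(analyze('This is a\nmultiline string'))
print(result)
```

The returned function can be passed as `CountVectorizer(analyzer=analyze)`. Note that this token regex keeps one-letter words such as "a", so "is a" appears among the bigrams.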
