When using the scikit-learn library in Python, I can use CountVectorizer to create ngrams of a desired length (e.g. two words) like this:
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import CountVectorizer
from nltk.collocations import *
from nltk.probability import FreqDist
import nltk
myString = 'This is a\nmultiline string'
countVectorizer = CountVectorizer(ngram_range=(2,2))
analyzer = countVectorizer.build_analyzer()
listNgramQuery = analyzer(myString)
NgramQueryWeights = nltk.FreqDist(listNgramQuery)
print(NgramQueryWeights.items())
This prints:
dict_items([('is multiline', 1), ('multiline string', 1), ('this is', 1)])
As can be seen from the is multiline ngram that gets created (the single-character token a is dropped by the default token pattern), the engine does not care about linebreaks in the string.
How can I modify the ngram creation so that it respects linebreaks in the string and only creates ngrams whose words all belong to the same line of text? My expected output is:
dict_items([('multiline string', 1), ('this is', 1)])
I know that I can modify the tokenizer pattern by passing token_pattern=someRegex to CountVectorizer, and I have read somewhere that the default regex is u'(?u)\\b\\w\\w+\\b'. Still, I think this question is more about the creation of the ngrams than about the tokenizer: the problem is not that tokens are created without respecting the linebreak, but that the ngrams are.
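For illustration, here is a minimal sketch showing that adjusting token_pattern alone does not help: even with the default pattern passed explicitly, a bigram still crosses the linebreak, because the ngrams are assembled from the token stream after the line information is gone.

import re
from sklearn.feature_extraction.text import CountVectorizer

# even with an explicit token_pattern the bigrams cross the linebreak,
# because ngram assembly happens after tokenization
cv = CountVectorizer(ngram_range=(2, 2), token_pattern=r'(?u)\b\w\w+\b')
analyzer = cv.build_analyzer()
print(analyzer('This is a\nmultiline string'))
# ['this is', 'is multiline', 'multiline string']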
Answer 0 (score: 3)
You need to overload the analyzer, as described in the documentation.
import re

def bigrams_per_line(doc):
    # emit bigrams separately for each line of the document
    for ln in doc.split('\n'):
        # tokens of at least two word characters (mirrors the default token pattern)
        terms = re.findall(r'\w{2,}', ln)
        for bigram in zip(terms, terms[1:]):
            yield '%s %s' % bigram

cv = CountVectorizer(analyzer=bigrams_per_line)
cv.fit(['This is a\nmultiline string'])
print(cv.get_feature_names())  # in newer scikit-learn versions, use get_feature_names_out()
# ['This is', 'multiline string']
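As a quick check against the expected output from the question, the same generator can be fed to nltk.FreqDist directly (a minimal sketch reusing myString from the question; note that this analyzer does not lowercase, so the casing differs from the default analyzer):

import nltk

listNgramQuery = list(bigrams_per_line(myString))
NgramQueryWeights = nltk.FreqDist(listNgramQuery)
print(NgramQueryWeights.items())
# dict_items([('This is', 1), ('multiline string', 1)])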
Answer 1 (score: 2)
The accepted answer works fine, but it only finds bigrams (tokens consisting of exactly two words). To generalize this to ngrams (as in the sample code in my question, which uses the ngram_range=(min,max) parameter), the following code can be used:
import re
from itertools import tee, islice
from sklearn.feature_extraction.text import CountVectorizer

# custom ngram analyzer function, matching only ngrams that belong to the same line
def ngrams_per_line(doc):
    # analyze each line of the input string separately
    for ln in doc.split('\n'):
        # tokenize the input string (customize the regex as desired)
        terms = re.findall(u'(?u)\\b\\w+\\b', ln)
        # loop ngram creation for every length between min and max ngram length
        for ngramLength in range(minNgramLength, maxNgramLength + 1):
            # find and return all ngrams of the current length
            # for ngram in zip(*[terms[i:] for i in range(ngramLength)]):  # <-- solution without generators (same result, higher memory usage)
            for ngram in zip(*[islice(seq, i, len(terms)) for i, seq in enumerate(tee(terms, ngramLength))]):  # <-- solution using generators
                yield ' '.join(ngram)
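To see what the tee/islice line does, here is the same sliding-window idiom in isolation (a minimal sketch with made-up sample tokens): tee creates n independent copies of the token iterator, islice shifts the i-th copy i positions forward, and zip then pairs each token with its n-1 successors.

from itertools import tee, islice

terms = ['this', 'is', 'multiline']
n = 2
# shift the i-th copy of the iterator by i positions, then zip the copies
windows = zip(*[islice(seq, i, len(terms)) for i, seq in enumerate(tee(terms, n))])
print([' '.join(w) for w in windows])
# ['this is', 'is multiline']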
Then use the custom analyzer as the parameter to CountVectorizer:
cv = CountVectorizer(analyzer=ngrams_per_line)
Make sure that minNgramLength and maxNgramLength are defined in such a way that the ngrams_per_line function knows about them (e.g. by declaring them as global variables), since they cannot be passed to it as arguments (at least I do not know how).