Question

I want to generate n-grams, but letter by letter instead of word by word.

Normal n-grams:

sentence : He want to watch football match

result:
he, he want, want, want to , to , to watch , watch , watch football , football, football match, match

I want to do the same thing, but letter by letter:

word : Angela 

result:
a, an, n , ng , g , ge, e ,el, l , la ,a

This is my code using sklearn, but it still works word by word rather than letter by letter:

from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(ngram_range=(1, 100),token_pattern = r"(?u)\b\w+\b")

corpus = ['Angel','Angelica','John','Johnson']

X = vectorizer.fit_transform(corpus)
analyze = vectorizer.build_analyzer()
print(vectorizer.get_feature_names())
print(vectorizer.transform(['Angela']).toarray())

Answer 1

There is an 'analyzer' parameter that does what you need.

According to the documentation:

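analyzer : string, {'word', 'char', 'char_wb'} or callable

Whether the feature should be made of word or character n-grams. Option 'char_wb' creates character n-grams only from text inside word boundaries; n-grams at the edges of words are padded with space.

If a callable is passed it is used to extract the sequence of features out of the raw, unprocessed input.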

By default, analyzer is set to 'word', but you can change it.
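The callable option mentioned in the documentation can be sketched in plain Python. The function below is hypothetical (the name interleaved_char_ngrams is not part of sklearn) and assumes you want unigrams and bigrams in reading order, exactly as in the 'Angela' example from the question:

```python
def interleaved_char_ngrams(text):
    """Return character unigrams and bigrams in reading order.

    Hypothetical helper for illustration; not part of sklearn.
    """
    text = text.lower()
    grams = []
    for i, ch in enumerate(text):
        grams.append(ch)                 # unigram at position i
        if i + 1 < len(text):
            grams.append(text[i:i + 2])  # bigram starting at position i
    return grams


print(interleaved_char_ngrams('Angela'))
# ['a', 'an', 'n', 'ng', 'g', 'ge', 'e', 'el', 'l', 'la', 'a']
```

Since analyzer accepts a callable, passing analyzer=interleaved_char_ngrams to CountVectorizer should make it count these features. For counting purposes the order does not matter, so the built-in 'char' analyzer is usually enough.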

Just do:

vectorizer = CountVectorizer(ngram_range=(1, 100),
                             token_pattern = r"(?u)\b\w+\b", 
                             analyzer='char')
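To check what the character analyzer extracts, here is a minimal sketch using the corpus from the question. It assumes ngram_range=(1, 2) to match the unigrams and bigrams in the 'Angela' example, and it leaves out token_pattern, which sklearn only uses when analyzer='word':

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = ['Angel', 'Angelica', 'John', 'Johnson']

# analyzer='char' splits each document into character n-grams.
# ngram_range=(1, 2) is an assumption chosen to match the question's example.
vectorizer = CountVectorizer(analyzer='char', ngram_range=(1, 2))
X = vectorizer.fit_transform(corpus)

# The analyzer shows which n-grams are extracted from a single string
# (input is lowercased by default).
analyze = vectorizer.build_analyzer()
ngrams = analyze('Angela')
print(ngrams)

# Count vector for an unseen word; n-grams that are not in the fitted
# vocabulary (such as 'la' here) are ignored.
counts = vectorizer.transform(['Angela']).toarray()
print(counts.shape)
```

The analyzer returns the same set of n-grams as in the question, but grouped by length (all single letters first, then all pairs) rather than interleaved.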

N-grams of letters in sklearn
