I am trying to use CountVectorizer to generate bigrams and append them back to the dataframe. However, it only gives me unigrams as output. I want to create bigrams only when specific keywords are present, and I am passing those keywords via the vocabulary parameter. What I want to achieve is to eliminate all other words from the text corpus so that only the n-grams in my vocabulary list appear.
Input data
Id Name
1 Industrial Floor chenidsd 34
2 Industrial Floor room 345
3 Central District 46
4 Central Industrial District Bay
5 Chinese District Bay
6 Bay Chinese xrty
7 Industrial Floor chenidsd 34
8 Industrial Floor room 345
9 Central District 46
10 Central Industrial District Bay
11 Chinese District Bay
12 Bay Chinese dffefef
13 Industrial Floor chenidsd 34
14 Industrial Floor room 345
15 Central District 46
16 Central Industrial District Bay
17 Chinese District Bay
18 Bay Chinese grty
NLTK
import string
import nltk

new_stop_words = nltk.corpus.stopwords.words('english')
# Lowercase, then strip digit and punctuation characters, then drop stopwords
Nata['Clean_Name'] = Nata['Name'].apply(lambda x: ' '.join([item.lower() for item in x.split()]))
Nata['Clean_Name'] = Nata['Clean_Name'].apply(lambda x: ''.join([item for item in x if not item.isdigit()]))
Nata['Clean_Name'] = Nata['Clean_Name'].apply(lambda x: ''.join([item for item in x if item not in string.punctuation]))
Nata['Clean_Name'] = Nata['Clean_Name'].apply(lambda x: ' '.join([item for item in x.split() if item not in new_stop_words]))
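For example, a single row goes through these cleaning steps like this (a sketch of the same logic applied to one string):

x = 'Industrial Floor chenidsd 34'
x = ' '.join(item.lower() for item in x.split())                    # 'industrial floor chenidsd 34'
x = ''.join(item for item in x if not item.isdigit())               # 'industrial floor chenidsd '
x = ''.join(item for item in x if item not in string.punctuation)   # unchanged, no punctuation here
x = ' '.join(item for item in x.split() if item not in new_stop_words)
print(x)                                                            # 'industrial floor chenidsd'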
词汇定义
english_corpus=['bay','central','chinese','district', 'floor','industrial','room']
Bigram Generator
cv = CountVectorizer(max_features=200, analyzer='word', vocabulary=english_corpus, ngram_range=(2, 2))
cv_addr = cv.fit_transform(Nata.pop('Clean_Name'))
for i, col in enumerate(cv.get_feature_names()):
    Nata[col] = pd.SparseSeries(cv_addr[:, i].toarray().ravel(), fill_value=0)
However, it only gives me unigrams as output. How can I fix this?
Output
In [26]: Nata.columns.tolist()
Out[26]:
['Id',
'Name',
'bay',
'central',
'chinese',
'district',
'floor',
'industrial',
'room']
Answer 0 (score: 2)
from io import StringIO
from string import punctuation
import pandas as pd
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer
text = """Industrial Floor
Industrial Floor room
Central District
Central Industrial District Bay
Chinese District Bay
Bay Chinese
Industrial Floor
Industrial Floor room
Central District"""
stoplist = stopwords.words('english') + list(punctuation)
df = pd.read_csv(StringIO(text), sep='\t', names=['Text'])
vectorizer = CountVectorizer(analyzer='word', ngram_range=(1,2), stop_words=stoplist)
vectorizer.fit_transform(df['Text'])
vectorizer.get_feature_names()
See Basic NLP with NLTK to understand how it automatically lowercases, "tokenizes", and removes stopwords.
[OUT]:
['bay',
'bay chinese',
'central',
'central district',
'central industrial',
'chinese',
'chinese district',
'district',
'district bay',
'floor',
'floor room',
'industrial',
'industrial district',
'industrial floor',
'room']
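You can inspect exactly what CountVectorizer does to each row by calling build_analyzer(), and if you want the bigrams alone, drop the unigrams from the ngram_range. A small sketch reusing the vectorizer, stoplist, and df defined above (token order may vary slightly across scikit-learn versions):

analyzer = vectorizer.build_analyzer()
print(analyzer('Central Industrial District Bay'))
# e.g. ['central', 'industrial', 'district', 'bay',
#       'central industrial', 'industrial district', 'district bay']

bigrams_only = CountVectorizer(analyzer='word', ngram_range=(2, 2), stop_words=stoplist)
bigrams_only.fit_transform(df['Text'])
print(bigrams_only.get_feature_names())
# e.g. ['bay chinese', 'central district', 'central industrial', 'chinese district',
#       'district bay', 'floor room', 'industrial district', 'industrial floor']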
Alternatively, to filter out stopwords and digits while generating the bigrams, you can write your own function and pass it in through the analyzer parameter:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from io import StringIO
from string import punctuation
from nltk import ngrams
from nltk import word_tokenize
from nltk.corpus import stopwords
stoplist = stopwords.words('english') + list(punctuation)
def preprocess(text):
    # Build bigrams from the lowercased tokens and keep only those where
    # neither token is a stopword, punctuation, or a pure digit string.
    return [' '.join(ng) for ng in ngrams(word_tokenize(text.lower()), 2)
            if not any(word in stoplist or word.isdigit() for word in ng)]
text = """Industrial Floor
Industrial Floor room
Central District
Central Industrial District Bay
Chinese District Bay
Bay Chinese
Industrial Floor
Industrial Floor room
Central District"""
df = pd.read_csv(StringIO(text), sep='\t', names=['Text'])
vectorizer = CountVectorizer(analyzer=preprocess)
vectorizer.fit_transform(df['Text'])
vectorizer.get_feature_names()
[OUT]:
['bay',
'bay chinese',
'central',
'central district',
'central industrial',
'chinese',
'chinese district',
'district',
'district bay',
'floor',
'floor room',
'industrial',
'industrial district',
'industrial floor',
'room']
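As a quick sanity check on the original data: the function keeps any bigram whose two tokens are neither stopwords nor digits, so an unknown token like chenidsd survives while anything touching a number is dropped:

print(preprocess('Industrial Floor chenidsd 34'))
# ['industrial floor', 'floor chenidsd'] -- 'chenidsd 34' is dropped
print(preprocess('Central District 46'))
# ['central district'] -- 'district 46' is dropped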
You have misunderstood the meaning of the vocabulary parameter in CountVectorizer.
From the documentation:
vocabulary : Mapping or iterable, optional
Either a Mapping (e.g., a dict) where keys are terms and values are indices in the feature matrix, or an iterable over terms. If not given, a vocabulary is determined from the input documents. Indices in the mapping should not be repeated and should not have any gap between 0 and the largest index.
This means that only the terms in the vocabulary will be considered as feature names. If you want bigrams in your feature set, then you need to put bigrams in your vocabulary. It does not generate the ngrams first and then check whether those ngrams contain only words from your vocabulary.
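A minimal sketch of that behaviour, reusing the df built above: with a unigram-only vocabulary and ngram_range=(2, 2), the analyzer emits only bigrams, none of which can ever match a unigram term, so every count is zero:

cv = CountVectorizer(vocabulary=['bay', 'central', 'chinese'], ngram_range=(2, 2))
X = cv.fit_transform(df['Text'])
print(cv.get_feature_names())  # ['bay', 'central', 'chinese'] -- just the vocabulary back
print(X.sum())                 # 0 -- no bigram ever equals a unigram term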
In code, you can see that if you add bigrams to your vocabulary, then they will show up in feature_names():
from io import StringIO
from string import punctuation
import pandas as pd
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer
text = """Industrial Floor
Industrial Floor room
Central District
Central Industrial District Bay
Chinese District Bay
Bay Chinese
Industrial Floor
Industrial Floor room
Central District"""
english_corpus = ['bay chinese', 'central district', 'chinese', 'district', 'floor', 'industrial', 'room']
df = pd.read_csv(StringIO(text), sep='\t', names=['Text'])
vectorizer = CountVectorizer(analyzer='word', ngram_range=(1, 2), vocabulary=english_corpus)
vectorizer.fit_transform(df['Text'])
vectorizer.get_feature_names()
[OUT]:
['bay chinese',
'central district',
'chinese',
'district',
'floor',
'industrial',
'room']
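To then get back to the original goal of appending these counts to the dataframe, one possible sketch (note that pd.SparseSeries, used in the question, is deprecated in newer pandas, which offers DataFrame.sparse.from_spmatrix instead; this requires pandas >= 0.25):

X = vectorizer.fit_transform(df['Text'])
counts = pd.DataFrame.sparse.from_spmatrix(X, columns=vectorizer.get_feature_names(), index=df.index)
df = df.join(counts)
print(df.head())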
One possible solution: write your own analyzer that generates the ngrams and checks that each generated ngram contains only words you want to keep, e.g. the preprocess function shown above.