Bi-grams not generated when using the vocabulary parameter in CountVectorizer

Asked: 2017-12-13 08:18:20

Tags: python pandas scikit-learn nltk

I am trying to generate bigrams with CountVectorizer and append them back to the dataframe. However, it only gives me unigrams as output. I want to create bigrams only when specific keywords are present, so I am passing them in via the vocabulary parameter.

What I want to achieve is to eliminate all other words from the text corpus, so that only the n-grams built from my vocabulary list appear.

Input data

    Id  Name
    1   Industrial  Floor chenidsd 34
    2   Industrial  Floor room   345
    3   Central District    46
    4   Central Industrial District  Bay
    5   Chinese District Bay
    6   Bay Chinese xrty
    7   Industrial  Floor chenidsd 34
    8   Industrial  Floor room   345
    9   Central District    46
    10  Central Industrial District  Bay
    11  Chinese District Bay
    12  Bay Chinese dffefef
    13  Industrial  Floor chenidsd 34
    14  Industrial  Floor room   345
    15  Central District    46
    16  Central Industrial District  Bay
    17  Chinese District Bay
    18  Bay Chinese grty

NLTK preprocessing

import string
import nltk

new_stop_words = nltk.corpus.stopwords.words('english')  # English stopwords

Nata['Clean_Name'] = Nata['Name'].apply(lambda x: ' '.join([item.lower() for item in x.split()]))
Nata['Clean_Name'] = Nata['Clean_Name'].apply(lambda x: ''.join([item for item in x if not item.isdigit()]))
Nata['Clean_Name'] = Nata['Clean_Name'].apply(lambda x: ''.join([item for item in x if item not in string.punctuation]))
Nata['Clean_Name'] = Nata['Clean_Name'].apply(lambda x: ' '.join([item for item in x.split() if item not in new_stop_words]))

Vocabulary definition

english_corpus = ['bay', 'central', 'chinese', 'district', 'floor', 'industrial', 'room']

Bigram Generator

cv = CountVectorizer(max_features=200, analyzer='word', vocabulary=english_corpus, ngram_range=(2, 2))
cv_addr = cv.fit_transform(Nata.pop('Clean_Name'))
for i, col in enumerate(cv.get_feature_names()):
    Nata[col] = pd.SparseSeries(cv_addr[:, i].toarray().ravel(), fill_value=0)

However, it only gives me unigrams as output. How can I fix this?

Output

In [26]: Nata.columns.tolist()
Out[26]:

['Id',
 'Name',
 'bay',
 'central',
 'chinese',
 'district',
 'floor',
 'industrial',
 'room']

1 Answer:

Answer 0 (score: 2)

TL;DR

from io import StringIO
from string import punctuation

import pandas as pd
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer

text = """Industrial  Floor
Industrial  Floor room
Central District
Central Industrial District  Bay
Chinese District Bay
Bay Chinese
Industrial  Floor
Industrial  Floor room
Central District"""

stoplist = stopwords.words('english') + list(punctuation)

df = pd.read_csv(StringIO(text), sep='\t', names=['Text'])
vectorizer = CountVectorizer(analyzer='word', ngram_range=(1,2), stop_words=stoplist)
vectorizer.fit_transform(df['Text'])

vectorizer.get_feature_names()

See Basic NLP with NLTK to understand how CountVectorizer automatically lowercases, tokenizes, and removes stopwords; a quick probe of that pipeline is shown after the output below.

[OUT]:

['bay',
 'bay chinese',
 'central',
 'central district',
 'central industrial',
 'chinese',
 'chinese district',
 'district',
 'district bay',
 'floor',
 'floor room',
 'industrial',
 'industrial district',
 'industrial floor',
 'room']
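
As a quick illustration (a probe added here, not part of the original answer), build_analyzer() exposes the exact pipeline this CountVectorizer applies to each document, so you can inspect what happens to a single string:

analyzer = vectorizer.build_analyzer()
# lowercases, tokenizes, drops stopwords, then emits 1- and 2-grams
print(analyzer("The Industrial Floor room"))
# ['industrial', 'floor', 'room', 'industrial floor', 'floor room']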

If the n-gram generation belongs in a preprocessing step, simply override the analyzer parameter:

import pandas as pd

from sklearn.feature_extraction.text import CountVectorizer


from io import StringIO
from string import punctuation

from nltk import ngrams
from nltk import word_tokenize
from nltk.corpus import stopwords

stoplist = stopwords.words('english') + list(punctuation)

def preprocess(text):
    # build bigrams over the lowercased tokens, keeping a bigram only when
    # none of its words is a stopword, punctuation mark, or digit string
    return [' '.join(ng) for ng in ngrams(word_tokenize(text.lower()), 2)
            if not any(word in stoplist or word.isdigit() for word in ng)]

text = """Industrial  Floor
Industrial  Floor room
Central District
Central Industrial District  Bay
Chinese District Bay
Bay Chinese
Industrial  Floor
Industrial  Floor room
Central District"""

df = pd.read_csv(StringIO(text), sep='\t', names=['Text'])


vectorizer = CountVectorizer(analyzer=preprocess)
vectorizer.fit_transform(df['Text'])    
vectorizer.get_feature_names()

[OUT]:

['bay',
 'bay chinese',
 'central',
 'central district',
 'central industrial',
 'chinese',
 'chinese district',
 'district',
 'district bay',
 'floor',
 'floor room',
 'industrial',
 'industrial district',
 'industrial floor',
 'room']
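
A caveat worth noting (an observation about CountVectorizer, not part of the original answer): when analyzer is a callable, CountVectorizer hands each raw document straight to it and skips its own lowercasing, tokenization, stop-word handling, and ngram_range logic, so those parameters are silently ignored:

# ngram_range and stop_words have no effect once `analyzer` is a callable;
# `preprocess` alone decides which tokens (here, bigrams) come out.
vectorizer = CountVectorizer(analyzer=preprocess, ngram_range=(1, 1), stop_words='english')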

You have misunderstood the meaning of CountVectorizer's vocabulary parameter.

From the documentation:

    vocabulary : Mapping or iterable, optional

        Either a Mapping (e.g., a dict) where keys are terms and values
        are indices in the feature matrix, or an iterable over terms. If
        not given, a vocabulary is determined from the input documents.
        Indices in the mapping should not be repeated and should not
        have any gap between 0 and the largest index.

This means that only the terms in your vocabulary will appear as feature names. If you want bigrams in your feature set, then you need bigrams in your vocabulary.

It does not generate n-grams and then check whether those n-grams contain only words from your vocabulary.
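
To see this concretely, here is a minimal sketch (an illustration added here, reusing the df built above and the unigram-only list from the question): with a unigram vocabulary and ngram_range=(2, 2), the analyzer emits bigrams only, none of which can ever match a unigram term, so the feature names stay unigrams and every count is zero:

# assumption: `df` as defined in the snippets above
unigram_vocab = ['bay', 'central', 'chinese', 'district', 'floor', 'industrial', 'room']
cv = CountVectorizer(analyzer='word', ngram_range=(2, 2), vocabulary=unigram_vocab)
X = cv.fit_transform(df['Text'])
print(cv.get_feature_names())  # still just the unigrams from the vocabulary
print(X.sum())                 # 0 -- no bigram ever matches a unigram term

This is exactly the symptom in the question: the columns exist, but no bigram is ever counted.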

In code, you can see that if you add bigrams to the vocabulary, they show up in feature_names():

from io import StringIO
from string import punctuation

import pandas as pd
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer

text = """Industrial  Floor
Industrial  Floor room
Central District
Central Industrial District  Bay
Chinese District Bay
Bay Chinese
Industrial  Floor
Industrial  Floor room
Central District"""

english_corpus = ['bay chinese', 'central district', 'chinese', 'district', 'floor', 'industrial', 'room']

df = pd.read_csv(StringIO(text), sep='\t', names=['Text'])
vectorizer = CountVectorizer(analyzer='word', ngram_range=(1,2),vocabulary=english_corpus)
vectorizer.fit_transform(df['Text'])

vectorizer.get_feature_names()

[OUT]:

['bay chinese',
 'central district',
 'chinese',
 'district',
 'floor',
 'industrial',
 'room']

So how do I get bigrams into my feature names from a list of individual words (unigrams)?

One possible solution: you have to write your own analyzer that does the n-gram generation and checks whether the generated n-grams are within the list of words you want to keep, e.g.:

import pandas as pd

from sklearn.feature_extraction.text import CountVectorizer


from io import StringIO
from string import punctuation

from nltk import ngrams
from nltk import word_tokenize
from nltk.corpus import stopwords

stoplist = stopwords.words('english') + list(punctuation)

def preprocess(text):
    # build bigrams over the lowercased tokens, keeping a bigram only when
    # none of its words is a stopword, punctuation mark, or digit string
    return [' '.join(ng) for ng in ngrams(word_tokenize(text.lower()), 2)
            if not any(word in stoplist or word.isdigit() for word in ng)]

text = """Industrial  Floor
Industrial  Floor room
Central District
Central Industrial District  Bay
Chinese District Bay
Bay Chinese
Industrial  Floor
Industrial  Floor room
Central District"""

df = pd.read_csv(StringIO(text), sep='\t', names=['Text'])


vectorizer = CountVectorizer(analyzer=preprocess)
vectorizer.fit_transform(df['Text'])    
vectorizer.get_feature_names()
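
For completeness, a variant sketched here (not code from the original answer) that matches what the question literally asked for: keep a bigram only when both of its words appear in the unigram keep-list, then attach the counts back onto the dataframe. The names keep and keep_analyzer are illustrative:

# the unigram keep-list from the question
keep = {'bay', 'central', 'chinese', 'district', 'floor', 'industrial', 'room'}

def keep_analyzer(text):
    # emit a bigram only when both of its words are in the keep-list
    return [' '.join(ng) for ng in ngrams(word_tokenize(text.lower()), 2)
            if all(word in keep for word in ng)]

vectorizer = CountVectorizer(analyzer=keep_analyzer)
counts = vectorizer.fit_transform(df['Text'])

# join the bigram counts back onto the dataframe
bow = pd.DataFrame(counts.toarray(), index=df.index,
                   columns=vectorizer.get_feature_names())
df = df.join(bow)
print(df.columns.tolist())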