Bi-grams not generated when using the vocabulary parameter in CountVectorizer

Asked: 2017-12-13 08:18:20

Tags: python pandas scikit-learn nltk

I am trying to generate bigrams with CountVectorizer and append them back to the dataframe. However, it only gives me unigrams as output. I want to create bigrams only when specific keywords are present, so I am passing them in via the vocabulary parameter.

What I want to achieve is to eliminate all other words from the text corpus, so that only the n-grams built from my vocabulary list appear.

Input data

    Id  Name
    1   Industrial  Floor chenidsd 34
    2   Industrial  Floor room   345
    3   Central District    46
    4   Central Industrial District  Bay
    5   Chinese District Bay
    6   Bay Chinese xrty
    7   Industrial  Floor chenidsd 34
    8   Industrial  Floor room   345
    9   Central District    46
    10  Central Industrial District  Bay
    11  Chinese District Bay
    12  Bay Chinese dffefef
    13  Industrial  Floor chenidsd 34
    14  Industrial  Floor room   345
    15  Central District    46
    16  Central Industrial District  Bay
    17  Chinese District Bay
    18  Bay Chinese grty

NLTK preprocessing

import string
import nltk

new_stop_words = nltk.corpus.stopwords.words('english')  # English stopwords

Nata['Clean_Name'] = Nata['Name'].apply(lambda x: ' '.join([item.lower() for item in x.split()]))
Nata['Clean_Name'] = Nata['Clean_Name'].apply(lambda x: ''.join([item for item in x if not item.isdigit()]))
Nata['Clean_Name'] = Nata['Clean_Name'].apply(lambda x: ''.join([item for item in x if item not in string.punctuation]))
Nata['Clean_Name'] = Nata['Clean_Name'].apply(lambda x: ' '.join([item for item in x.split() if item not in new_stop_words]))

Vocabulary definition

english_corpus = ['bay', 'central', 'chinese', 'district', 'floor', 'industrial', 'room']

Bigram Generator

cv = CountVectorizer(max_features=200, analyzer='word', vocabulary=english_corpus, ngram_range=(2, 2))
cv_addr = cv.fit_transform(Nata.pop('Clean_Name'))
for i, col in enumerate(cv.get_feature_names()):
    Nata[col] = pd.SparseSeries(cv_addr[:, i].toarray().ravel(), fill_value=0)

However, it only gives me unigrams as output. How can I fix this?

Output

In [26]: Nata.columns.tolist()
Out[26]:

['Id',
 'Name',
 'bay',
 'central',
 'chinese',
 'district',
 'floor',
 'industrial',
 'room']

1 Answer:

Answer 0 (score: 2)

TL;DR

from io import StringIO
from string import punctuation

import pandas as pd
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer

text = """Industrial  Floor
Industrial  Floor room
Central District
Central Industrial District  Bay
Chinese District Bay
Bay Chinese
Industrial  Floor
Industrial  Floor room
Central District"""

stoplist = stopwords.words('english') + list(punctuation)

df = pd.read_csv(StringIO(text), sep='\t', names=['Text'])
vectorizer = CountVectorizer(analyzer='word', ngram_range=(1,2), stop_words=stoplist)
vectorizer.fit_transform(df['Text'])

vectorizer.get_feature_names()

See Basic NLP with NLTK to understand how CountVectorizer automatically lowercases, tokenizes, and removes stopwords; a quick probe of that pipeline is shown after the output below.

[OUT]:

['bay',
 'bay chinese',
 'central',
 'central district',
 'central industrial',
 'chinese',
 'chinese district',
 'district',
 'district bay',
 'floor',
 'floor room',
 'industrial',
 'industrial district',
 'industrial floor',
 'room']
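
As a quick illustration (a probe added here, not part of the original answer), build_analyzer() exposes the exact pipeline this CountVectorizer applies to each document, so you can inspect what happens to a single string:

analyzer = vectorizer.build_analyzer()
# lowercases, tokenizes, drops stopwords, then emits 1- and 2-grams
print(analyzer("The Industrial Floor room"))
# ['industrial', 'floor', 'room', 'industrial floor', 'floor room']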

If the n-gram generation belongs in a preprocessing step, simply override the analyzer parameter:

import pandas as pd

from sklearn.feature_extraction.text import CountVectorizer


from io import StringIO
from string import punctuation

from nltk import ngrams
from nltk import word_tokenize
from nltk.corpus import stopwords

stoplist = stopwords.words('english') + list(punctuation)

def preprocess(text):
    # build bigrams over the lowercased tokens, keeping a bigram only when
    # none of its words is a stopword, punctuation mark, or digit string
    return [' '.join(ng) for ng in ngrams(word_tokenize(text.lower()), 2)
            if not any(word in stoplist or word.isdigit() for word in ng)]

text = """Industrial  Floor
Industrial  Floor room
Central District
Central Industrial District  Bay
Chinese District Bay
Bay Chinese
Industrial  Floor
Industrial  Floor room
Central District"""

df = pd.read_csv(StringIO(text), sep='\t', names=['Text'])


vectorizer = CountVectorizer(analyzer=preprocess)
vectorizer.fit_transform(df['Text'])    
vectorizer.get_feature_names()

[OUT]:

['bay',
 'bay chinese',
 'central',
 'central district',
 'central industrial',
 'chinese',
 'chinese district',
 'district',
 'district bay',
 'floor',
 'floor room',
 'industrial',
 'industrial district',
 'industrial floor',
 'room']
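
A caveat worth noting (an observation about CountVectorizer, not part of the original answer): when analyzer is a callable, CountVectorizer hands each raw document straight to it and skips its own lowercasing, tokenization, stop-word handling, and ngram_range logic, so those parameters are silently ignored:

# ngram_range and stop_words have no effect once `analyzer` is a callable;
# `preprocess` alone decides which tokens (here, bigrams) come out.
vectorizer = CountVectorizer(analyzer=preprocess, ngram_range=(1, 1), stop_words='english')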

You have misunderstood the meaning of CountVectorizer's vocabulary parameter.

From the documentation:

    vocabulary : Mapping or iterable, optional

        Either a Mapping (e.g., a dict) where keys are terms and values
        are indices in the feature matrix, or an iterable over terms. If
        not given, a vocabulary is determined from the input documents.
        Indices in the mapping should not be repeated and should not
        have any gap between 0 and the largest index.

This means that only the terms in your vocabulary will appear as feature names. If you want bigrams in your feature set, then you need bigrams in your vocabulary.

It does not generate n-grams and then check whether those n-grams contain only words from your vocabulary.
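
To see this concretely, here is a minimal sketch (an illustration added here, reusing the df built above and the unigram-only list from the question): with a unigram vocabulary and ngram_range=(2, 2), the analyzer emits bigrams only, none of which can ever match a unigram term, so the feature names stay unigrams and every count is zero:

# assumption: `df` as defined in the snippets above
unigram_vocab = ['bay', 'central', 'chinese', 'district', 'floor', 'industrial', 'room']
cv = CountVectorizer(analyzer='word', ngram_range=(2, 2), vocabulary=unigram_vocab)
X = cv.fit_transform(df['Text'])
print(cv.get_feature_names())  # still just the unigrams from the vocabulary
print(X.sum())                 # 0 -- no bigram ever matches a unigram term

This is exactly the symptom in the question: the columns exist, but no bigram is ever counted.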

In code, you can see that if you add bigrams to the vocabulary, they show up in feature_names():

from io import StringIO
from string import punctuation

import pandas as pd
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer

text = """Industrial  Floor
Industrial  Floor room
Central District
Central Industrial District  Bay
Chinese District Bay
Bay Chinese
Industrial  Floor
Industrial  Floor room
Central District"""

english_corpus = ['bay chinese', 'central district', 'chinese', 'district', 'floor', 'industrial', 'room']

df = pd.read_csv(StringIO(text), sep='\t', names=['Text'])
vectorizer = CountVectorizer(analyzer='word', ngram_range=(1,2),vocabulary=english_corpus)
vectorizer.fit_transform(df['Text'])

vectorizer.get_feature_names()

[OUT]:

['bay chinese',
 'central district',
 'chinese',
 'district',
 'floor',
 'industrial',
 'room']

So how do I get bigrams into my feature names from a list of individual words (unigrams)?

One possible solution: you have to write your own analyzer that does the n-gram generation and checks whether the generated n-grams are within the list of words you want to keep, e.g.:

import pandas as pd

from sklearn.feature_extraction.text import CountVectorizer


from io import StringIO
from string import punctuation

from nltk import ngrams
from nltk import word_tokenize
from nltk.corpus import stopwords

stoplist = stopwords.words('english') + list(punctuation)

def preprocess(text):
    # build bigrams over the lowercased tokens, keeping a bigram only when
    # none of its words is a stopword, punctuation mark, or digit string
    return [' '.join(ng) for ng in ngrams(word_tokenize(text.lower()), 2)
            if not any(word in stoplist or word.isdigit() for word in ng)]

text = """Industrial  Floor
Industrial  Floor room
Central District
Central Industrial District  Bay
Chinese District Bay
Bay Chinese
Industrial  Floor
Industrial  Floor room
Central District"""

df = pd.read_csv(StringIO(text), sep='\t', names=['Text'])


vectorizer = CountVectorizer(analyzer=preprocess)
vectorizer.fit_transform(df['Text'])    
vectorizer.get_feature_names()
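
For completeness, a variant sketched here (not code from the original answer) that matches what the question literally asked for: keep a bigram only when both of its words appear in the unigram keep-list, then attach the counts back onto the dataframe. The names keep and keep_analyzer are illustrative:

# the unigram keep-list from the question
keep = {'bay', 'central', 'chinese', 'district', 'floor', 'industrial', 'room'}

def keep_analyzer(text):
    # emit a bigram only when both of its words are in the keep-list
    return [' '.join(ng) for ng in ngrams(word_tokenize(text.lower()), 2)
            if all(word in keep for word in ng)]

vectorizer = CountVectorizer(analyzer=keep_analyzer)
counts = vectorizer.fit_transform(df['Text'])

# join the bigram counts back onto the dataframe
bow = pd.DataFrame(counts.toarray(), index=df.index,
                   columns=vectorizer.get_feature_names())
df = df.join(bow)
print(df.columns.tolist())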