I'm currently writing code to extract the most common words from a csv file, and it worked fine until it produced the small plot, which lists words I've never seen. I don't know why, maybe because some foreign words are mixed into the data, but I don't know how to fix it.
import numpy as np
import pandas as pd
from sklearn import preprocessing
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import train_test_split, KFold
from nltk.corpus import stopwords
from nltk.stem.snowball import SnowballStemmer
import matplotlib
from matplotlib import pyplot as plt
import sys
sys.setrecursionlimit(100000)
# import seaborn as sns
%matplotlib inline
%config InlineBackend.figure_format = 'retina'
data = pd.read_csv("C:\\Users\\Administrator\\Desktop\\nlp_dataset\\commitment.csv", encoding='cp1252',na_values=" NaN")
data.shape
data['text'] = data['text'].fillna('none')
def remove_punctuation(text):
    '''a function for removing punctuation'''
    import string
    # replacing the punctuations with no space,
    # which in effect deletes the punctuation marks
    translator = str.maketrans('', '', string.punctuation)
    # return the text stripped of punctuation marks
    return text.translate(translator)
# apply the function to each example
data['text'] = data['text'].apply(remove_punctuation)
data.head(10)
#Removing stopwords -- extract the stopwords
#extracting the stopwords from nltk library
sw = stopwords.words('english')
#displaying the stopwords
np.array(sw)
# function to remove stopwords
def remove_stopwords(text):
    '''a function for removing stopwords'''
    # removing the stop words and lowercasing the selected words
    text = [word.lower() for word in text.split() if word.lower() not in sw]
    # joining the list of words with a space separator
    return " ".join(text)
# apply the function to each example
data['text'] = data['text'].apply(remove_stopwords)
data.head(10)
# Top words before stemming
# create a count vectorizer object
count_vectorizer = CountVectorizer()
# fit the count vectorizer using the text data
count_vectorizer.fit(data['text'])
# collect the vocabulary items used in the vectorizer
dictionary = count_vectorizer.vocabulary_.items()
# store the vocab and counts in lists
vocab = []
count = []
# iterate through the vocabulary and append each word and its value to the lists
for key, value in dictionary:
    vocab.append(key)
    count.append(value)
# store the values in a pandas Series with vocab as the index
vocab_bef_stem = pd.Series(count, index=vocab)
# sort the series
vocab_bef_stem = vocab_bef_stem.sort_values(ascending=False)
# Bar plot of top words before stemming
top_vocab = vocab_bef_stem.head(20)
top_vocab.plot(kind = 'barh', figsize=(5,10), xlim = (1000, 5000))
I want a bar chart of the most common words, but right now it only shows non-English words, all with exactly the same frequency. Please help.
Answer 0 (score: 0)
The problem is that you are not sorting the vocabulary by its counts, but by the unique IDs that the count vectorizer assigns to each word.
count_vectorizer.vocabulary_.items()
This does not contain the count of each feature; the count vectorizer does not store per-feature counts at all.
That is why the plot shows the rarest/misspelled words from your corpus: those words get the largest unique IDs (feature indices are assigned in sorted token order, and non-ASCII tokens sort after ASCII ones, so they end up last with the highest IDs). The way to get actual word counts is to transform the text data and then sum the count of each word over all documents.
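To see this concretely, here is a minimal sketch with a made-up two-document corpus: the values in vocabulary_ are column indices, so sorting items() by value ranks words by their position in the sorted vocabulary, never by frequency.
from sklearn.feature_extraction.text import CountVectorizer
# toy corpus, purely for illustration
cv = CountVectorizer()
cv.fit(["apple apple apple", "zebra apple"])
print(cv.vocabulary_)  # {'apple': 0, 'zebra': 1} -- column indices, not counts
# 'apple' occurs four times and 'zebra' once, yet sorting vocabulary_.items()
# by value would put 'zebra' on top, because its index (1) is larger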
By default the vectorizer (Count/TfidfVectorizer) strips punctuation through its token pattern, and you can also pass it a stop-word list to remove. Your code can be reduced as follows.
import pandas as pd
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer
corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document ?',
]
sw = stopwords.words('english')
count_vectorizer = CountVectorizer(stop_words=sw)
X = count_vectorizer.fit_transform(corpus)
# sum each word's count over all documents
# (on scikit-learn >= 1.0, use get_feature_names_out() instead)
vocab = pd.Series(X.toarray().sum(axis=0), index=count_vectorizer.get_feature_names())
vocab.sort_values(ascending=False).plot.bar(figsize=(5, 5), xlim=(0, 7))
Plug your own text-data column in place of corpus. The snippet above outputs a bar plot of the word counts, now ordered by actual frequency.
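Applied to the question's data, a minimal sketch (assuming data and sw are already defined as in the question, and pandas and CountVectorizer are imported) could look like this:
count_vectorizer = CountVectorizer(stop_words=sw)
X = count_vectorizer.fit_transform(data['text'])
# sum each word's count over all documents
word_counts = pd.Series(X.toarray().sum(axis=0), index=count_vectorizer.get_feature_names())
# plot the 20 most frequent words, now sorted by real counts
word_counts.sort_values(ascending=False).head(20).plot(kind='barh', figsize=(5, 10))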