Out of memory when running wordcloud

Date: 2020-08-24 08:53:54

Tags: python pandas word-cloud

I want to create a word cloud from a dataset on Kaggle. However, I'm running into a problem with WordCloud: it fails with the error that there is not enough memory to compute the word cloud.

The code I have:

# Loading the data
import pandas as pd
import matplotlib.pyplot as plt

filename = "../input/us-accidents/US_Accidents_June20.csv"
df = pd.read_csv(filename)

# Import the package and its set of stopwords
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator

# Create stopword list
stopwords = set(STOPWORDS) 
stopwords.update(["due",'accident'])
# Combine all descriptions into one big text
text = ' '.join(df['Description'].astype(str))
# Create and generate a word cloud image:
wordcloud = WordCloud(
    background_color='white',
    max_font_size=50, 
    max_words=50,
    stopwords=stopwords
).generate(text)

# Display the generated image:
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()

Is it because "text" is too big to process? Is there another way to combine the descriptions into one big text so that word_cloud can handle it?

1 answer:

Answer 0: (score: 0)

Here is what I would do. As you said in the comments, your corpus has 250 million words. That may simply be too much for your machine to process into a word cloud. You should try to reduce the data to the most valuable words. One idea is to keep only the high-frequency words.

P.S. I haven't tested the code, so please forgive any syntax errors.

import pandas as pd
import matplotlib.pyplot as plt

filename = "../input/us-accidents/US_Accidents_June20.csv"
df = pd.read_csv(filename)

# Import the package and its set of stopwords
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator

# Create stopword list
stopwords = set(STOPWORDS) 
stopwords.update(["due",'accident'])

# Flatten all descriptions into a single list of lowercase words
from collections import Counter
tokens = ' '.join(df['Description'].astype(str)).lower().split()

# Count each word's frequency, skipping stopwords
word_counts = Counter(w for w in tokens if w not in stopwords)

# Keep the top X words with the highest frequency
most_occur = dict(word_counts.most_common(1000))

# Create and generate a word cloud image from the frequency counts
# (generate_from_frequencies keeps the relative word sizes, which
# would be lost by re-joining the top words into a plain string):
wordcloud = WordCloud(
    background_color='white',
    max_font_size=50,
    max_words=50,
).generate_from_frequencies(most_occur)

# Display the generated image:
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()
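If even loading the full CSV is the bottleneck, the frequency count can also be built without ever holding the whole file in memory. This is a sketch I have not run against the actual Kaggle file; `count_words_chunked` is a helper name I've made up for illustration:

```python
from collections import Counter

import pandas as pd


def count_words_chunked(filename, chunksize=100_000):
    """Count word frequencies from the Description column without
    loading the whole CSV into memory at once."""
    word_counts = Counter()
    # usecols skips parsing every other column; chunksize streams
    # the file in pieces instead of materializing one huge DataFrame
    for chunk in pd.read_csv(filename, usecols=['Description'],
                             chunksize=chunksize):
        for desc in chunk['Description'].dropna():
            word_counts.update(desc.lower().split())
    return word_counts
```

The resulting counter can then feed the same pipeline as above, e.g. `WordCloud(...).generate_from_frequencies(dict(word_counts.most_common(1000)))`.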