Out of memory when running wordcloud

Date: 2020-08-24 08:53:54

Tags: python pandas word-cloud

I want to create a word cloud from a dataset on Kaggle. However, I'm running into a problem with WordCloud: it fails with the error that there is not enough memory to compute the word cloud.

The code I have:

# Loading the data
import pandas as pd
import matplotlib.pyplot as plt

filename = "../input/us-accidents/US_Accidents_June20.csv"
df = pd.read_csv(filename)

# Import the package and its set of stopwords
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator

# Create stopword list
stopwords = set(STOPWORDS) 
stopwords.update(["due",'accident'])
# Combine all descriptions into one big text
text = ' '.join(df['Description'].astype(str))
# Create and generate a word cloud image:
wordcloud = WordCloud(
    background_color='white',
    max_font_size=50, 
    max_words=50,
    stopwords=stopwords
).generate(text)

# Display the generated image:
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()

Is it because "text" is too big to process? Is there another way to combine the descriptions into one big text so that word_cloud can handle it?

1 answer:

Answer 0: (score: 0)

Here is what I would do. As you said in the comments, your corpus has 250 million words. That may simply be too much for your machine to process into a word cloud. You should try to reduce the data to the most valuable words. One idea is to keep only the high-frequency words.

P.S. I haven't tested the code, so please forgive any syntax errors.

import pandas as pd
import matplotlib.pyplot as plt

filename = "../input/us-accidents/US_Accidents_June20.csv"
df = pd.read_csv(filename)

# Import the package and its set of stopwords
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator

# Create stopword list
stopwords = set(STOPWORDS) 
stopwords.update(["due",'accident'])

# Flatten all descriptions into a single list of lowercase words
from collections import Counter
tokens = ' '.join(df['Description'].astype(str)).lower().split()

# Count each word's frequency, skipping stopwords
word_counts = Counter(w for w in tokens if w not in stopwords)

# Keep the top X words with the highest frequency
most_occur = dict(word_counts.most_common(1000))

# Create and generate a word cloud image from the frequency counts
# (generate_from_frequencies keeps the relative word sizes, which
# would be lost by re-joining the top words into a plain string):
wordcloud = WordCloud(
    background_color='white',
    max_font_size=50,
    max_words=50,
).generate_from_frequencies(most_occur)

# Display the generated image:
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()
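If even loading the full CSV is the bottleneck, the frequency count can also be built without ever holding the whole file in memory. This is a sketch I have not run against the actual Kaggle file; `count_words_chunked` is a helper name I've made up for illustration:

```python
from collections import Counter

import pandas as pd


def count_words_chunked(filename, chunksize=100_000):
    """Count word frequencies from the Description column without
    loading the whole CSV into memory at once."""
    word_counts = Counter()
    # usecols skips parsing every other column; chunksize streams
    # the file in pieces instead of materializing one huge DataFrame
    for chunk in pd.read_csv(filename, usecols=['Description'],
                             chunksize=chunksize):
        for desc in chunk['Description'].dropna():
            word_counts.update(desc.lower().split())
    return word_counts
```

The resulting counter can then feed the same pipeline as above, e.g. `WordCloud(...).generate_from_frequencies(dict(word_counts.most_common(1000)))`.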