MemoryError when using a TF-IDF array

Asked: 2018-12-08 05:52:45

Tags: python-3.x out-of-memory

I am using TfidfVectorizer to vectorize my data. Here is my code:

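import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage
from sklearn.feature_extraction.text import TfidfVectorizer

# read the text file line by line and strip surrounding whitespace
# (address is the path to the input file, defined elsewhere)
with open(address, 'r') as f:
    corpus = [line.strip() for line in f]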

# TF-IDF
tfidf_vec = TfidfVectorizer()
tfidf_matrix = tfidf_vec.fit_transform(corpus)
X = tfidf_matrix.toarray()

# generate the linkage matrix
Z = linkage(X, 'average')

# set cut-off to 1
max_d = 1  # max_d as in max_distance

fancy_dendrogram(
    Z,
    truncate_mode='lastp',
    p=50,
    leaf_rotation=90,
    leaf_font_size=10,
    show_contracted=True,
    annotate_above=1,  # useful in small plots so annotations don't overlap
    max_d=max_d,
)
plt.show()

But I get this error:

Traceback (most recent call last):
  File "C:/Users/Wesley/PycharmProjects/ProblemDetectyion/demo/data_process.py", line 108, in <module>
    tf_idf(source_file_addr)
  File "C:/Users/Wesley/PycharmProjects/ProblemDetectyion/demo/data_process.py", line 80, in tf_idf
    Z = linkage(X, 'average')
  File "C:\Users\Wesley\PycharmProjects\ProblemDetectyion\venv\lib\site-packages\scipy\cluster\hierarchy.py", line 708, in linkage
    y = distance.pdist(y, metric)
  File "C:\Users\Wesley\PycharmProjects\ProblemDetectyion\venv\lib\site-packages\scipy\spatial\distance.py", line 1877, in pdist
    dm = np.empty((m * (m - 1)) // 2, dtype=np.double)
MemoryError
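
The failing line allocates one double for every pair of rows, so its size can be estimated by hand. The sketch below assumes one row per line of the 100000-line file mentioned in the question; the variable names are only illustrative.

```python
# Rough size of the condensed distance matrix that pdist tries to allocate.
# Assumption: m = 100000 rows, one per line of the input file.
m = 100_000

# pdist stores one distance per unordered pair of rows.
n_entries = m * (m - 1) // 2

# np.double takes 8 bytes per value.
bytes_needed = n_entries * 8
gib_needed = bytes_needed / 2**30

print(f"{n_entries:,} distances, about {gib_needed:.1f} GiB")
```

Under that assumption the allocation alone is several times larger than 8 GB of RAM, before counting the dense array X itself.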

If I remove toarray(), I get:

Traceback (most recent call last):
  File "C:/Users/Wesley/PycharmProjects/ProblemDetectyion/demo/data_process.py", line 107, in <module>
    tf_idf(source_file_addr)
  File "C:/Users/Wesley/PycharmProjects/ProblemDetectyion/demo/data_process.py", line 79, in tf_idf
    Z = linkage(X, 'average')
  File "C:\Users\Wesley\PycharmProjects\ProblemDetectyion\venv\lib\site-packages\scipy\cluster\hierarchy.py", line 694, in linkage
    y = _convert_to_double(np.asarray(y, order='c'))
  File "C:\Users\Wesley\PycharmProjects\ProblemDetectyion\venv\lib\site-packages\scipy\cluster\hierarchy.py", line 1216, in _convert_to_double
    X = X.astype(np.double)
ValueError: setting an array element with a sequence.
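
A plausible reading of this second traceback is that np.asarray does not unpack a SciPy sparse matrix into numbers but wraps it as a single Python object, which then cannot be cast to double. The sketch below only illustrates that wrapping on a tiny made-up matrix; sparse_X stands in for the output of fit_transform and is not taken from the original script.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Tiny stand-in for the sparse TF-IDF matrix (values are made up).
sparse_X = csr_matrix([[0.0, 1.0], [0.5, 0.0]])

# np.asarray wraps the sparse matrix as one object instead of a 2-D float array.
wrapped = np.asarray(sparse_X)
print(wrapped.dtype, wrapped.shape)

# toarray() is what actually produces a dense 2-D float array.
dense = sparse_X.toarray()
print(dense.dtype, dense.shape)
```

If this reading is right, linkage needs a dense array or a precomputed condensed distance matrix, so skipping toarray() trades the MemoryError for a type error rather than saving memory.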

I also tried using:

HashingVectorizer(n_features=200, norm=None)

but then I get this error:

ValueError: setting an array element with a sequence.

I only have 8 GB of RAM, and my text file contains 100000 lines.
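
To put those two numbers side by side, here is a small sketch that estimates how many rows a pairwise distance matrix could cover within a given memory budget. The helper max_rows_for_budget is hypothetical, and the estimate is optimistic: it ignores the memory taken by X, the interpreter, and the operating system.

```python
import math


def max_rows_for_budget(budget_bytes: int, itemsize: int = 8) -> int:
    """Largest m whose m*(m-1)//2 pairwise distances fit in budget_bytes.

    Hypothetical helper: assumes the whole budget is free for the
    condensed distance matrix and nothing else.
    """
    n = budget_bytes // itemsize  # number of values that fit
    # Solve m*(m-1)/2 <= n for the largest integer m.
    return (1 + math.isqrt(1 + 8 * n)) // 2


ram_bytes = 8 * 2**30  # 8 GiB, the RAM size given in the question
print(max_rows_for_budget(ram_bytes))
```

Even with the whole 8 GiB available, this bound is less than half of the 100000 lines in the file.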

Can anyone help me solve this problem? Thanks a lot.

0 answers:

No answers yet.